Data Eng Weekly

Hadoop Weekly Issue #52

12 January 2014

This issue marks a complete year (well, 52 weeks) of Hadoop Weekly. Thanks to all the authors and writers for generating so much content—without it this newsletter couldn’t exist. And I’d also like to thank all the subscribers for spreading the word! It’s pretty awesome that over 1500 people receive this weekly email.

With that said, there’s a great issue below. SQL-on-Hadoop continues to heat up with updates on the Stinger initiative, Presto-as-a-Service from Qubole, and IBM’s BigSQL. There are also some updates on writing YARN applications, and there are a handful of releases and new projects to check out.


Cloudera employee and former Oracle DBA, Gwen Shapira, has written an FAQ for Oracle DBAs thinking about learning Hadoop. The post covers four questions, including some tips for important skills if you don’t have any Hadoop experience but are looking for a Hadoop job.

The Hortonworks blog has an update on the Stinger initiative, which aims to make Hive 100x faster. Work is being wrapped up on phase 3 (out of 3) of the initiative. The most recent improvements include better integration with Tez, usage of the Tez Service, and a vectorized query engine. These improvements greatly increase the speed of Hive and provide interactive querying via the Tez Service.

HDFS needs to know about the network topology in order to place blocks across racks (to survive a rack failure). This post details how it’s implemented via a user-provided script, and it includes examples of how the script is invoked by the namenode in different scenarios.
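To make the idea concrete, here is a minimal sketch of such a topology script, not the one from the post. It assumes the script is registered through the `net.topology.script.file.name` property in Hadoop 2.x, that the namenode passes one or more IP addresses or hostnames as arguments, and that it reads back one rack path per argument. The host-to-rack table and the rack names are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of an HDFS rack-topology script (hypothetical host-to-rack table)."""
import sys

# Hypothetical mapping for illustration only; a real cluster would more
# likely load this from a file that operations keeps up to date.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Fallback for nodes that are not in the table.
DEFAULT_RACK = "/default-rack"


def resolve(nodes):
    """Return one rack path per node, in the same order as the input."""
    return [RACKS.get(node, DEFAULT_RACK) for node in nodes]


if __name__ == "__main__":
    # Several addresses may arrive in a single invocation, so every
    # argument is answered, one rack path per line.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

Running it as `./topology.py 10.0.1.11 10.0.9.99` would print `/dc1/rack1` followed by `/default-rack`. Returning a default rack for unknown hosts keeps a new datanode usable before the table is updated, at the cost of rack-unaware placement for that node.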

Hadoop Committer Steve Loughran has a post on the (quite complex) process of reporting a bug or feature request on the Hadoop project. It covers the etiquette for reporting bugs and new features, a number of relevant tools, the lifecycle of an issue, and more.

Steve Loughran also has a post this week about the pros and cons of the HOYA (HBase/Accumulo on YARN) project’s architecture as a YARN application. The post also covers its usefulness as an example/reference YARN application, in particular compared with the distributed shell reference implementation. The overview is a fair assessment of HOYA, and it ends with the recommendation to use the Twill Apache Incubator project if you’re starting a new YARN application from scratch.

In a nice follow-up to the last post, LinuxInsider has an introduction to the Twill project. It covers the history of Twill (previously Weave) at Continuuity and on GitHub, as well as the problem that Twill is solving. In short, by making a number of assumptions about YARN applications, Twill provides an API that is much simpler than the YARN primitives.

The Hortonworks blog has a post about some upcoming features to support heterogeneous storage devices. The post has a nice introduction to and comparison of the various types of storage (HDD, SSD, RAM), which motivates why you would want several types. It then talks about the architecture and API changes required, which are partially implemented in Hadoop trunk and are targeted for an upcoming 2.x release.

IBM’s BigSQL product supports Apache HBase as a storage backend. The system has a lot of similarities with Apache Phoenix (incubating), including secondary indexes that are updated via HBase coprocessors. This presentation contains much more information about the interface.

The MapR blog explains their stance on the flurry of SQL-on-Hadoop projects: they are supporting several of them. In particular, they support Apache Hive (with the features from phase 2 of the Stinger initiative), Impala, Apache Drill, and Presto (from Facebook).


For folks using Cloudera software, Cloudera is putting out a new newsletter for developers. The newsletter runs monthly.

MapR and Savvis are partnering to offer MapR’s distribution as part of the Savvis big data solution. Savvis offers fully hosted and managed services for enterprises. Cloudera is also a Savvis partner.

Information Age has the story of the hard work (and some luck) that went into creating Cloudera. It’s pretty wild to think that if Lehman Brothers had failed 4 days earlier, Cloudera (and thus the entire Hadoop ecosystem) as we know it might be very different today.


Intel announced an alpha release of version 2.0 of the Intel Graph Builder library. Intel Graph Builder is a library for graph processing that leverages Pig for distributed computation.

Parquet, the columnar storage format for Hadoop, released version 1.3.2. The release resolves 4 issues, including a bug related to enum handling.

Qubole has announced an alpha release of their Presto-as-a-Service offering. Presto is the low-latency big data SQL solution that Facebook open-sourced a few months ago. What makes the Qubole offering unique vs. other SQL-on-Hadoop services (including Amazon’s Redshift) is that it can directly process data stored in S3. It’s also one of the first -as-a-Service offerings in this area.

Oryx-Flume is a new project that integrates Apache Flume with Cloudera Oryx. Flume is a distributed event data transfer and collection service, and Oryx provides large-scale machine learning. The project includes support for JSON events, but the interface is extensible.

The Kiji Project has released the “Dashi” BentoBox, version 1.4.0. The bundle includes new versions of KijiExpress, the Scala DSL for analyzing data in Kiji, and KijiModeling, the modularization framework for model development. KijiExpress hits version 1.0.0, signifying a stable API. Like previous releases, this one supports CDH 4.x.


Curated by Mortar Data

Monday, January 13

Cleveland Big Data and Hadoop User Group (Cleveland, OH)

Get started on Hadoop: Hands-on Interactive Boot Camp (San Jose, CA)

Distributed Online Algorithms for Real-time Scalable Big Data on Elastic Cloud (Mountain View, CA)

Tuesday, January 14

Intro to Hadoop: Hype or Reality - you decide with Kevin Crocker (San Francisco, CA)

'The Future of Data' by Doug Cutting, Founder of Hadoop (Palo Alto, CA)

Wednesday, January 15

January HUG Meetup (Pittsburgh, PA)

Big Real-Time Learning and Applying Deep Learning (Amsterdam, Netherlands)

Vermont Hadoop Meetup (Burlington, VT)

Hadoop Based Machine Learning (Austin, TX)

Thursday, January 16

Stores, Monoids and Dependency Injection - Abstractions for Spark Streaming Jobs (San Francisco, CA)

Tree and Graph Processing on Hadoop (Laurel, MD)

Friday, January 17

Top 10 considerations for running Hadoop in Production (Hyderabad, India)