Data Eng Weekly

Hadoop Weekly Issue #35

15 September 2013

There's a ton of great content in this week's issue riding the wave of the recent Hadoop 2.1.0-beta release. In addition, this issue contains a number of walkthroughs and tutorials -- there should be plenty to keep you busy if you're just getting started or looking to try out new things.


Gwen Shapira from Cloudera presented at Surgecon on ETL with Hadoop. The talk starts with the motivation for moving ETL to Hadoop (things like cost and flexibility), then moves on to best practices and tools for ETL on Hadoop. It covers low-level details like file system layout as well as the available tools (e.g. Sqoop and Oozie) and how to scale them.

Parquet is a relatively new columnar file format for large datasets. The format, which is partially based upon the format discussed in Google's Dremel paper, uses some interesting tricks to store nulls and repeated fields. A recent post walks through the low-level details of how the format serializes data efficiently. Complete with examples and illustrations, the post makes a complicated topic easily accessible.
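To make the null-storage trick concrete, here's a minimal Python sketch (not Parquet's actual implementation) of the Dremel-style idea for a single optional field: each slot gets a small "definition level" recording whether the value is present, and nulls never appear in the value stream at all.

```python
def encode_optional(values):
    """Encode a column of an optional (nullable) field, Dremel-style.

    Returns (definition_levels, stored_values): one definition level per
    slot (1 = present, 0 = null, for a field with a single optional level),
    plus only the non-null values -- a null costs a level, not a value.
    """
    definition_levels = [0 if v is None else 1 for v in values]
    stored_values = [v for v in values if v is not None]
    return definition_levels, stored_values


def decode_optional(definition_levels, stored_values):
    """Reverse the encoding: re-insert a null wherever the level is 0."""
    it = iter(stored_values)
    return [next(it) if level == 1 else None for level in definition_levels]
```

For example, `encode_optional([1, None, 2])` yields levels `[1, 0, 1]` and values `[1, 2]`. Because the levels are tiny integers, they compress extremely well; the real format generalizes this with repetition levels for nested and repeated fields, as the post explains.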

This post covers writing and testing simple (supporting only Java primitive types) UDFs and generic UDFs for Apache Hive. Generic UDFs, which are required if you want to support arguments of complex types (e.g. List, Map), have a complicated API that takes some time to become familiar with. That also makes them somewhat complex to test, but the post covers each of these topics in detail, with accompanying code available on GitHub.

The Hortonworks blog has some more details on Apache Tez, the framework which aims to generalize the MapReduce paradigm. The post covers the motivation (i.e. better performance and flexibility by offering more primitives vs. MapReduce) as well as some details on the overall design. In particular, you'll learn about the logical plan modeling (akin to a database query plan), the pluggability of input/output formats (including intermediate outputs), dynamic reconfiguration (changing the query plan at runtime), and the plans for resource management.

As a follow-up to last week's release of Spring for Hadoop 2.0 M1, the Spring blog has an intro to the new YARN features in the release. The post introduces the YarnClient, YarnAppmaster, and YarnContainer interfaces and their roles in a YARN application lifecycle. There's some simple example code, a few configuration examples, and instructions for submitting code via the command-line.

Revolution R Enterprise 7 will support Hadoop as a backend for RevoScaleR, and this post gives an overview of what that interaction is going to look like. The quick summary is that there's not a whole lot to change, other than configuring connections to HDFS and the MapReduce cluster. Once that's done, the summary and logistic regression functions just work -- the latter spawning a number of MapReduce jobs.

Lipstick, the Pig workflow visualization software, was recently integrated into the Mortar product. As part of this work, the Mortar team wrote Chef recipes to configure a Lipstick service (and all dependencies, such as MySQL). Combined with Vagrant, these tools allow for a single-command startup of a VM running Lipstick. A post on the Mortar blog covers this system as well as some of the changes/features that Mortar worked on in order to bring Lipstick to their platform.

The Cloudera blog has an overview of testing HBase client code. The post covers vanilla unit testing, mock testing with Mockito, integration testing with HBaseTestingUtility, and testing of HBase MapReduce code with MRUnit. There's a lot of good stuff even for someone who's been using HBase for a while, and all of the code is available on GitHub.

Given the major framework and architecture changes in Hadoop 2.x due to YARN, there is a whole new set of configuration options to consider. A recent post considers an example hardware setup and walks through all of the YARN and MapReduce 2 config options. Rather than configuring slots per node, all configuration is done via slices of memory, and there are some subtle points, such as configuring the Java heap size so that it doesn't outgrow the container's memory allocation.
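As an illustration of the memory-slice approach (the values here are hypothetical -- size them to your own hardware), a node reserving 40 GB for containers might be configured roughly like this, with each task's Java heap kept below its container allocation so YARN doesn't kill the container:

```xml
<!-- yarn-site.xml: memory the NodeManager may hand out to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>

<!-- mapred-site.xml: per-task container sizes and JVM heaps -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <!-- heap stays below the 4096 MB container allocation -->
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6144m</value>
</property>
```

The gap between the container size and the `-Xmx` heap leaves headroom for JVM overhead (permgen, stacks, native buffers), which is one of the subtle points the post walks through.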

Before executing a Pig script, the Pig runtime converts the script to a LogicalPlan that describes the data flow. This data isn't readily exposed via an API (Pig 0.12 will include a fix), but it can be accessed with a bit of work. This post includes an example, written in JRuby, of subclassing the PigServer class to make the LogicalPlan available programmatically.


SAP made a number of announcements at TechCrunch Disrupt, including an expanded partnership with Intel and Hortonworks to resell and support the Intel and Hortonworks Distributions. In addition, they announced a new product aimed at manufacturers, the creation of a new Data Science organization at SAP, and the "Big Data Geek Challenge."

Hadoop Summit Europe 2014 is being held in April in Amsterdam, and the call for abstracts is open through October 31st. There are two new tracks for this year's conference -- a committer track and a new "unconference" track. The other tracks are "The Future of Apache Hadoop," "Data Science & Hadoop," "Hadoop Deployment & Operations," and "Hadoop for Business Applications and Development."


Cloudera Manager 4.7.1 was released as a hotfix for a bug that prevented Hue configurations from being opened.


Curated by Mortar Data

Monday, September 16
Hadoop Development Kits: Weave and Cloudera CDK (San Jose, CA)

Tuesday, September 17
Real-time stream processing platform on Hadoop (Santa Clara, CA)

Twitter, Storm, Clojure... (London, UK)

HBase - A Technical Introduction (Toronto)

Wednesday, September 18
Hadoop Based Machine Learning (Austin, TX)

Data infrastructure at Spotify (Berlin)

Thursday, September 19
Hadoop Ops (Sunnyvale, CA)