Data Eng Weekly

Issue #4

10 February 2013

Welcome to the fourth issue of Hadoop Weekly! I was pretty heads down at work this week, but hopefully I've captured all the key news from the week. I heard a few pieces subscribers give me positive feedback about the new format with more commentary and less breadth, so I'm doing that again this week. Feel free to let me know if you have an opinion by tweeting @joecrobak.

Technical has profiled Facebook's data team. The article is a good overview of many of the interesting things that Facebook has done with Hadoop, including Corona (Facebook's next-gen Hadoop scheduler), Prism (transparent cross-datacenter replication of HDFS), and Peregrine (a real-time processing tool). Facebook has traditionally been very good about contributing their improvements back to the open-source community, so it will be really interesting to see what happens with these project. Corona is already open-source, but is based on a old branch of Hadoop and integrating this and other improvements will probably be a big technical challenge.

The Hive PMC has voted to adopt HCatalog as a submodule of HCatalog by overwhelming majority. The HCatalog PPMC also passed the measure. For those that aren't familiar, Hive is a SQL interface to MapReduce and HCatalog is a collection of libraries that make Hive tables available to other frameworks by interfacing with the Hive metastore. With HCatalog, Hive data can be accessed in Pig or MapReduce via the same InputFormats and SerDes. Many HCatalog features were gated by patches to Hive in order to expose previously hive-internal functionality, so integrating the projects should help to speed up development.

Do you have many MapReduce pipelines running over the same datasets? Chances are there's some overlap between computation, although it can be tricky and cumbersome to optimize intermediate computation. Ryan Brush from Cerner talks about using Crunch to build pipelines that reuse intermediate datasets. The article has great illustrations and descriptions of the problem.

Drawn to Scale are the creators of Spire, a SQL-on-Hadoop project that's modeled after Google's F1. They've written a (somewhat biased) but informative summary of the SQL offerings for Hadoop (the mention Hive, Impala, Phoenix, and Spire). The key take-away is that there's a whole spectrum of implementations and there might be a different winner depending on your application, but no one magic bullet that's perfect for every use case.

Yahoo! has many Hadoop clusters running on over 40,000 nodes, making them one of the largest Hadoop deployments. In the past few months, they've adopted YARN and Hadoop 0.23.x. They also talk about their use of other projects in the Hadoop ecosystem like Pig, Oozie, and HCatalog. Its great to see that Hadoop at Yahoo! is still going strong even after a number of their Hadoop-committer employees left during the Hortonworks spinoff.

Joining two tables with MapReduce is both tricky and verbose. Most people tend to use high-level frameworks like Hive or Pig to join datasets, but I still recommend that developers try to implement something difficult like a join in raw MapReduce to gain understanding and respect for the challenge. This post describes doing a join in MapReduce and includes a number of great visualizations to describe what's happening at each stage.

Apache Kafka is a self-described "high-throughput distributed messaging system" that is used by a number of companies to transport data to HDFS. Version 0.8 of Kafka will introduce intra-cluster replication, a much-requested feature. Jun Rao of LinkedIn (Kafka was originally built there), describes the architecture used for this replication and the trade-offs that they made in the design. This is a great read for anyone interested in distributed systems.

Metamarkets posted an interview with Florian Leibert, a data scientist at Airbnb. Airbnb is making heavy use of Amazon Web Services for their analytics infrastructure, including S3 and Hadoop. This interview contains a bunch of interesting information about what things are like at Airbnb and Twitter (where Florian used to work).


Cloudera Released version 0.5 of Impala, their low-latency SQL system that runs on HDFS. This version brings JDBC support and a number of bug fixes. Impala is a really interesting project because it aims to provide low-latency analytics without an extra data load such is required to get data into other systems. Version 0.5 doesn't yet support columnar storage (which is expected to provide massive throughput improvements on certain queries). Before hitting their 1.0 release, Impala will add support for the Trevni columnar storage format.

Stripe has released an 'impala' gem for running Impala queries using Ruby. In addition, 'impala-herd' is a website built atop the impala gem for running queries, saving results, and sharing them. We have a similar tool for Hive internally at Foursquare, and the engineering, product, and BD teams make heavy use of it.

Cloudera has released version 4.1.3 of CDH (their distribution including Apache Hadoop). The release includes fixes for HDFS, MapReduce, HBase, and Oozie.

Apache Hadoop 0.23.6 was released. The 0.23.x version of Hadoop contains many of the same features of Hadoop 2.0.0, with the exception of the Highly Available NameNode. Yahoo! are running version 0.23.x with YARN on many of their clusters. See the Yahoo! blog post in this newsletter for more information about that.

RHadoop is a system for interfacing with Hadoop from within R. Setting up and configuring and RHadoop has a lot of prerequisites. If you just want to get to work, then you might want to try out these automation recipes (using Chef) for bringing up Hadoop, R, and RHadoop. Currently only supports Mac OS X, but plans to support other operating systems, too.


"HBase Developer PowWow down at HortonWorks"
February 19, 2013 @ 2PM
Palo Alto, CA

"Boston Hadoop User Group Meet-Up: Hadoop with Enterprise Resiliency, Interactivity, and SQL - Delivering the Answer to Enterprise Analytics"
February 21 @ 6-8pm
Cambridge, MA

"Apache Hadoop YARN meetup"
Friday, February 22, 2013 @ 2:30PM
Palo Alto, CA

"Distributed, high-throughput, persistent real-time messaging with Apache Kafka"
Madison, WI
February 26, 2013