Data Eng Weekly

Issue #3

3 February 2013

I'm trying something a little different this week: filtering out some of the lower-quality articles while giving more of an overview and commentary on many of the links I'm posting. Please send feedback on this to info at or tweet me @joecrobak. Also, I find a lot of these articles on Twitter, so help me find them by tweeting with #hadoop if you see something interesting to share with the community.

With that said, this issue is definitely more technical than previous ones. Hope you enjoy!


Munich OpenHUG. When: Friday, February 22nd, 2013, from 4PM to 8PM.

YARN contributors MeetUp. When: Friday, February 22nd, 2013. Where: Hortonworks, Palo Alto, CA. Also online via WebEx.

Technical news and blogs

Mortar Data provides a Hadoop (Pig)-as-a-service platform on Amazon AWS. They have a lot of cool batteries-included functionality, such as integration with Pig's illustrate command. If you're just getting started with Hadoop or are running in EMR, Mortar seems like a great framework to improve productivity.

LinkedIn has written up a description of how they recommend "related searches" (e.g. a search for "Hadoop" brings up related terms of "Mapreduce" and "big data" to name a few). The system includes offline analysis of search queries, results, and clicks via Java MapReduce and Pig.

As mentioned last week, the Software Developer's Journal is doing an edition on Hadoop. There seem to be quite a few interesting articles, such as an introduction to writing Hive UDFs and SerDes, a case study on coordination with ZooKeeper, and an article about the future of Hadoop development.

Henry Robinson (@HenryR) works on Cloudera Impala (a low-latency SQL-on-HDFS solution). He's written about the goals and ideas behind columnar storage and why it's important for optimizing data access over large, sparse datasets.
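
To make the columnar idea concrete, here is a minimal Python sketch. It is not Impala's storage format; the records, field names, and helper functions are all made up for illustration. It pivots sparse row-oriented records into per-field columns, so that an aggregate over one field only touches that field's values and missing values take up no space.

```python
# Hypothetical sparse records: most fields are missing from most rows.
rows = [
    {"user": "a", "country": "DE", "clicks": 3},
    {"user": "b", "clicks": 7},
    {"user": "c", "country": "US"},
    {"user": "d", "clicks": 1},
]


def to_columns(records):
    """Pivot row-oriented records into one (row_index, value) list per field.

    Missing values are simply not stored, which is one reason a columnar
    layout suits sparse data.
    """
    columns = {}
    for index, record in enumerate(records):
        for field, value in record.items():
            columns.setdefault(field, []).append((index, value))
    return columns


def sum_column(columns, field):
    """Aggregate one field by scanning only that field's column."""
    return sum(value for _, value in columns.get(field, []))


columns = to_columns(rows)
total_clicks = sum_column(columns, "clicks")
print(total_clicks)
```

A row-oriented scan would have to read every field of every record to compute the same sum; here only the "clicks" column is read.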

Phoenix HBase is an SQL implementation for querying and inserting data into HBase from the folks at Salesforce. Their initial release is JDBC compatible and includes support for creating tables (or pointing at an existing base table), inserting data, querying data, and basic ALTER TABLE statements. The project doesn't yet support joins or multiple indexes, but these (and many other features) are on the roadmap. It looks like they're making heavy use of HBase coprocessors. With all the companies and products entering the low-latency-querying-on-HDFS space, this should be an interesting project to follow, since it makes use of HBase.

GraphChi is a framework from folks at Carnegie Mellon for processing large graphs. It operates on a single machine, but the graph doesn't have to fit into RAM -- it makes efficient use of multiple disks. This week, support for processing graphs with GraphChi via Apache Pig was announced. This is accomplished by pushing all of the logic into a Pig Load Function, which does the computation on "load." It requires Pig 0.10, and their examples include PageRank and matrix factorization.

HBase serves data using RegionServers which operate on regions (contiguous key-ranges) of data. In order to balance load, HBase splits and merges these regions. The Hortonworks blog gives a great overview of the different region pre-splitting and on-demand splitting strategies as well as how these are implemented.
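
As a rough illustration of regions as contiguous key ranges, here is a small Python sketch. It is not HBase's API or its actual split policy; the function names are hypothetical, and it assumes row keys begin with a uniformly distributed two-digit hex prefix. It shows pre-splitting a table into evenly spaced ranges, looking up which region holds a key, and adding a boundary for an on-demand split.

```python
import bisect


def presplit_points(num_regions):
    """Evenly spaced two-hex-digit split points for keys with a hashed prefix.

    Assumes row keys start with a uniformly distributed hex prefix 00..ff.
    """
    step = 256 // num_regions
    return ["%02x" % (i * step) for i in range(1, num_regions)]


def region_for(split_points, row_key):
    """Index of the region whose contiguous key range holds row_key.

    Region i covers [split_points[i-1], split_points[i]); the first region
    starts at the empty key and the last one is unbounded above.
    """
    return bisect.bisect_right(split_points, row_key)


def split_region(split_points, new_split_key):
    """On-demand split: add a boundary, turning one region into two."""
    points = list(split_points)
    bisect.insort(points, new_split_key)
    return points


points = presplit_points(4)            # three boundaries, four regions
hot_key_region = region_for(points, "9a-user2")
points_after_split = split_region(points, "a0")
```

Because keys are compared as sorted strings, a lookup is just a binary search over the boundaries, and a split only changes the boundaries of the one region being divided.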

Almost everyone begins working with text data in MapReduce -- e.g. the classic introductory MapReduce problem is WordCount, and most folks' first productionized datasets are server logs that are imported to HDFS as is. But working with text (in particular delimited data) has lots of drawbacks -- from performance implications to data sizes to working with custom parsing functions to handle user-supplied data. Eric Sammer posted a great overview of these problems and why you should switch to a binary data serialization format for use in MapReduce.
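
Here is a small Python sketch of one of the delimited-text pitfalls mentioned above. The length-prefixed encoding is a hand-rolled stand-in, not any particular serialization library or the format recommended in the post, and the sample record is invented. A user-supplied field that contains the delimiter corrupts the naive text record, while the binary encoding round-trips it unchanged.

```python
import struct


def encode_delimited(fields):
    """Naive text encoding: join fields with tabs, no escaping."""
    return "\t".join(fields)


def decode_delimited(line):
    """Split a tab-delimited line back into fields."""
    return line.split("\t")


def encode_binary(fields):
    """Length-prefixed encoding: a 4-byte big-endian length before each field."""
    out = bytearray()
    for field in fields:
        data = field.encode("utf-8")
        out += struct.pack(">I", len(data))
        out += data
    return bytes(out)


def decode_binary(blob):
    """Read back fields by following the length prefixes."""
    fields = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        fields.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return fields


# Hypothetical log record whose second field contains a literal tab.
record = ["10.0.0.1", "GET /search?q=a\tb", "200"]
text_round_trip = decode_delimited(encode_delimited(record))
binary_round_trip = decode_binary(encode_binary(record))

print(len(record), len(text_round_trip), len(binary_round_trip))
```

The text path comes back with an extra field because the embedded tab is indistinguishable from a separator; a real binary format also adds schemas and typed fields on top of this basic framing.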

The development and testing cycle with MapReduce, like that of many distributed systems, is long and inefficient. Not only does it take a long time to build and run a MapReduce job, but it can be a pain to debug problems when they arise. I'm a strong advocate for unit and integration testing of MapReduce jobs up front, just for this reason. MRUnit is a valuable tool in this, because it's a true unit testing framework (your mapper/reducer gets ONE key/value) that obviates much of the boilerplate that would otherwise be necessary. If you haven't used MRUnit, this is a great tutorial.
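
MRUnit itself is a Java library, so the following is only a Python sketch of the same idea, not MRUnit's API: the mapper, reducer, and the tiny run_mapper/run_reducer drivers are all hypothetical. The point is that each test feeds exactly one key/value pair to the function under test and checks its output, with no cluster involved.

```python
def wordcount_map(key, value):
    """Mapper: emit (word, 1) for each whitespace-separated word in one line."""
    for word in value.split():
        yield (word.lower(), 1)


def wordcount_reduce(key, values):
    """Reducer: sum the counts for a single word."""
    yield (key, sum(values))


def run_mapper(mapper, key, value):
    """Hypothetical driver: feed exactly one key/value pair to a mapper."""
    return list(mapper(key, value))


def run_reducer(reducer, key, values):
    """Hypothetical driver: feed one key and its values to a reducer."""
    return list(reducer(key, values))


# One input record in, a list of emitted pairs out.
map_output = run_mapper(wordcount_map, 0, "Hadoop hadoop Pig")
reduce_output = run_reducer(wordcount_reduce, "hadoop", [1, 1])

assert map_output == [("hadoop", 1), ("hadoop", 1), ("pig", 1)]
assert reduce_output == [("hadoop", 2)]
```

Tests like these run in milliseconds, which is what makes them worth writing before submitting a job to a real cluster.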

I missed this one last week, but I think it's worth sharing this week instead. Spotify, the music streaming startup, uses Hadoop MapReduce (mostly via Python streaming) to build datasets for recommendation systems. Traditional solutions to these problems often don't work at scale, and Erik speaks about how they do things at Spotify.

A video from the LA HBase User Group -- "An Introduction to HBase". 1h48m but the presentation starts about 23 minutes in.

Starfish is a system that uses BTrace to collect data from a MapReduce job for tuning and analysis. This post gives an overview of the architecture and a quick tutorial for deploying on your Hadoop cluster.


Cassandra 1.2.1, a maintenance release, was released on Jan 28.

I missed it last week, but Oozie 3.3.1 was released on Jan 25.


A neat visualization of the size of (publicly listed) Hadoop deployments using a word cloud. Yahoo! is the largest by far -- no surprise there!

Hortonworks joins the OpenStack Foundation.

WANdisco and Hortonworks announced a partnership. I'm not familiar with WANdisco, but they seem to offer a high-throughput cross-datacenter replication product.