Data Eng Weekly

Issue #1

20 January 2013

Welcome to the inaugural (good timing for those of us in the US) Hadoop Weekly! My plan is to compile links to articles and stories about Hadoop (and its offshoots) into this weekly newsletter. Each link has a brief summary so you know what to expect before clicking through. If you have ideas or feedback, feel free to email me at info@ this

There's been a lot of really interesting information shared in the Hadoop community over the past week. Let's get to it!

Tips, tricks and how-tos

A large list of MapReduce frameworks and links to documentation for each.

A quick, self-contained example of using awk with Hadoop streaming over data in HDFS.
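For readers who haven't used Hadoop streaming before, here is a rough sketch of the contract it relies on: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value records, and the framework sorts by key in between. This is not the awk code from the linked post; it is a Python word count, and the function names (mapper, reducer, run_locally) are made up for illustration. The sort inside run_locally stands in for Hadoop's shuffle so the whole thing runs without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key. Input must already be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key (the shuffle), reduce."""
    shuffled = sorted(mapper(lines), key=lambda record: record.split("\t", 1)[0])
    return list(reducer(shuffled))


if __name__ == "__main__":
    for record in run_locally(["Hadoop streaming", "hadoop awk"]):
        print(record)
```

In a real streaming job the two functions would live in separate scripts that read sys.stdin, and those scripts would be handed to the hadoop-streaming jar through its -mapper and -reducer options.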

The folks at Mortardata have put together a great list of books for improving your Hadoop and big data education.

Allen Wittenauer of LinkedIn (and previously Yahoo) shares tips for avoiding common pitfalls when administering a Hadoop cluster.

An overview of how HBase is designed to avoid consistency issues with concurrent reads and writes. Quick read, with great diagrams. Easy to understand if you're familiar with the HBase write path.

An interesting introduction to MapReduce using the word game Boggle. Includes code on GitHub for the BoggleWordMapper discussed in the article.

If you've ever been confused about the versioning of Hadoop releases, this post contains a concise explanation of all the recent versions.

A great list of lessons learned and common pitfalls you encounter when building distributed systems. Each point (e.g., "Writing cached data back to storage is bad.") comes with a brief explanation of the key gotchas.

Oozie, the open-source workflow manager for Hadoop, is powerful but hard to tame. This post details all of the parts (Java, properties files, and XML) that are needed to get your job into Oozie on a recurring basis.

Peeks under the hood

Netflix uses S3 and Elastic MapReduce (EMR) for their Hadoop platform. They've implemented a scheduling meta-framework called Genie that allocates jobs to one of many EMR clusters and does some interesting things with Amazon Web Services' auto scaling groups.

Details on Facebook's MySQL backup strategy, which includes archiving to HDFS.

The video for Dmitriy Ryaboy's talk from Strange Loop 2012 about Analytics at Twitter was posted on InfoQ. He talks about lessons learned scaling analytics infrastructure to multiple teams and many users, and he makes a number of recommendations. He notes that the core tech stack is the same even though their scale has changed drastically between 2010 and 2012, but a number of processes have changed since then.

Details on how the Obama campaign used Hadoop and Vertica in their data platform.

Hadoop is a relative newcomer to scientific computing, where supercomputers have been used for years. Evert Lamberts talks about how SURFsara, the Dutch national center for academic IT, uses Hadoop to complement their other supercomputing systems.

LinkedIn has put a lot of thought into tuning their Hadoop cluster. This presentation speaks to their node hardware configuration and application optimization.

Releases

As you may know, Hive 0.10 was recently released. This article has a good overview of the new and exciting features in Hive 0.10. It includes links to the relevant JIRAs for further reading.

Hortonworks has released version 1.2 of HDP, their Hadoop distribution. It includes a new version of the Ambari cluster management software, as well as improvements to Hive and HBase.

Scoobi, a high-level MapReduce framework written in Scala, was updated with an in-memory test mode, improved support for Avro, improved support for libjars, and many more improvements.

Apache BookKeeper, a distributed logging and pub-sub system, released version 4.2.0. This release "features a new autorecovery mechanism for BookKeeper ledgers, read-only bookie mode, improved scalability for Hedwig, message filtering in Hedwig and much more."

Scalding, a Scala MapReduce framework that wraps Cascading, was updated with initial support for skew joins, bug fixes, and support for Cascading traps.

Version 0.1.6 of hbase-jruby, a JRuby interface to HBase, was released with support for both HBase 0.94 and 0.92.

Cloudera announced the availability of Impala 0.4. Impala is Cloudera's low-latency query engine built atop HDFS. This is the first release to support RHEL/CentOS 5.7 (the first supported OS other than RHEL/CentOS 6.2). See the second link for an overview of the Impala architecture.

Industry news

Skytap has added support for Cloudera Manager within their cloud service. They support both fully-cloud and hybrid deploys.

MapR has hired a VP for EMEA and is opening a European headquarters in London.

Recap of a round table featuring execs from Oracle, Cloudera, and Facebook discussing Hadoop.

Zettaset develops tools for enterprises to manage secure deployments of Hadoop. They just raised $10M in Series B funding.

Datameer has put together an interesting visualization that shows the partnerships between companies in the Hadoop ecosystem.