Data Eng Weekly

Hadoop Weekly Issue #11

31 March 2013

This week's newsletter is a little shorter than recent ones -- which is to be expected post-Hadoop Summit EU. The lack of quantity is made up for by the quality of this week's articles, though, which touch many different parts of the Hadoop ecosystem.


Apache Hadoop's Distributed File System and MapReduce were originally based upon research papers written by Google. Google owns a number of patents in these spaces, including 10 related to MapReduce. This week, they pledged "not to sue any user, distributor or developer of open-source software on specified patents, unless first attacked."

MapR and Canonical announced that MapR's M3 Hadoop distribution will be available as an integrated offering for Ubuntu 12.04 LTS and 12.10 via the Ubuntu Partner Portal.

MapR also made some news this week by open-sourcing their forks of a number of projects in the Hadoop stack (but not MapR FS). The list includes Sqoop, Pig, Mahout, Hive, HBase, Oozie, OpenTSDB, Scribe, and more. Some of these projects haven't been updated in a year (Scribe), but the majority were updated in the past month.

Netflix has an in-depth blog post about building recommendation systems. They discuss the three types of systems that they use (offline, nearline, and online) as well as the trade-offs and design decisions you have to consider for each. The post contains a number of detailed system diagrams with thorough explanations of the systems that they use, from Hadoop to Cassandra to MySQL.

GIS Tools is a set of open-source tools for spatial data analytics on Hadoop from the folks at Esri. It includes Java libraries for integration into MapReduce as well as lots of Hive UDFs for spatial and geometric processing.

HUE is an open-source web interface to Hadoop, which includes a number of applications such as an HDFS browser and a web-based interface into Hive called Beeswax. This tutorial describes loading tweet data into HDFS, creating a table for that data in Hive, and running an analysis of the data using Hive.

Microsoft's HDInsight platform includes a developer edition for local development and testing. In this post, the author describes the process of setting up a local development environment on Windows, writing a MapReduce job in C#, running the job on the cluster, and loading the data into Hive.

Matt Walker presented on Etsy's data stack at Data Day Texas. He covers the evolution of their offline infrastructure over several years, systems that they've built, and the tools & frameworks that they use for offline analytics.

The Call for Speakers for HBaseCon 2013 ends tomorrow, 4/1. Cloudera interviewed some members of the HBaseCon Program Committee about HBase and HBaseCon.

Debugging MapReduce is usually a matter of adding counters or System.out.println statements to determine what is happening. If you're developing on a local box, though, a simpler solution is to attach a debugger. This walkthrough has all you need to know about debugging a MapReduce job in IntelliJ.
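To make the counter approach concrete, here is a minimal sketch in plain Python. It does not use Hadoop's API: the function names (map_record, reduce_counts, run_local) and the counter names are hypothetical, chosen only for illustration. The idea it shows is the same one, though: count interesting events in the mapper rather than printing them, then inspect the totals once the job finishes.

```python
from collections import Counter, defaultdict


def map_record(line, counters):
    """Emit (word, 1) pairs and count what was seen instead of printing it."""
    counters["records_seen"] += 1
    if not line.strip():
        # A counter makes bad input visible without flooding the logs.
        counters["empty_records"] += 1
        return []
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Sum the values for each key, as a word-count reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_local(lines):
    """Run the map and reduce steps in a single local process."""
    counters = Counter()
    pairs = []
    for line in lines:
        pairs.extend(map_record(line, counters))
    return reduce_counts(pairs), counters


if __name__ == "__main__":
    totals, counters = run_local(["hadoop hive", "", "Hadoop pig"])
    print(totals)
    print(dict(counters))
```

Because everything in this sketch runs in one local process, a breakpoint set inside map_record is hit directly, which is the same convenience the walkthrough gets from running a job locally under a debugger.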


Apache Oozie 3.3.2 was released with a number of improvements and bug fixes. Among the highlights are uberjar support for MapReduce actions, improvements to the Oozie web UI, and improvements to the command line for coordinators.