Data Eng Weekly

Hadoop Weekly Issue #14

21 April 2013

There were a number of exciting announcements and releases this week (e.g. Hadoop on OpenStack, Impala 0.7) as well as some fantastic technical articles and tutorials. It's great to see more technical articles about how folks are doing things with Hadoop -- this week covering Hadoop internals, data formats, and MapReduce-based mobile UI customization. A big thanks to those who share their insights and experiences for making this newsletter possible!


Cloudera has announced the Cloudera Academic Partnership program, with seven universities taking part in the initial program. Cloudera cites the need for Hadoop-related expertise as the main motivation for the program.

Apache Bigtop is a system for testing the components of the Hadoop stack in conjunction with one another. This post highlights why it's becoming popular (particularly with vendors) even though it's not in the spotlight.

We've been seeing a lot of new products focusing on running SQL on HDFS. Most of these products distribute worker nodes alongside the datanodes. Teradata has taken a different approach (they kind of have to since they ship an appliance). This week, in addition to announcing a new set of hardware, they announced SQL-H, which gives Teradata access to data stored on HDFS by using HCatalog to get metadata about the files in HDFS.

Mirantis, Hortonworks, and Red Hat are working on project Savanna to bring Hadoop support to OpenStack (OpenStack is software for managing cloud computing infrastructure). It sounds like they're targeting an initial release for June in time for Hadoop Summit, and that they have some grand plans -- everything from provisioning bare-metal hardware to enabling something like Amazon's Elastic MapReduce.


This is a great overview of the components in the Hadoop stack other than HDFS and MapReduce -- in particular, HBase, Cassandra, Pig, Hive, and Impala. It also discusses a few other SQL-on-Hadoop solutions.

While its architecture is easy to understand, HDFS is a complex piece of software that oftentimes seems to work as if by magic. This article discusses the architecture and starts diving into the software stack -- providing a map for someone trying to navigate the source code.

Tom White, the author of Hadoop: The Definitive Guide, is writing a series of posts for Dr. Dobb's about Hadoop. The first article has an overview of HDFS and MapReduce as well as an introduction to various other systems in the Hadoop stack like Flume, Pig, Hive, and HBase.

LinkedIn is using Hadoop-based algorithms to customize the UI on their mobile apps, where real estate is limited. Their infrastructure includes Kafka for data ingestion, a Hadoop workflow for building recommendations (which they describe in some detail), and Voldemort for serving the data in real-time.

At the Twitter Seattle Open House, Julien Le Dem presented on Parquet, the new columnar storage format that Twitter is building in collaboration with Cloudera. The slides include a great overview of the use-case, the file format, and some initial benchmarks.


Cloudera HUE provides a web interface to interact with Hadoop to upload and browse data as well as run Hive and MapReduce jobs. In this tutorial, you'll load a dataset from the Yelp challenge into Hive, run some SQL queries on it, and then run a Python streaming MapReduce job using mrjob.
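
For readers who haven't seen a streaming-style job before, here is a minimal sketch of the map, sort, and reduce steps in plain Python. It is not the tutorial's code and does not use mrjob; the record fields (business_id, stars) and the helper names are assumptions for illustration, and the sort runs locally as a stand-in for Hadoop's shuffle.

```python
import json
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (business_id, stars) pair for each JSON review record."""
    for line in lines:
        record = json.loads(line)
        # Field names are assumed for this sketch.
        yield record["business_id"], record["stars"]


def reducer(pairs):
    """Average the ratings for each key; expects pairs sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        ratings = [stars for _, stars in group]
        yield key, sum(ratings) / len(ratings)


def run_job(lines):
    # Sorting by key stands in for the shuffle/sort phase between map and reduce.
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


# Hypothetical sample input: one JSON record per line.
sample = [
    '{"business_id": "a", "stars": 4}',
    '{"business_id": "b", "stars": 2}',
    '{"business_id": "a", "stars": 5}',
]
averages = run_job(sample)
```

On a real cluster the mapper and reducer would read from stdin and write to stdout (or be wrapped by a library such as mrjob), with Hadoop handling the sort between them.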

Redis is a key-value store that supports various data structures such as lists, sets, strings, and more. This tutorial covers getting data in and out of Redis from MapReduce, including the code for custom input formats, record readers, and output formats.

A tutorial covering running Apache Mahout on HDInsight (HDInsight is the Hadoop Distribution running on Windows Azure). Covers install, setup, and running a Mahout MapReduce job.


Cloudera Impala 0.7 was released (and a few days later the 0.7.1 release with some critical bug fixes was announced). Version 0.7 has a bunch of new features, including support for the Parquet columnar file format and Avro, plus distributed aggregations and top-n computations. This release supports CDH4.1 and 4.2 as well as a number of different Linux distributions.
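
As a rough illustration of what a distributed top-n computation involves, the sketch below has each worker keep only its own n largest values, then has a coordinator merge those short lists. This is a generic pattern, not Impala's implementation; the function names and sample partitions are made up for the example.

```python
import heapq


def local_top_n(rows, n):
    """Each worker keeps only its n largest values."""
    return heapq.nlargest(n, rows)


def merge_top_n(partials, n):
    """The coordinator merges the small per-worker lists into a global top n."""
    return heapq.nlargest(n, (value for part in partials for value in part))


# Hypothetical data split across three workers.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
top3 = merge_top_n([local_top_n(p, 3) for p in partitions], 3)
```

The appeal of this shape is that only n values per worker cross the network, no matter how many rows each worker scans.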

Apache MRUnit, the MapReduce unit-testing library, reached version 1.0.0. It supports both Hadoop 1 and Hadoop 2.

Last week, Amazon announced support for Elastic MapReduce on their GovCloud service.

UC Berkeley's AMPLab, the same lab that develops Spark, has announced the Tachyon Project. Tachyon is a distributed file system that can cache some datasets in memory, but it checkpoints data to an underlying file system (it currently supports HDFS or a single-node local file system).


Curated by Mortar Data

Monday, April 22 Cloudera Sessions (Toronto, Canada)

Tuesday, April 23 Natural Language Processing and Big Data (Washington, DC)

Wednesday, April 24 Big Data @ Yelp -- taming the reviews & recommendations (San Jose, CA)

Thursday, April 25 Data in the Big City (New York, NY)

Thursday, April 25 Power in Numbers: Growing Atlanta's Data Science Talent (Atlanta, GA)

Thursday, April 25 Bigvis: visualising 100,000,000 observations in R with Hadley Wickham (New York, NY)

Saturday, April 27 Map Reduce Programming - Deep Dive (Santa Clara, CA)