Data Eng Weekly

Hadoop Weekly Issue #29

04 August 2013

Lots of slides and videos were posted online from HBaseCon and OSCon, and there are some great technical articles about several parts of the Hadoop stack -- from bootstrapping hardware with Dell Crowbar and Cloudera Manager to building your first application with the Hortonworks Sandbox. This week also saw a few exciting releases -- Parquet and Hadoop 1.2.1, as well as two new episodes of the All Things Hadoop podcast. Enjoy!


OSCon was July 20-24th in Portland, and there were a number of talks about software from the Hadoop ecosystem. Slides posted online include talks covering Apache ZooKeeper, Apache Hive/HBase, Apache Hadoop, Luigi (a Python workflow framework with Hadoop support), and Cascalog (data processing for Hadoop). There are also a ton of other interesting presentations unrelated to Hadoop.

Scott Leberknight of Near Infinity posted an overview of Cloudera's Impala as well as a bake-off between Impala and Hive running in a VM on a MacBook Pro. With the caveat that the tests are unscientific (and that he doesn't explain how he chose the queries), he found Impala to be between 12x and 39x faster than Hive 0.11. It's interesting to see these results, but I'd also like to see results on real datasets on more than one machine.

Apache Drill is an Apache Incubator project that aims to provide low-latency, scalable SQL-on-Hadoop. The folks from MapR are doing a lot of work on Drill, and they ran a workshop on Drill at OSCon. There are a lot of slides, covering the goals, features (such as supporting nested, schema-less data), the execution layer (vectorization, columnar storage, MPP query engine), and runtime compilation in Java.

Episode #9 of the All Things Hadoop Podcast features an interview with Paco Nathan about Apache Mesos (and how it compares to YARN) and Cascading (and how it fits into the common big data workflow). The page linked below contains a recap of the discussion, as well as a link to download the audio.

The 10th episode of the All Things Hadoop Podcast was also posted this week. Tom White, the author of Hadoop: The Definitive Guide, is the guest, and he speaks with host Joe Stein about the Cloudera Development Kit, Parquet, and Apache Bigtop.

GigaOm has an interview with folks from Airbnb about their offline data infrastructure and what features it is (or will be) powering. The story also provides some more information on how Airbnb uses Apache Mesos to run Storm, Hadoop, and Spark.

Cloudera has posted a number of videos from HBaseCon 2013, with a few more to come. The links to individual talks from the schedule page include links to slides and videos (if available).

Dell Crowbar is a provisioning and configuration system that manages all parts of the hardware and software stack -- from bare-metal to higher-level software frameworks like Hadoop. In this post, Mike Pittaro from Dell gives an overview of Crowbar and how it builds a Hadoop cluster from bare-metal by making use of the Cloudera Manager APIs after an OS is installed.

The folks from Endgame open-sourced BinaryPig, their Apache Pig-based system for doing binary data extraction to find malware. They also have a Django/ElasticSearch webapp for doing exploratory analysis on data generated from Pig.

Brenden Matthews of Airbnb presented on running Apache Hadoop on Apache Mesos. Recently, they moved from Amazon's Elastic MapReduce to running Hadoop on Mesos in EC2. In order to do so, they built a Mesos Scheduler that runs atop the JobTracker to launch TaskTrackers for each job.

Getting started with Hadoop can be overwhelming -- many vendors provide VMs to get you started, but figuring out where to go from there is a challenge. In this post, the author uses the Hortonworks Sandbox (which packages HUE -- a detail I hadn't realized before) to upload data, run a Pig script to load the data into Hive (via HCatalog), and query the derivative dataset using Hive. The tools for beginners have come a long way in the past few years, and it's cool to see various parts of the ecosystem fit together right out of the box.

The Apache HBase blog has a post about adding support for data types to HBase. If you've worked with HBase, you're familiar with translating your objects to and from byte[], which is the only data type that the HBase API supports. After getting through some information theory and clarifying the statement "HBase is a database," the post talks about proposals for new data encoding and data type APIs in HBase.
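
To make the byte[] translation concrete, here is a minimal Python sketch (not taken from the post, and not the HBase API) of one concern a data encoding has to handle. HBase sorts rows by the raw bytes of the key, so a naive two's-complement encoding places negative integers after positive ones. The function names `encode_int32` and `decode_int32` are hypothetical, and the offset trick shown is just one common way to get an order-preserving encoding.

```python
import struct

SIGN_OFFSET = 1 << 31  # shifts the signed 32-bit range onto 0 .. 2**32 - 1


def encode_int32(value: int) -> bytes:
    """Encode a signed 32-bit int so that byte order matches numeric order.

    Adding 2**31 is equivalent to flipping the sign bit; the result is
    written as an unsigned big-endian integer. struct raises struct.error
    if the value is outside the signed 32-bit range.
    """
    return struct.pack(">I", value + SIGN_OFFSET)


def decode_int32(data: bytes) -> int:
    """Inverse of encode_int32."""
    return struct.unpack(">I", data)[0] - SIGN_OFFSET


values = [5, -3, 0, -100, 42]

# Plain big-endian two's complement: negatives sort after positives.
naive_order = sorted(values, key=lambda v: struct.pack(">i", v))

# Offset encoding: byte-wise order agrees with numeric order.
encoded_order = sorted(values, key=encode_int32)
```

With the naive encoding, a byte-ordered scan would visit the non-negative values before the negative ones; with the offset encoding, the scan order matches the numeric order.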


Cloudera has launched a new medium for discussing and asking for help with Cloudera software -- the Cloudera community forums. The forums aim to complement existing solutions such as the various mailing lists that Cloudera runs.

Transparency Market Research is expecting the Hadoop market to reach $20.9 billion by 2018 (based upon a market of $1.5 billion in 2012 and a compound annual growth rate of 54.7%). The article notes that the lack of qualified Hadoop experts and the lack of awareness in large and mid-sized businesses will hamper adoption. But they expect growth in the telecommunications industries and Europe to accelerate in the next few years. The report also covers some of the major players in the field, and what to expect from them in the next few years.
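
As a quick sanity check on those numbers, the sketch below compounds the 2012 figure at the quoted growth rate. It assumes six compounding periods (2012 through 2018), which is my reading rather than something the report summary states, and the function name is hypothetical.

```python
def project_market_size(base_billions: float, annual_rate: float, years: int) -> float:
    """Compound a starting market size at a fixed annual growth rate."""
    return base_billions * (1 + annual_rate) ** years


# Assumption: 2012 -> 2018 is treated as six full years of growth.
projected_2018 = project_market_size(1.5, 0.547, 6)
print(f"Projected 2018 market: ${projected_2018:.1f} billion")
```

Under that assumption the result lands a little under the reported $20.9 billion, so the headline figure is at least in the right neighborhood; the gap may come from rounding or a different base period.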

GridGain, which offers an in-memory computing platform, announced that it has raised $10M in Series B funding. GridGain seems to be focusing on real-time processing, but it also offers a Hadoop-compatible file system called GridGain FS, which improves performance by keeping data in memory (either all of the data, or a subset acting as a read/write-through cache).


Parquet, a columnar storage format for Hadoop, hit version 1.0 this week. The artifacts are available on Maven Central, and the release is compatible with the key players in the Hadoop ecosystem. Specifically, Parquet 1.0 supports MapReduce, Pig, Hive, Cascading, and Impala across Hadoop versions 1.0 and 2.0. It has support for working with Avro and Thrift records, dictionary encoding, and bit-packing/run-length encoding. There were 18 contributors to the codebase leading up to this release.
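
For readers unfamiliar with those encodings, here is a conceptual Python sketch of dictionary encoding followed by run-length encoding on a single column. It is not Parquet's on-disk format or API; the function names are hypothetical, and bit-packing is left out to keep the example short.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = []  # code -> original value, in first-seen order
    index = {}       # original value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Expand the runs and look each code up in the dictionary."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
```

Columnar layouts suit this approach because the values of one column sit next to each other, so low-cardinality columns tend to produce small dictionaries and long runs.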

Version 0.5.0 of the Cloudera Development Kit was released. It has several new features and improvements, such as the ability to run examples on the host machine without a VM, the ability to parse XML and HTML input, and an upgrade to Parquet 1.0.0 and Crunch 0.7.0.

The Hadoop 1.x line hit its first stable release since February with this week's release of 1.2.1. The release contains 18 bug fixes atop the 1.2.0 release. The 1.2 line features a number of enhancements over the 1.1.x line, including backports of DistCp v2 and the Offline Image Viewer, WebHDFS enhancements, and web services for the JobTracker.