Data Eng Weekly

Hadoop Weekly Issue #48

15 December 2013

There was a lot of coverage this week of the business models of key Hadoop vendors—the discussion seems to be spurred by the expectation of a big 2014 for Hadoop. It certainly seems to be shaping up that way with YARN starting to see production deploys and lots of ecosystem projects maturing. On the technical side of news this week, Netflix open-sourced their event logging system, Suro, Amazon Web Services announced support for Impala, and Hortonworks announced technical previews for Storm and Apache Knox. And there’s much more to sink your teeth into below.


The Altiscale blog has a post highlighting how opaque and difficult it is to find the root cause when debugging a MapReduce job. The post also raises an interesting point about recoverable vs. unrecoverable errors: MapReduce retries all tasks whether an error is recoverable or not. The post concludes that debugging Hadoop is too difficult, but that it will get better in the future.

[…] is using HBase and Hadoop to store and analyze its data, which has grown from 4 petabytes to 10 petabytes over the past year. This article is a high-level overview of the pieces they use, including a Hadoop and HBase implementation of the Germline algorithm for DNA analysis.

Netflix has open-sourced its event logging system, Suro, which is a descendant of the Apache Chukwa project. The Netflix techblog covers the history of Suro, its architecture, and how it’s integrated with the Druid real-time analytics system and Elasticsearch. The post also includes some performance numbers from stress testing Suro.

Videos from the recent Spark Summit have been published online. There are a number of interesting talks, including several related to the relationship between Hadoop and Spark.

The Hortonworks blog has a post about the state of JVM support for Hadoop (focusing on the Hortonworks Data Platform). The quick summary is that Hadoop is becoming less finicky about which JVMs it supports: it used to work well only with certain builds of the Sun JDK 6. In contrast, HDP 2 now supports Oracle JDK 6 and 7 and OpenJDK 7.

Sqoop is a system for exchanging data between HDFS/HBase and a relational database. To do so, Sqoop typically uses JDBC connections with password-based authentication. This tutorial from the Hortonworks blog shows how to use the Oracle Wallet to authenticate Sqoop with an Oracle database.

The Cloudera blog has an article on HBase compaction. The article covers the motivation (and trade-offs), how files are selected for compaction in CDH 4 (HBase 0.94), and some changes to compaction in CDH 5 (HBase 0.96).

The Hortonworks blog has a post on Hadoop Security. The post summarizes the existing security from several Hadoop-ecosystem projects—authentication via Kerberos, authorization via HDFS file permissions/HBase ACLs/etc, audit logging for HDFS/MapReduce/Hive/Oozie, and encryption for RPC/HTTP/JDBC/etc. It then goes on to discuss security plans for the future—the Apache Knox gateway, token-based authentication, granular authorization for Hive/ACL support in HDFS, improved accounting/auditing, and improved data encryption.

Search Storage has a guide (with many links to related articles) to Hadoop technology. The guide covers the basics of Hadoop technology, dealing with Hadoop pain points, understanding Hadoop storage, and how Hadoop technology works with the cloud.

The Cloudera blog highlights some of the new features in Cloudera Manager 5. Those include: workload & resource management (new and improved for both YARN and cgroups), service extensibility (the ability to use CM to deploy custom services), and monitoring enhancements (new graph types, Impala/YARN support). The post also covers some new features to expect in Beta 2 of Cloudera Manager this spring.

InfoWorld has some coverage of the new Apache incubator project, Twill. Twill is a high-level framework for implementing a new application on YARN. The article includes an interview with John Gray, CEO of Continuuity (which originally wrote Twill), who shares some thoughts on the motivation for joining Apache and future plans for Twill.


GigaOm has coverage of the never-ending “Hadoop war of words.” The story recaps a recent post on the Hortonworks blog that explains the company’s business model (and why they think it’s best). The GigaOm story also has some background on the public battle and more details from the Hortonworks post.

I found this overview of the business/software models of Hortonworks, Cloudera, and MapR to be a concise guide to a complex topic. The business plans and open-source models seem to be a spectrum with Hortonworks (100% open-source) on one side and MapR (the most proprietary features) on the other.

If you can’t get enough about Hadoop vendor business models, then there’s one more article you should read. An article on Datanami first highlights that 2014 will see years of R&D on YARN make it into customers’ hands, and then goes into detail about the business models of Hortonworks and Cloudera.


Amazon announced support for Cloudera Impala running atop Elastic MapReduce. Impala can be added as an ‘additional application’ when starting a Hadoop 2.x EMR cluster. It doesn’t look like AWS has implemented an Impala-S3 bridge: data must be stored in HDFS or HBase on the EMR cluster in order to be queried.

Cloudera has renamed the Cloudera Development Kit to Kite, and released Kite version 0.10.0. The new release is just a rename vs. CDK v0.9.0. The goal of the rename is to expand the application of the SDK, since it’s useful outside of Cloudera’s CDH.

Hortonworks announced the availability of the previously publicized Storm integration with HDP as a Technical Preview. The preview bundles Storm, with a “fully certified version” to follow in H1 2014.

Hortonworks also announced a technical preview of the Knox Gateway, an Apache incubator project. Knox is a REST gateway to all parts of the Hadoop ecosystem (HDFS, MapReduce, Hive, etc.). The tech preview is based upon the 0.3.0 release of Knox.


Curated by Mortar Data

Tuesday, December 17

Dec. 17 Meetup: Performance Update, MADlib, and More (Palo Alto, CA)

Wednesday, December 18

Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale, CA)

Dec Meetup (Pittsburgh, PA)

Big Data Use Case - Logging with Apache Flume and Hadoop/HDFS (Arlington, VA)

Thursday, December 19

Real-time Streaming and Data Pipelines with Apache Kafka (New York, NY)