Data Eng Weekly

Hadoop Weekly Issue #94

02 November 2014

Hadoop in the cloud (both open and public) is a big topic again this week. There are articles on Hortonworks' HDP in the Microsoft Azure cloud, Cloudera’s new cloud provisioning tool Cloudera Director, OpenShift, and SequenceIQ’s Cloudbreak. Also, there are several articles this week on Hadoop adoption, which seems to be limited by the maturity of enterprise features. Finally, Kafka released version 0.8.2-beta this week, and a new project aims to provide higher throughput from Kafka for MapReduce jobs.


As the Hadoop ecosystem grows and folks use it in many different ways, integration between projects and consistency across projects are both important parts of usability. This article highlights several ways that the Hadoop ecosystem could improve along those lines. It’s just the tip of the iceberg; hopefully these things get better as Hadoop matures.

In the first part of a three-part series on HBase, this post presents an introduction to HBase’s data model and architecture. It also contains instructions on setting up a local HBase and interacting with it using the HBase shell.

The Cloudera Blog has a post on integrating the Kite SDK with OpenShift. Specifically, the Kite SDK has tooling for running mini clusters (HDFS, Hive, Flume, HBase, ZooKeeper) both in-process for testing and locally via the command line. The post introduces these tools and describes work to add support for running a mini cluster via OpenShift to the command-line tools.

Hortonworks has posted a recording of and slides from a recent webinar on Apache Knox and Ranger, which are the main enterprise security products in their distribution. In addition, the post includes several questions and answers related to the offering. For anyone interested in enterprise security, this is a good overview of the current state of Hortonworks’ offerings.

While not directly related to Hadoop, this post summarizes a recent paper out of Facebook on their f4 BLOB storage system. The review notes that f4 is built atop HDFS, and it describes how it gets around several HDFS limitations (namely adding cross-datacenter replication and using erasure coding to decrease replication factors). Definitely one of the more technical posts linked in this newsletter, but it’s quite interesting.
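
To make the storage savings concrete, here is a rough sketch of the arithmetic comparing plain replication with Reed-Solomon-style erasure coding. The (10, 4) parameters and the function names are illustrative assumptions, not figures taken from the post.

```python
def replication_overhead(copies: int) -> float:
    """Bytes stored per byte of user data with plain replication."""
    return float(copies)


def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Bytes stored per byte of user data when each stripe of
    `data_blocks` blocks carries `parity_blocks` extra parity blocks."""
    return (data_blocks + parity_blocks) / data_blocks


# Assumed example parameters: 3 full copies vs. a (10, 4) code.
print(replication_overhead(3))   # 3.0
print(erasure_overhead(10, 4))   # 1.4
```

Under these assumptions, a stripe can lose any 4 of its 14 blocks and still be rebuilt, while storing 1.4 bytes per byte of data instead of 3.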

Label-based scheduling is a system for tagging resources in a heterogeneous cluster and supplying Boolean rules for scheduling jobs against these resources. The MapR blog has an overview of this feature for MapR’s distribution, including a description of how it integrates with the Capacity Scheduler and Fair Scheduler. The community is looking to add a similar implementation to core Hadoop as part of YARN-796.
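
The general idea can be sketched in a few lines. This is a toy illustration only: the node names, the labels, and the helper functions are hypothetical, and the rule syntax is not MapR's or YARN's.

```python
from typing import Callable, Dict, List, Set

# Hypothetical cluster: each node carries a set of labels.
NODE_LABELS: Dict[str, Set[str]] = {
    "node1": {"ssd", "large-mem"},
    "node2": {"ssd"},
    "node3": {"gpu", "large-mem"},
}

# A rule is a predicate over a node's label set.
Rule = Callable[[Set[str]], bool]


def has(label: str) -> Rule:
    return lambda labels: label in labels


def and_(a: Rule, b: Rule) -> Rule:
    return lambda labels: a(labels) and b(labels)


def or_(a: Rule, b: Rule) -> Rule:
    return lambda labels: a(labels) or b(labels)


def not_(a: Rule) -> Rule:
    return lambda labels: not a(labels)


def eligible_nodes(rule: Rule) -> List[str]:
    """Return the nodes whose labels satisfy the Boolean rule."""
    return sorted(n for n, labels in NODE_LABELS.items() if rule(labels))


# A job that wants large-memory nodes without a GPU.
print(eligible_nodes(and_(has("large-mem"), not_(has("gpu")))))  # ['node1']
```

A real scheduler would evaluate such a rule while allocating containers; here it only filters a static node list.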

Spark vs. Tez has been a point of contention for a while now. Spark has gained momentum recently with several companies (including MapR and Cloudera) committing to it. Hortonworks, the main proponent of Tez, continues to tout Tez with a prototype implementation of the Spark API using Tez as the backend. In other words, it’s Spark on Tez on YARN (with data in HDFS). There is a discussion of the prototype and some benchmarks on the Hortonworks blog (as always, beware of vendor benchmarks: they’re typically not representative of your own workload).

This post shows how to launch an HDP cluster on the Microsoft Azure cloud. Azure has a wizard for building both small-scale evaluation clusters and standard clusters (which have up to 45 worker nodes).

In another Hadoop-in-the-cloud post, Cloudera has an introduction to the new Cloudera Director for deploying CDH clusters in the cloud (supporting AWS initially). The post describes the data model, the server API, the user interface, and the client.

This post introduces the Cloudbreak shell. Cloudbreak is a Hadoop-as-a-Service system for deploying Hadoop clusters in the cloud. The post walks through setting up the command-line tools and provisioning a Hadoop cluster.

MapR has a video (and transcript) of a whiteboard presentation comparing and contrasting Spark Streaming and Storm Trident. Both systems are micro-batching streaming frameworks. The presentation covers fault tolerance, ease of deployment, compatibility with YARN, and more.


Databricks and Hortonworks have announced an expanded partnership. As part of the expansion, the two companies are collaborating on customer support, engineering (namely enterprise features like security), and open source. Cross posts on the Hortonworks and Databricks blogs have takes from both companies on the expanded partnership.

EnterpriseTech has a post on the growth and adoption of Hadoop. It cites industry research and surveys as well as interviews with Hadoop vendors. The key takeaway seems to be that enterprise adoption isn’t quite there yet (only about 2,000 production deployments of Hadoop) but is on the verge of hockey-stick growth.

SDTimes has an in-depth look at the Hadoop ecosystem. The article explores the various applications of Hadoop, its costs, tooling/support for ad hoc queries, language and library support for data science, security, and more.

SearchDataManagement also has a post about Hadoop adoption. This article interviews consultant and author Joe Caserta, who has been a bit surprised by the lack of adoption of Hadoop. The Q&A strives to explain why: maturity, support for interactive queries, and data governance are among the reasons.


Kangaroo is a new open-source project from Conductor for writing MapReduce jobs that consume data from Kafka. The introductory post explains Conductor’s use case: loading data from Kafka to HBase by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions, which are limited to a single InputSplit per Kafka partition, Kangaroo can launch multiple consumers at different offsets in the stream of a single partition for increased throughput and parallelism.
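
The splitting idea can be illustrated with a small sketch. It is not Kangaroo's actual API: the `offset_splits` function and its parameters are hypothetical, and it only shows how one partition's offset range might be carved into several independent ranges that separate tasks could read.

```python
from typing import List, Tuple


def offset_splits(start: int, end: int, max_per_split: int) -> List[Tuple[int, int]]:
    """Divide the half-open offset range [start, end) of one partition
    into consecutive ranges of at most `max_per_split` messages."""
    if max_per_split <= 0:
        raise ValueError("max_per_split must be positive")
    return [
        (lo, min(lo + max_per_split, end))
        for lo in range(start, end, max_per_split)
    ]


# One partition holding offsets 0..9, read as four independent ranges.
print(offset_splits(0, 10, 3))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Each resulting range could back its own input split, so a single large partition no longer limits the job to one map task.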

Amazon Web Services has updated their Amazon Kinesis Storm Spout in order to support Storm’s Ack/Fail semantics (the spout can re-emit messages). They’ve also published a white paper with a reference architecture.

Apache Kafka 0.8.2-beta was released. The new version contains a new Java producer, support for deleting topics, Scala 2.11 support, and a new configuration option to prefer consistency over availability.

As the repository name suggests, this is a project for building a Docker image that allows running Hive on Tez. The project README has details on building the image, running it, testing it with some built-in scripts, and more.

Kylin is the recently open-sourced OLAP system from eBay. SequenceIQ has a Docker image for running Kylin, which includes support for Apache Ambari for managing the cluster.

Version 4.0 of Platfora, the analytics platform built with Hadoop and Spark, was released. The new release has new visualization and geo-analytics tools as well as insight delivery for sharing visualizations over email.


Curated by Mortar Data



Introducing Apache Flink: A New Approach to Distributed Data Processing (Palo Alto) - Tuesday, November 4

State of Apache HBase, 1.0 Release, by Nick Dimiduk of Hortonworks (Los Angeles) - Thursday, November 6


Pivotal Business Data Lake (Tempe) - Wednesday, November 5


November Meetup: Clickstream Data Monetization Using Datameer (Fort Worth) - Thursday, November 6


Spark Gotchas and Anti-Patterns, plus Julia Language (Broomfield) - Wednesday, November 5


Unit Testing with Hadoop, plus Spark and Storm (Mayfield Village) - Monday, November 3

North Carolina

ORM for HBase (Durham) - Tuesday, November 4

IBM's Hadoop Integration with SAS Analytics: Using Hive (Durham) - Thursday, November 6

New Jersey

Fraud Detection = Spark + memSQL (Hamilton Township) - Tuesday, November 4


Data at a SaaS Company (Melbourne) - Wednesday, November 5


Offline and Real-Time Click Stream Processing (Amsterdam) - Thursday, November 6


Introduction to the Hadoop Ecosystem + Forming the HUG (Oslo) - Thursday, November 6


Hadoop Meetup @ IBM EGL (Bangalore) - Friday, November 7

Hadoop Hands-on/Demo, plus Big Data Industry Trends and Opportunities (Chennai) - Saturday, November 8