Data Eng Weekly

Hadoop Weekly Issue #85

31 August 2014

This week’s issue features a lot of good technical content covering Apache Storm and Apache Spark. There are also a number of releases—Apache Flink, Apache Phoenix, Cloudera Enterprise, and Luigi. In addition, Hortonworks announced a technical preview of Apache Kafka support for HDP, and SequenceIQ unveiled Periscope, an open-source tool for YARN cluster auto-scaling.


The eBay blog has a post about NameNode Quality of Service. While running a large cluster, they’ve found that certain jobs can cause major issues by overwhelming the NameNode with too many RPCs. To combat that, they’ve worked on the FairCallQueue, which replaces the NameNode RPC handler’s FIFO queue. The post details the status of the implementation and shows how the implementation performs in their tests.

A number of organizations are working hard to enhance Apache Storm on several fronts. The fronts include security/multi-tenancy (Kerberos authentication, Hadoop security integration, user isolation), scalability improvements, high availability for the Nimbus service, and enhanced language/tooling support. The Hortonworks blog has an in-depth article discussing these features and more.

On the topic of Storm, the Hortonworks blog has posted more Hadoop Summit curated content covering Storm. It highlights seven presentations, which cover scaling Storm, Pig on Storm, R and Storm, and more.

Apache Storm integrates well with Apache Kafka (more below in an announcement from Hortonworks), and this tutorial builds a local environment using Docker and Fig for testing. It uses that environment to build a system for streaming log data through Kafka, using the Trident API to implement an exponentially weighted moving average, and sending alerts from Storm using XMPP.

Continuing on the Storm theme, the Hortonworks blog has a post on performing micro-batching with Storm. The post focuses on implementing micro-batching with the Storm APIs (the Trident API provides micro-batching, too). This post describes three different ways to implement micro-batching and provides an example implementation using the "tick tuples" approach.

Switching gears to the first of several articles on Apache Spark, this post covers Bayesian Machine Learning on Apache Spark. It discusses integrating the PyMC framework with Apache Spark to implement Markov Chain Monte Carlo (MCMC) methods. There are five parts to the post—an introduction to MCMC methods, an overview of the PyMC python package and its API, integrating PyMC with Apache Spark, using the integration for topic modeling with MCMC, and performing distributed LDA on Spark with PyMC.

An upcoming release of Apache Spark will contain implementations of several common statistics functions found in many statistical computing packages like R and SciPy.stats. This post describes the new implementations, which cover correlations (spearman and pearson), hypothesis testing (chi-squared), stratified sampling, and random data generation.

The Lambda Architecture is a popular idea for building hybrid batch and speed (near-realtime) data processing systems. This tutorial provides an example of implementing this type of system using Apache Spark. In addition to the normal batch operation, Spark also has a micro-batch mode called Spark streaming. The same data processing function can be used by both the normal and streaming operation, as is demonstrated in the post. The accompanying source code (written in Scala) is available on github.

Cloudera has posted a new roadmap for Impala, its SQL on Hadoop system. The post recaps the features of Impala 1.2, 1.3, and 1.4, and it describes what will be delivered in version 2.0 (by end of 2014) and version 2.1 (in 2015). The highlights for 2.0 include new analytic window functions, spilling of queries to disk, and subqueries. The highlights of version 2.1 include long-anticipated support for nested data, CRUD for HBase, and an exciting feature for folks running in AWS—Amazon S3 integration.

The Hortonworks blog has a post on HTTPS for HDFS. The implementation makes use of client certificate for HTTPS client authentication, which in turn are verified by the HDFS daemons. It has details on the configuration changes required to enable and setup HTTPS as well as a walkthrough of the various SSL certificates that need to be generated (complete with example keytool invocations).


Qubole has written about how they see Hadoop complementing an existing data warehouse (DW) deployment. They suggest a DW is more appropriate for structured data, whereas (large amounts) of unstructured data are better handled by Hadoop. Workloads that require SLAs/predictable runtimes should use the DW, but Hadoop is good at ad hoc or fluctuating workloads (and this is a key area where Qubole’s cloud offering adds additional flexibility). It can be hard to find the right place to draw this line, so it’s interesting to hear from a vendor (who is likely working with customers for something like this).

As someone who has run a production Hadoop cluster, a number of the points and anecdotes in this article ring true. The overarching theme is that Hadoop is not particularly good at meeting SLAs partly because it’s easy to use Hadoop in unpredictable ways. The article has quotes from some Pepperdata folks about how their cluster orchestration software helps solve these types of issues.

This article is focussed on Hadoop for non-engineers, particularly folks in the healthcare industry. After giving a brief intro to the key components of Hadoop, it talks a bit about some of the implications to the healthcare industry. Specifically, there are a number of types of analyses that can be powered by Hadoop which couldn’t be done before. But with that said, there’s an interesting point that rings true in nearly every profession. The industry isn’t being held back by the data processing systems—the barrier is in acting on the data to improve healthcare.

Big news in the development process of Hadoop this week—the Apache Hadoop codebase has migrated from SVN to git.

Once a Hadoop cluster gets to a certain size, there are ultimately conflicts related to required native libraries on the cluster. Docker, which provides a mechanism to running an isolated environment on a linux host, has great promise for solving this and other types of issues. GigaOm has an article that describes the state of the YARN and Docker integration, which is being lead by Altiscale.


Luigi, the batch processing framework for Hadoop, released version 1.0.17 this week. The new release has a number of fixes and improvements, including support for storing data in ElasticSearch, support for loading JSON data into redshift, an FTP task, and a new luigi command.

Apache Flink (incubating), previously known as Stratosphere, has released version 0.6. Flink is a data processing engine built atop of YARN, targeting iterative processing and data streaming. The release includes over 100 resolved tickets, which cover things like support for POJO and a new AvroOutputFormat.

Cloudera Enterprise 5.1.2 (which includes CDH 5.1.2 and Cloudera Manager 5.1.2) and CDH 5.0.4 were released. The CE 5.1.2 release includes a number of fixes covering nearly every component in the CDH stack. CDH 5.0.4 also includes a number of fixes across the stack.

Apache Phoenix 3.1 (for HBase 0.94.4+) and 4.1 (for 0.98.1+) were released earlier this week. Both releases contain a number of bug fixes, use of nested tables in queries, a Pig loader, and more. On top of that, the 4.1 release supports distributed tracing and local indexes.

Hortonworks has announced a technical preview of Apache Kafka for HDP 2.1. A post on the Hortonworks blog introduces Kafka and explains how it fits well with Apache Storm.

Periscope is a new open-source tool from SequenceIQ for auto-scaling and enforcing SLAs for YARN clusters. For a static cluster, it offers tools to enforce time-based and cluster capacity SLAs. In cloud environments, it can increase cluster capacity by spinning up new nodes. The code is available on github as part of a public beta. It’s used internally at SequenceIQ, but it relies on unreleased features of Apache Hadoop and Apache Ambari.


Curated by Mortar Data ( )



What is Practical Data Science? Co-hosted with Palo Alto Data Science Foundation (Menlo Park) - Thursday, September 4

Resistance Is Futile: What You Need to Know about Big Data (San Francisco) - Thursday, September 4


Apache Spark Night - Show and Tell (Austin) - Tuesday, September 2

Introduction to Hadoop Course, Part 1: Hadoop and Its Ecosystem (Austin) - Saturday, September 6


Understanding Your Customer’s Buying Journey Using Path Analysis on Hadoop (Phoenix) - Wednesday, September 3


Hadoop Security Deep Dive (Toronto) - Thursday, September 4


Managing Hadoop Workflows in the Enterprise + Jumpstart your Big Data Projects (Stockholm) - Monday, September 1


Large Datasets with WEKA + Big Data Use Cases & Industry Trends (Auckland) - Tuesday, September 2


OCG Meetup: Hadoop (Vienna) - Thursday, September 4


SQL en Hadoop: Un Gran Paso Adelante! (Mexico City) - Friday, September 5


Bangalore Hadoop - Big Data Meetup (Bangalore) - Saturday, September 6