Data Eng Weekly

Hadoop Weekly Issue #93

26 October 2014

Given the torrent of Strata + Hadoop World news last week, it’s no surprise that this week’s edition is a bit shorter than normal. With that said, the amount and quality of technical content in this edition is above average—posts on Storm, HBase, HDFS metadata, Docker-in-YARN, and much more. Security is also a hot topic this week—in addition to technical posts, Cloudera announced that Cloudera Enterprise has achieved PCI compliance.


The Yahoo Storm team has written about Storm at Yahoo. The post describes the history of Storm’s adoption at Yahoo and some early products powered by it. It then describes a number of improvements that Yahoo made (including a Netty-based messaging setup, several security features, multi-tenancy, and Storm-on-YARN). Finally, there are some notes on new features in the works.

This post shows how to use Cascading to run a TopK query with the recently added Tez backend. The post walks through its code examples in detail, and the full code is available on GitHub.
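For readers unfamiliar with the term, a TopK query returns the k highest-ranked items in a dataset, often the k most frequent values. The following is a minimal single-machine sketch in plain Python, not the Cascading or Tez code from the post; the function name `top_k` and the sample data are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter


def top_k(records, k):
    """Return the k most frequent items as (item, count) pairs, highest count first."""
    counts = Counter(records)
    # nlargest avoids sorting the full set of counts.
    # Ties on count are broken by the item itself so the result is deterministic.
    return heapq.nlargest(k, counts.items(), key=lambda pair: (pair[1], pair[0]))


# Hypothetical sample input
words = ["hive", "tez", "tez", "storm", "tez", "hive"]
print(top_k(words, 2))
```

A distributed engine would typically compute partial counts per partition and merge them before selecting the top k; this sketch skips that step.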

The Hortonworks blog has a post describing how to configure the HBase REST server, HiveServer, WebHDFS, and Oozie with SSL encryption and certificates. Once these services are configured for SSL, Apache Knox can then be configured to talk to the services over SSL, providing end-to-end encryption. The post has a lot of low-level details (including keytool commands and config file options) for this setup.

A post on the blog has an overview of the major differences between Sqoop 1 and Sqoop 2. Whereas Sqoop 1 is a standalone tool, Sqoop 2 is a client-server architecture with a management UI and command shell. The post also describes the status of Sqoop support in Oozie.

This post describes a system for analyzing trading data in real-time and in batch using several big data tools. The system ingests data in real-time into Kafka, implements a rule engine with Storm, stores data for dashboarding and visualization in Cassandra, and uses Hive to perform batch analysis on data loaded from Kafka into Hadoop using Camus. The source code for the project (called wolf) is available on GitHub.

HTrace is an open-source library from Cloudera for distributed tracing, inspired by Google’s Dapper paper. It’s used for finding bottlenecks in RPCs and distributed systems with low overhead. This presentation gives an overview of the tracing model, how to enable it with HBase, and more.

This presentation covers the upcoming Apache HBase 1.0 release. The talk covers the history of HBase, gives a brief introduction to the architecture, describes some major changes for the 1.0 release (co-locating Meta with Master, Region Replicas for improved availability, and more), and describes the upgrade path to 1.0 from previous versions (noting that Hadoop 1.x and Java 6 are not supported).

This post is an in-depth description of the various files that the HDFS NameNode and JournalNodes maintain to store HDFS metadata. A lot of things have changed in the setup with the introduction of HA NameNode, so it’s quite useful if you’re only familiar with the previous implementation. In addition to an overview of all the files, there’s also a description of several commands and settings related to HDFS metadata.

Kubernetes-YARN is a new project (currently in prototype/alpha) to provide a mechanism for running Docker containers (via the Kubernetes container cluster manager) alongside YARN applications. This introductory post describes the architecture and provides a walkthrough on a Vagrant-based single-node cluster in which an nginx Docker container is run on the YARN cluster.

Cloudera has two posts focusing on new features in the recently released CDH 5.2. The first post provides an introduction to Kerberos and LDAP, describes how they’re integrated into Impala, and shows how to set up Impala to run in a secure environment with LDAP and Kerberos enabled. The second post is about several new features in Hue, including a new Security app and improved dashboards for Search and Oozie.

In celebration of the 1-year anniversary of the release of Apache Hadoop 2.2.0, Xplenty has a three-part blog post on YARN. The posts look at some of the challenges in upgrading to YARN from pre-YARN Hadoop, the “renaissance” of Hadoop (i.e. the plethora of new projects, particularly SQL-on-Hadoop), and the rising popularity of Apache Spark.


This post looks at how American Express is using Hadoop for several new products and services. Hadoop applications are used to analyze transaction and social data in both real-time and in batch. The post includes details on their Hadoop deployment, which is built on 2U servers with 24 disk bays and dual-10GbE networking. They have also shared some performance numbers from a TeraSort run on a 255-machine chunk that was added to the cluster, in which they sorted a terabyte in 45s (this was in 2013). Many more details in the article.
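To put the TeraSort figure in perspective, here is a rough back-of-the-envelope calculation of the implied throughput. It assumes a decimal terabyte (10^12 bytes) and that all 255 machines shared the work evenly; neither detail is confirmed by the summary above, so treat the result as an estimate only.

```python
TERABYTE_BYTES = 10**12   # assumption: decimal terabyte
SORT_SECONDS = 45         # from the article summary
MACHINES = 255            # from the article summary


def throughput(total_bytes, seconds, machines):
    """Return (aggregate, per-machine) throughput in bytes per second."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / machines


aggregate_bps, per_machine_bps = throughput(TERABYTE_BYTES, SORT_SECONDS, MACHINES)
print(f"aggregate:   {aggregate_bps / 1e9:.1f} GB/s")    # about 22.2 GB/s
print(f"per machine: {per_machine_bps / 1e6:.1f} MB/s")  # about 87.1 MB/s
```

Note that this counts each byte once; a sort reads, shuffles, and writes the data, so the actual disk and network traffic per machine would be higher than this figure.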

The Insight Data Engineer Fellows Program is a six-week program to help engineers gain experience with data engineering technologies and meet folks from industry. Applications for the next session are due tomorrow, October 27th.

Videos of all keynotes from the recent Strata + Hadoop World have been posted on YouTube. There are also interviews with a number of folks in the Hadoop industry.

Xplenty, makers of a data integration platform built on Hadoop, have announced that they’ve raised $3 million in Series A financing. Datanami has more details on Xplenty and their product.

Datanami has two posts about last week’s Strata + Hadoop World. The first covers the keynote by Cloudera’s chief strategy officer, Mike Olson, in which he predicted Hadoop’s disappearance (i.e. people won’t spend as much time in the weeds getting the tech to work, they’ll focus on applications). The second post covers several announcements from the conference from the likes of Cray (a new Hadoop appliance), Revolution Analytics, Pentaho, and more.

Cloudera announced that Cloudera Enterprise has been “fully certified as compliant with Payment Card Industry (PCI) Data Security Standards.” The first company using the certified product is MasterCard.

Spark Summit East is taking place in New York City on March 18th and 19th, 2015. The call for presentations is open until December 5th.


Amazon Web Services has supported Spark on Elastic MapReduce (EMR) for over a year by way of a bootstrap action to install the software on the cluster. They’ve recently added support for Spark 1.1.0 on Hadoop 2.4.0 with the Hadoop AMI version 3.1+.


Curated by Mortar Data



HBase Meetup @ 4 Infinite Loop (Cupertino) - Monday, October 27

Galvanize Data Science Launch (San Francisco) - Wednesday, October 29

RHadoop - Scaling the R Language for Big Data Analysis (San Ramon) - Wednesday, October 29

Introducing Apache Flink, a New Approach to Distributed Data Processing (Pasadena) - Wednesday, October 29

Hadoop as a Service: Is the Market Now? Is Hadoop Ready for the Cloud? (Sunnyvale) - Thursday, October 30


CloudBreak: Hadoop on Docker (Ballwin) - Saturday, November 1


Hadoop Security: Managing Big Data in a Dangerous World (Urbana) - Tuesday, October 28


Indy Big Data Monthly Meetup (Carmel) - Wednesday, October 29


Spark Bake-Off (McLean) - Thursday, October 30

North Carolina

RDBMS on Hadoop? Talk & Hands-on Session from Splice Machine (Charlotte) - Wednesday, October 29


Show and Tell Night (Cambridge) - Tuesday, October 28

Big Data (Part 1): Overview (Plymouth) - Thursday, October 30


October Meetup (Mannheim) - Monday, October 27


Managing Data in a Hybrid Hadoop & RDBMS Environment (Brisbane) - Wednesday, October 29


Bucharest HUG, October Meetup (Bucharest) - Thursday, October 30