Data Eng Weekly

Hadoop Weekly Issue #111

08 March 2015

This week there were three announcements of new open-source tools/projects for the Hadoop ecosystem: Airpal (for Presto DB) from Airbnb, JSON utils for Apache Pig from Mortar, and a new kafka-mesos framework. There are several good technical articles across encryption, YARN, Hive, Spark, Presto, Samza, and more. The Tachyon project also gets two mentions this week, including an analysis suggesting it's picking up steam.


Encryption, which plays an important role in compliance, was recently added to many components in the Hadoop ecosystem. Intel (which is an investor in Cloudera) played a major role in the implementation as well as in adding speedups for Intel processors (using Intel AES-NI). In this post, Intel describes the basics of Hadoop encryption using hardware acceleration and provides some performance results from a single node running CDH 5.2.

The MapR blog has a guide describing how to run MapReduce jobs in non-JVM languages with Hadoop Streaming (not to be confused with stream processing). The guide describes how Hadoop Streaming exchanges data with the external process, and it has examples for Perl and Python.
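To give a flavor of the data exchange, here is a minimal word-count sketch (not taken from the MapR guide). It assumes the usual Hadoop Streaming contract: the external process reads records line by line on stdin and writes tab-separated key/value lines to stdout, and the reducer sees its input sorted by key. The function names and the "map"/"reduce" argument convention are hypothetical.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming job could call (hypothetical convention:
    pass "map" or "reduce" as the first argument)."""
    step = map_lines if argv[1:2] == ["map"] else reduce_lines
    for record in step(stdin):
        stdout.write(record + "\n")


# Local simulation of map -> sort (standing in for the shuffle) -> reduce.
sample = ["the quick brown fox", "the lazy dog"]
counts = list(reduce_lines(sorted(map_lines(sample))))
```

Keeping the map and reduce logic in plain generator functions means the same code can be checked locally with a sort in between before it is submitted to a cluster.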

Hortonworks has a technical overview of the Hive 0.14 cost-based optimizer (CBO). The post describes the steps of query optimization, several types of optimizations (e.g. for left-deep joins and bushy joins), illustrated examples of how the CBO improves join order, join simplification, and some performance numbers.

Hortonworks has a second post in a series describing fault-tolerance for long-running services in YARN. This post describes new features in the ApplicationMaster (AM). Previously, a restart of the AM would restart all containers it launched. The post describes the strategy that YARN uses to preserve running containers as well as the new AM retry window, which removes the upper limit on the number of AM restarts.

The Apache Spark community has been working on DataFrames, a new type of distributed collection. There was recently a presentation on DataFrames at the Spark User Meetup—the slides and video are now available online.

This post has a lot of interesting ideas about replacing a database (described as mutable global state) with a distributed immutable event log. There's a video of a presentation from Strange Loop, and the presentation has also been transcribed (with slides inline). The talk describes how to implement a number of these ideas with Apache Kafka and Apache Samza, and it includes many important ideas for anyone working with distributed systems.
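The core idea can be sketched in a few lines of plain Python. This is only an in-memory illustration, not code from the talk and not the Kafka or Samza API: facts are appended to a log and never updated in place, and the "current state" is a view derived by replaying the log. The Event, EventLog, and materialize names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """An immutable fact: something that happened, never edited afterwards."""
    key: str
    delta: int


class EventLog:
    """Append-only log; each appended event gets a monotonically increasing offset."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> Tuple[Event, ...]:
        return tuple(self._events[offset:])


def materialize(log: EventLog, offset: int = 0) -> Dict[str, int]:
    """Derive a view (here, a balance per key) by folding over the log."""
    view: Dict[str, int] = {}
    for event in log.read_from(offset):
        view[event.key] = view.get(event.key, 0) + event.delta
    return view


log = EventLog()
log.append(Event("alice", 100))
log.append(Event("bob", 50))
log.append(Event("alice", -30))
balances = materialize(log)
```

Because the log is the source of truth, the view can be thrown away and rebuilt at any time, or a second, differently shaped view can be derived from the same events.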

This post describes some of the types of applications and challenges that arise with the internet of things, which produces lots of timeseries data. NoSQL databases like HBase and MapR-DB are optimized for common queries on these types of data, and the post describes further techniques for optimizing the storage footprint (via the schema that OpenTSDB uses).
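As a rough illustration of why this kind of schema saves space, here is a sketch of an OpenTSDB-style row key. It assumes the commonly documented layout (a fixed-width metric UID, a base timestamp aligned to the hour, then fixed-width tag key/value UIDs, with the offset within the hour kept in the column rather than the row key). The 3-byte UID width, the example UIDs, and the helper names are assumptions for illustration, not taken from the post.

```python
import struct

UID_WIDTH = 3            # bytes per metric/tag UID (assumed width)
ROW_SPAN_SECONDS = 3600  # assumed: one row per series per hour


def uid_bytes(uid: int) -> bytes:
    """Encode a numeric UID as a fixed-width big-endian byte string."""
    return uid.to_bytes(UID_WIDTH, "big")


def row_key(metric_uid: int, timestamp: int, tag_uids) -> bytes:
    """Build the row key: metric UID + hour-aligned timestamp + tag UID pairs."""
    base = timestamp - (timestamp % ROW_SPAN_SECONDS)
    key = uid_bytes(metric_uid) + struct.pack(">I", base)
    # Sort the pairs so the same series always produces the same key.
    for tagk, tagv in sorted(tag_uids):
        key += uid_bytes(tagk) + uid_bytes(tagv)
    return key


def column_offset(timestamp: int) -> int:
    """Seconds past the start of the row's hour; stored per data point."""
    return timestamp % ROW_SPAN_SECONDS


# Two readings from the same (hypothetical) sensor, 60 seconds apart.
key_a = row_key(metric_uid=1, timestamp=1425816123, tag_uids=[(7, 42)])
key_b = row_key(metric_uid=1, timestamp=1425816183, tag_uids=[(7, 42)])
```

Both readings land in the same row because the key only changes once per hour, so the long metric and tag names are stored as short UIDs once per row rather than once per data point.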

This post, the first in a series, looks at building a similarity graph for health-care providers and computing personalized PageRank to identify anomalies. The analysis uses the Medicare Part-B public domain dataset, and the software (Pig and Python) is available on GitHub.
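For readers unfamiliar with personalized PageRank, here is a minimal power-iteration sketch in plain Python. It is not the code from the post: the toy graph, node names, damping factor, and iteration count are all assumptions. The only difference from ordinary PageRank is that the random surfer always teleports back to a single source node, so the scores measure closeness to that node.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Power iteration where teleportation always returns to `source`.

    `graph` maps each node to a list of neighbors; every neighbor must
    also appear as a key.
    """
    rank = {node: (1.0 if node == source else 0.0) for node in graph}
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    nxt[neighbor] += share
            else:
                # Dangling node: return its mass to the source.
                nxt[source] += damping * rank[node]
        # Teleport step: the remaining probability mass jumps to the source.
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank


# Hypothetical similarity graph between four providers.
graph = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "b"],
    "d": ["c"],
}
scores = personalized_pagerank(graph, source="a")
```

In this toy graph nothing links to "d", so it receives no score relative to "a"; nodes that score unexpectedly high or low relative to their peers are the kind of thing an anomaly analysis would flag for a closer look.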

The Cloudera blog has a guest post describing how to use Apache Spark for financial analytics calculations. Specifically, the authors look at how to calculate credit valuation adjustments, which use Monte-Carlo simulations. There is some sample code, using PySpark and Spark's machine learning library (MLlib).
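To make the calculation concrete, here is a heavily simplified single-machine sketch in plain Python (not PySpark, and not the code from the post). It assumes a driftless random walk for the portfolio value, a constant hazard rate for counterparty default, and a flat discount rate. All parameter values and function names are hypothetical.

```python
import math
import random


def simulate_expected_exposure(n_paths, n_steps, dt, volatility, rng):
    """Average positive exposure max(V_t, 0) at each time step across paths."""
    totals = [0.0] * n_steps
    for _ in range(n_paths):
        value = 0.0
        for step in range(n_steps):
            value += volatility * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            totals[step] += max(value, 0.0)
    return [total / n_paths for total in totals]


def cva(expected_exposure, dt, hazard_rate, recovery, discount_rate):
    """Sum discounted expected exposure times default probability per interval."""
    total = 0.0
    for i, exposure in enumerate(expected_exposure, start=1):
        t_prev, t = (i - 1) * dt, i * dt
        # Probability of default between t_prev and t under a constant hazard rate.
        default_prob = math.exp(-hazard_rate * t_prev) - math.exp(-hazard_rate * t)
        total += math.exp(-discount_rate * t) * exposure * default_prob
    return (1.0 - recovery) * total


rng = random.Random(111)  # fixed seed so runs are repeatable
exposure = simulate_expected_exposure(
    n_paths=2000, n_steps=12, dt=1.0 / 12, volatility=1.0, rng=rng
)
adjustment = cva(exposure, dt=1.0 / 12, hazard_rate=0.02,
                 recovery=0.4, discount_rate=0.01)
```

Each simulated path is independent of the others, which is why this kind of workload spreads naturally across a cluster.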

Hortonworks has a post on Apache Slider, a tool for deploying long-running applications on YARN. Slider, which is part of HDP 2.2, includes support for running Apache HBase, Apache Accumulo, and Apache Storm. The post describes how Slider works and some of the benefits of deploying a YARN application using Slider.

The Morning Paper is doing a series on consensus in distributed systems. Last week covered a number of classic papers (e.g. "Viewstamped Replication" and "Paxos Made Simple"), and next week will cover a paper on Apache ZooKeeper's protocol, a recent paper on the Raft consensus protocol, and more.

This post describes how to build a user-defined function (UDF) for Presto DB, the SQL engine open-sourced by Facebook. Qubole has provided an example project and describes Presto's UDF API.


The Apache Flink project, which is a large-scale data processing engine with a similar API to Apache Spark, is rapidly gaining new features and integrations. This post highlights recent additions to the Flink ecosystem such as integration with Apache SAMOA (incubating), the Flink graph API, support for HCatalog, and support for Kerberos-enabled Hadoop clusters.

The Qubole blog has a digest of 10 articles from the big data industry, covering everything from ODP to the maturity of Hadoop to a new framework for running C/C++ code on Hadoop.

The Call for Proposals for Strata + Hadoop World in New York is now open. Proposals are due by end of day (EDT) on April 7 for the conference which takes place Sept 29-Oct 1.

Cisco announced this week that they will resell software from Cloudera, Hortonworks, and MapR. Their provisioning tool supports all three distributions and is now generally available.

On the heels of Hortonworks' recent quarterly earnings report, there have been a number of articles analyzing the performance of the company and the Hadoop industry as a whole. InfoWorld looks at both, including comparisons between Cloudera and Hortonworks and between Hadoop vendors and the other big open-source company, Red Hat.

Nextgov has a story on the CIA's plans to deploy a Hadoop cluster (running Cloudera Enterprise) in Amazon Web Services' cloud for the intelligence community. In-Q-Tel, the not-for-profit venture capital firm for the intelligence community, is an investor in Cloudera.

The DBMS2 blog has a post on the Tachyon project (version 0.6 was released this week—more details below), which provides an in-memory distributed file system. The post gives a brief overview of Tachyon, Tachyon deployments, and some thoughts on when it's best to use Tachyon.


Tachyon released version 0.6.0 this week. The new version adds support for tiered storage (e.g. memory, SSD, HDD), adds Docker and Vagrant support, and uses Netty for data transfer.

Mortar (now part of Datadog, which curates the event section of this newsletter) has open-sourced some tools for working with JSON from Apache Pig. There is a JsonLoader for loading arbitrary files, and UDFs to apply (or infer) a schema for a given JSON dataset.

Airbnb has open-sourced Airpal, their web dashboard for running SQL queries using Facebook's PrestoDB. The tool includes table and query searching, access controls, the ability to track query progress, support for creating Hive tables from results, and more. Airpal is written using Dropwizard and React.js, and the code is available on GitHub.

The Kafka Mesos Framework provides the capability to run Kafka with Mesos. It includes command-line tools for adding brokers, inspecting status, and more. It's an early release (alpha quality), but a lot of functionality is built and documented in the repo readme.

Version 1.1.0 of Luigi, the workflow framework for big data tools, was released this week. The most notable improvement in this release is support for Python 3. Other improvements include better support for outputting data to S3 and documentation improvements.


Curated by Datadog



March SF Hadoop Users Meetup (San Francisco) - Wednesday, March 11

Spark or Hadoop: Is It an Either/Or Proposition? (Santa Monica) - Thursday, March 12

Myriad: Integrating Hadoop into the Datacenter (San Francisco) - Thursday, March 12

Spark DataFrames and ML Pipelines for Large-Scale Data Science (San Francisco) - Thursday, March 12


Spark Ecosystem & Spark Streaming Fundamentals (Bellevue) - Wednesday, March 11

Flume and Spark for Real-Time Ingest and Streaming (Seattle) - Thursday, March 12

Texas

Enterprise SQL at Hadoop Scale (Houston) - Wednesday, March 11


Apache Drill Overview & Demo (Minnetonka) - Wednesday, March 11


Vowpal Wabbit, Text Mining and Analysis with R and Using R with Hadoop (Grand Rapids) - Wednesday, March 11


Big Data Security Analytics with Apache Spark and GraphX (Vienna) - Tuesday, March 10

District of Columbia

Intro on Drill: Self-Service Data Exploration & Nested Data Analytics on Hadoop (Washington) - Tuesday, March 10

Hadoop with Python (Washington) - Tuesday, March 10


Cloudera Sessions: Kickstart Your Data Journey (Pittsburgh) - Thursday, March 12

New York

Hadoop in 2015: Implications for Data Professionals (New York) - Tuesday, March 10

Simplifying the Lambda Architecture (New York) - Thursday, March 12


Big Data and Hadoop: Just the Basics (Vancouver) - Thursday, March 12

Apache Mesos, Apache Hadoop, Apache Spark + Custom Enterprise Applications (Toronto) - Thursday, March 12


All You Can Eat with Hadoop (Bilbao) - Tuesday, March 10


Hands-On Spark (Paris) - Tuesday, March 10


Apache Spark: First Meetup! (Berlin) - Thursday, March 12


Introduction to Machine Learning with Spark (Zagreb) - Thursday, March 12


Building Real-Time Applications Using Spark (Bangalore) - Saturday, March 14

MapReduce along with Amazon EMR (Hyderabad) - Saturday, March 14


Spark Meetup (Shanghai) - Saturday, March 14
