Data Eng Weekly

Hadoop Weekly Issue #157

15 February 2016

This week's issue has a fantastic variety of content covering topics like workflow engines, new Spark features, distributed locking, secondary indexing in Cassandra, and S3 optimizations for Hive. In addition, there are a couple of exciting new tools for Python: a new scikit-learn integration for Spark and the new hdfs3 library. In news, there are funding/earnings updates on MapR, Hortonworks, and Trifacta.


Three open-source workflow frameworks have popped up over the past several years from Spotify (Luigi), Airbnb (Airflow), and Pinterest (Pinball). This post aims to give an overview and comparison of these three systems across the areas of architecture, scheduling, contrib (i.e. integration with data systems), source, and batteries included.

If you've worked with Hadoop in Amazon S3, you may have experienced slowness due to long job startup times. One common cause of that slowness is S3 "directory" listings, because various Hadoop components are optimized for interacting with HDFS rather than an object store. The Qubole blog describes, at a high level, how they've optimized S3 listings for Hive queries (which can lead to speedups as high as 75x).

Apache Spark 1.6 added the ability to create pivot tables from Spark DataFrames. Pivot tables (which transpose rows to columns and perform aggregation) are a feature of several popular tools (e.g. Excel, pandas). This article provides examples of using the new pivot tables for two different use cases/data sets: sales data and ratings/feature generation. There is also a discussion of tips/tricks and the implementation.
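To make the pivot operation concrete, here is a minimal pure-Python sketch of what a pivot does: group rows by one key, turn the distinct values of a second key into columns, and aggregate a third field into each cell. The `pivot` helper and the sample sales rows are hypothetical and only for illustration; they are not taken from the article. In Spark 1.6 itself, the DataFrame API expresses this as a `groupBy(...)` followed by `pivot(...)` and an aggregation.

```python
from collections import defaultdict


def pivot(rows, index, columns, values, agg=sum):
    """Pivot a list of dict rows (hypothetical helper, not Spark's API).

    index   -- field whose distinct values become the output rows
    columns -- field whose distinct values become the output columns
    values  -- field that is aggregated into each cell
    """
    # Collect every value that lands in the same (row, column) cell.
    cells = defaultdict(list)
    for row in rows:
        cells[(row[index], row[columns])].append(row[values])

    index_keys = sorted({key[0] for key in cells})
    column_keys = sorted({key[1] for key in cells})

    # Build the table; cells with no input rows are left as None.
    table = {}
    for i in index_keys:
        table[i] = {
            c: agg(cells[(i, c)]) if (i, c) in cells else None
            for c in column_keys
        }
    return table


# Made-up sales data: one row per sale.
sales = [
    {"region": "east", "quarter": "Q1", "amount": 100},
    {"region": "east", "quarter": "Q2", "amount": 150},
    {"region": "west", "quarter": "Q1", "amount": 80},
    {"region": "east", "quarter": "Q1", "amount": 20},
]

result = pivot(sales, index="region", columns="quarter", values="amount")
for region, row in result.items():
    print(region, row)
```

Each region becomes one output row and each quarter one column, with the amounts summed per cell. A distributed implementation additionally has to discover the distinct column values up front, which is presumably one reason the article spends time on tips and implementation details.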

The Google Dataflow framework has been in the news recently as it's been accepted into the Apache Incubator under the name Beam. To clear up any confusion with Hortonworks DataFlow (which itself is based on the Apache NiFi project), the Hortonworks blog has a post about the differences between Beam/Dataflow (abstraction layer for compute) and Hortonworks DataFlow (data movement and provenance system).

Distributed locking is a common pattern in distributed systems. This post describes the Redlock algorithm for Redis (for which the author describes some correctness problems), and it gives an overview of fencing, which is a common pattern used to implement locking (e.g. using ZooKeeper). While the post primarily focuses on Redlock, there are several follow-up posts including one (second link below) that details distributed locks and fencing in ZooKeeper.
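The idea behind fencing can be sketched in a few lines. This is a toy, single-process illustration under stated assumptions: the lock service hands out a monotonically increasing token with every grant, and the protected resource rejects any write carrying a token older than the newest one it has seen. `LockService` and `FencedStorage` are hypothetical names, lease expiry is only simulated, and none of this reflects the Redis or ZooKeeper APIs.

```python
class LockService:
    """Toy lock service: every grant comes with a larger fencing token."""

    def __init__(self):
        self._last_token = 0

    def acquire(self):
        # A real service would also track leases and expiry; this sketch
        # only models the monotonically increasing token.
        self._last_token += 1
        return self._last_token


class FencedStorage:
    """Toy resource that refuses writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            return False  # a newer lock holder has already written
        self.highest_token = token
        self.data[key] = value
        return True


lock = LockService()
store = FencedStorage()

token_a = lock.acquire()  # client A gets the lock, then stalls (e.g. a long pause)
token_b = lock.acquire()  # A's lease is assumed to have expired; B gets the lock

b_ok = store.write("report", "written by B", token_b)
a_ok = store.write("report", "written by A", token_a)  # A wakes up, still believing it holds the lock

print(b_ok, a_ok, store.data["report"])
```

The point of the design is that safety does not depend on client A noticing that its lease expired: the storage layer makes the decision by comparing tokens.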

The MapR blog has an article summarizing the main components, strengths, and weaknesses of three popular stream processing frameworks: Apache Storm, Apache Spark, and Apache Samza. The content is based on a presentation from last year's Strata+Hadoop World, so there are lots of figures for explaining the key concepts. The article also mentions Apache Flink, which wasn't part of the original presentation but has gained popularity since then.

Cloudera has published some new benchmarks comparing Apache Impala (incubating), Apache Spark SQL, and Apache Hive-on-Tez. Per the usual disclaimer, it's important to try out each system on your own data set (and there are typically also reasons other than performance to take into consideration). But in Cloudera's analysis based on queries derived from TPC-DS, they found that Impala is faster in both single-user and multi-user scenarios (Spark SQL was in second place). There are many more details and an analysis of when to use each system in the writeup.

Support for secondary indexes in a distributed database is often very limited—if it exists at all. This is the case in Apache Cassandra, but upcoming support for SSTable Attached Secondary Indexes (SASI) makes major improvements. An introductory blog post describes the motivation, gives background on the implementation, and compares the features of traditional and SASI indexes.

Netflix has created a tool for snapshotting the state of their online system to provide "time traveling" capabilities to train and evaluate machine learning algorithms. Called DeLorean, the system is built on Spark and supports running offline experiments and transitioning to online A/B testing.


CMSWire has an article about Hortonworks' recent quarterly earnings report and announcement of a secondary stock offering. It points out that they have plenty of cash left and that their earnings are up nearly 200% year-over-year.

TechRepublic has an in-depth look at why Apache Spark continues to gain so much popularity in the big data ecosystem. Drawing from previous content and a recent survey, the post notes that the hype may be over as people start to use Spark. The article contains an interview with Syncsort's general manager of big data, in which this and other trends are discussed.

In what is likely one of the first press releases with the phrase "data wrangling," Trifacta has announced a new round of financing. The $35 million of additional capital brings the total raised to $76 million.

There have been several posts celebrating Hadoop's 10-year mark, and the Cloudera blog has a post with some thoughts about where Hadoop will be after 10 more years. While the software is harder to predict (there are some new projects like the Apache Incubator projects Beam and Kudu), the lead time on hardware makes that space a bit easier to predict (e.g. Intel's 3D XPoint).

MapR has released information about their fourth quarter numbers and other highlights: billings are up over 100% annually, dollar-based expansion is up 146%, and their customer retention rate is 99%.


Databricks has released scikit-learn integration for Spark. The package provides a new execution framework that works both on a local machine and across multiple machines without any changes to user code. The introductory post has an example of using a random forest classifier.

Apache Apex 3.3.0-incubating was released this week. The new version adds support for iterative processing, modules, and a new callback API.

Version 0.1.0 of hdfs3, a Python wrapper for libhdfs3 (the C/C++ library for HDFS), was released this week.

Apache Flink 0.10.2 was released this week. The release fixes over 20 issues.

BlueData has announced a new version of their automation software based around Kafka, Spark Streaming, and Cassandra. In addition to those, the system supports Zeppelin. The announcement post has more details and several screenshots of the BlueData system.

Apache Hadoop 2.6.4 was released this week. It contains fixes across HDFS, MapReduce, YARN, and Hadoop Common.


Curated by Datadog



Scalable Elasticsearch-Spark Connector, Spark SQL/DataFrames, DataSource API (San Francisco) - Monday, February 15

Running Spark Clusters in Containers with Docker (Sunnyvale) - Tuesday, February 16

51st Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, February 17

Stream Processing Architecture and Applications: Apache Apex (Sunnyvale) - Wednesday, February 17

February Samza Meetup (Mountain View) - Wednesday, February 17

Ensuring Data Privacy and Security on Hadoop (Palo Alto) - Thursday, February 18


A Deeper Look into SparkSQL, DataFrames, and Data Sources w/ IBM and Galvanize (Denver) - Wednesday, February 17


Hadoop as a Service, by Ajay Jha (Houston) - Tuesday, February 16

Tim Renner Presents on Streaming with Storm, Spark, and Kafka (Austin) - Wednesday, February 17

Process Information into and across Hadoop at High Speed (Coppell) - Thursday, February 18


Modern Data Architecture Using Flink and Hadoop (Chicago) - Tuesday, February 16

New York

Spark Summit Committer Night (New York) - Tuesday, February 16

Securing Apache Spark on Production Hadoop Clusters (New York) - Wednesday, February 17

Enabling Python to Become a Better Big Data Citizen w/ Wes McKinney (New York) - Wednesday, February 17


Hadoop Meetup (Paris) - Tuesday, February 16


The 2 Worlds of Big Data, Batch & Stream: Use Cases, Challenges, and Solutions (Munich) - Wednesday, February 17


Kudu + Distributed Recommender Engines (Krakow) - Thursday, February 18


Spark War Stories: What's Really Painful (Tel Aviv-Yafo) - Wednesday, February 17

Russia

Stream Computing: Developing a Real-Time Analytics (Moscow) - Thursday, February 18


Data Science on H2O Sparkling Water + Apache Spark Performance Tuning (Melbourne) - Tuesday, February 16