Data Eng Weekly

Hadoop Weekly Issue #159

28 February 2016

The theme of this week's newsletter is stream processing: from streaming SQL to streaming in Spark 2.0 to Spark Streaming with Amazon Kinesis to IBM's Quarks framework. On top of that, there's interesting content covering Airbnb's data infrastructure and Drillix, as well as lots of releases (Apache HBase, CaffeOnSpark from Yahoo, and more). And finally, as the month comes to a close, there are two CFPs ending in the next 48 hours.


The Apache Calcite project has been working to add support for streaming SQL, which could be integrated with systems like Apache Samza, Apache Flink, and Apache Storm. This presentation introduces some of the core concepts for streaming SQL, such as windowing and grouping, and gives several examples of streaming SQL queries (including a few which join against regular tabular datasets).
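To make the windowing and grouping idea concrete, here is a minimal sketch in plain Python. It is not Calcite's API or syntax. It only illustrates what a tumbling (fixed, non-overlapping) window grouped by a key computes. The function name, the event layout, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs (a hypothetical layout).
    """
    counts = Counter()
    for timestamp, key in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical order events: (timestamp in seconds, product name).
orders = [(0, "apples"), (30, "pears"), (59, "apples"), (60, "apples"), (125, "pears")]
per_minute = tumbling_window_counts(orders, 60)
```

A streaming SQL engine would emit each window's rows once the window closes instead of holding everything in memory, but the grouping logic is the same in spirit.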

On the topic of real-time stream processing, this presentation describes the future plans for stream processing in Spark 2.0. The new version will introduce the notion of infinite DataFrames to provide a streaming API using the Spark SQL engine. The slides introduce the model, give examples of ETL and a streaming page view count, describe more about the solution, and give a timeline for Spark 2.0/2.1+ features.
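As a rough illustration of the streaming page view count, the sketch below treats the stream as an unbounded sequence of small batches and keeps a running result that is updated as each batch arrives. It is plain Python, not the Spark 2.0 API (which had not been released at the time). The function name and sample data are hypothetical.

```python
from collections import Counter


def running_page_views(batches):
    """Yield a snapshot of cumulative page view counts after each incoming batch."""
    totals = Counter()
    for batch in batches:
        # Fold the new batch into the running totals.
        totals.update(batch)
        yield dict(totals)


# Hypothetical batches of page URLs arriving over time.
batches = [["/home", "/about"], ["/home"], ["/home", "/pricing"]]
snapshots = list(running_page_views(batches))
```

Each snapshot plays the role of the continuously updated result table. With a truly unbounded source, the generator would simply never finish.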

In another post about the upcoming Spark 2.0 release, this article covers the various memory, code generation, and vectorization optimizations planned. The post recaps the motivation and describes in detail the main components of Tungsten, the initiative for implementing these improvements.

In another SQL presentation, this time on non-streaming SQL, Apache Drill and Apache Phoenix are combined into a new system called Drillix. Drillix is powered by Apache Calcite (which is used by both Drill and Phoenix) and the recently announced Apache Arrow. The bulk of the presentation covers Drill—both the user-facing features and the underlying technology components. But the slides also give an introduction to Calcite, Drillix, and Arrow.

The primary goal of Apache Arrow is to make data interoperable across tools and languages. This post talks about how Arrow might help improve performance and interoperability of the Python framework pandas.

Airbnb has written a long post about the current state and evolution of their data infrastructure. It mentions the tools they've previously used (such as EMR, Mesos and EBS for HDFS storage), why they moved away from those systems (cost, efficiency, lack of visibility), and the new tools and systems they're using (such as Airpal, Presto, and Kafka). There's also insight into AWS-specific details, such as networking configuration (run in a single availability zone) and AWS instances (r3.8xlarge instances for Spark and d2.8xlarge instances for HDFS).

Cloudera has a post about using Cloudera Director, their framework for scaling Hadoop clusters in the cloud, as part of a data pipeline. Specifically, there are example scripts (using wget and jq) for starting a cluster, running a job, and terminating a cluster.

If you're using Ambari Metrics, this post is a useful guide for tuning the collector and other components for a production deployment.

The AWS blog has a post introducing Spark Streaming with Amazon Kinesis. The article covers topics like building a Spark cluster using EMR, dynamic resourcing, failure recovery (checkpointing to DynamoDB), and more. There's lots of sample code included to help get started on a new project.


The call for papers/presentations for two upcoming conferences ends this week. The CFP for HBase closes on the 28th and for Spark Summit on the 29th.

MapR announced a few changes and additions to its management team this week. In the context of some of the changes (such as a new CMO), the post describes how MapR sees and positions itself in contrast to other vendors like Cloudera and Hortonworks.

This post summarizes what to expect in the forthcoming Spark 2.0. The two main features are (as mentioned in technical posts above) a new stream processing framework and Tungsten for more efficient memory management. Another important part of the release is various housekeeping improvements, such as cleaning up and consolidating APIs.

IBM has submitted Quarks to the Apache Incubator. Quarks is a new type of stream processing system built for the Internet of Things. In particular, it supports pushing computation to the "edge" components, such as embedded devices.


Yahoo has open sourced CaffeOnSpark, a framework for deep learning on Spark. CaffeOnSpark's APIs support DataFrames and provide a number of built-in data sources (such as LMDB and images stored in sequence files). Along with demoing the API, the introductory post shows how to run a CaffeOnSpark job using the spark-submit command-line tool.

Amazon has announced that Amazon EMR now supports EBS-backed instances, such as the next-gen M4 and C4 instance families.

Last week, I mistakenly stated that Apache Accumulo 1.7.0 was released. The new release was actually 1.6.5, and this week there's also a 1.7.1 release. Both are maintenance releases, resolving 55 and 150 issues, respectively. The 1.7.1 release has some important fixes, which are noted in the release note highlights.

WANdisco has announced Active Migrator for Google Cloud Dataproc. The system provides continuous replication from a data center to the Google cloud.

Google's Cloud Dataproc, its Hadoop and Spark as a service, is out of beta. In the announcement, Google touts the low cost, speed, and management aspects of Cloud Dataproc.

Apache HBase 1.2.0 was released this week. The release adds support for JDK8, support for Hadoop 2.6.1+ and 2.7.1+, dynamic configuration reloads, and much more. In all, there are over 600 issues resolved as part of the release.


Curated by Datadog



Hadoop Daylong Training and Networking Lunch (Palo Alto) - Monday, February 29

Spark as Part of a Hybrid RDBMS Architecture (San Francisco) - Tuesday, March 1

March Impala Meetup (San Francisco) - Tuesday, March 1

Spark as Part of a Hybrid RDBMS Architecture (Los Angeles) - Wednesday, March 2


Stream Processing with Apache Apex: Next Gen Native Hadoop Big Data Platform (Tempe) - Wednesday, March 2


Large Scale ML Workshop #3: Hive and Mahout (Austin) - Wednesday, March 2


St. Louis Hadoop Users Group Meetup (Saint Louis) - Wednesday, March 2


Real-Time Recommendations Using Spark (Chicago) - Tuesday, March 1


Hadoop Next Gen: Using Cloudera, Kudu & Mozilla Rust to Crunch Big Data! (Troy) - Tuesday, March 1


Spark Meetup 2016 #1 (Mason) - Wednesday, March 2


Extending the Yahoo! Streaming Benchmark & Winning Twitter Hack-Week with Flink (Vienna) - Tuesday, March 1

New York

Extending the Yahoo! Streaming Benchmark & Winning Twitter Hack-Week with Flink (New York) - Thursday, March 3


Spark Summit East Rewind (Vancouver) - Monday, February 29


Spark Meetup (London) - Wednesday, March 2

WTF Is Flink? (London) - Thursday, March 3

Data Enthusiasts, London v 5.0 (London) - Thursday, March 3


Data Munging with Spark, Part I (Toulouse) - Thursday, March 3


Graph Fun with Apache Flink & Neo4j (Berlin) - Wednesday, March 2

Munich Datageeks (Munich) - Thursday, March 3


Kick Off Spark Budapest! (Budapest) - Monday, February 29


Lambda Architecture, Apache Kafka, GraphX (Cluj-Napoca) - Wednesday, March 2


Big-Data Architecture & Spark on AWS (Tel Aviv-Yafo) - Wednesday, March 2


Big Data Ingestion with Kafka & Windowing in Apache Apex for Streaming Apps (Pune) - Wednesday, March 2