Data Eng Weekly

Hadoop Weekly Issue #217

21 May 2017

This week has several great technical posts, including coverage of Spark, Kafka, Druid, and Hive. There's also a link to the recent paper on Google's Spanner and a new tool for writing streaming applications against data in Kafka with Go. Finally, there are several new releases this week, including Samza, Beam, and Spring Cloud DataFlow.


This post provides a walkthrough of using Amazon EMR with Apache Spark. It shows how to use both the command line and the AWS console to spin up a cluster and run some Spark jobs.

The Hortonworks blog has a post that motivates the need for a shared schema registry, especially for streaming applications. They are planning on shipping, as part of the next HDF release, their own schema registry that will eventually integrate with Apache Atlas and Apache Ranger in addition to Kafka.

Cloudera has recently integrated Apache Kafka, Apache Spark, and Apache Ranger to provide encryption and authorization for Spark jobs interacting with Kafka. This post describes the implementation and the reasoning behind some of the design decisions.

The Hortonworks blog has a post that demos the row and column-level access control and data masking support in Hive and Spark SQL via Apache Ranger.

After last week's paper from the Amazon Aurora team, this week I'm including a link to the Google Spanner paper. Published as part of the SIGMOD '17 proceedings, this paper focuses on the query planning and execution components of Spanner (vs. the original paper that focussed on its fault tolerance, scalability, etc.).

The Pyrocast blog has a post on event stream processing. It describes some key insights into the logic behind the push to an event log for capturing state vs. a relational database. The post introduces the notion of "Data Loss by Design" and points out that a traditional database doesn't let you scale reads independently from writes (among other limitations). With a log-centric architecture, though, "each query gets it own schema" and reads can be decoupled/scaled separately from writes.
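To make the log-centric idea a bit more concrete, here is a minimal in-memory sketch in Python. It is not taken from the Pyrocast post: the names `EventLog`, `balances_view`, and `activity_view` and the event fields are made up for illustration. Writes only ever append to the log, and each read model is derived by replaying it, so every query can keep its own schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class EventLog:
    """Append-only log acting as the single write path (hypothetical, in-memory)."""
    events: List[Dict[str, Any]] = field(default_factory=list)

    def append(self, event: Dict[str, Any]) -> int:
        """Record an event and return its offset in the log."""
        self.events.append(event)
        return len(self.events) - 1

    def replay(self, fold: Callable[[Any, Dict[str, Any]], Any], initial: Any) -> Any:
        """Rebuild a read model by folding over every event in order."""
        state = initial
        for event in self.events:
            state = fold(state, event)
        return state


def balances_view(log: EventLog) -> Dict[str, int]:
    """Read model 1: current balance per account."""
    def fold(state, event):
        sign = 1 if event["type"] == "deposit" else -1
        state[event["account"]] += sign * event["amount"]
        return state
    return dict(log.replay(fold, defaultdict(int)))


def activity_view(log: EventLog) -> Dict[str, int]:
    """Read model 2: number of events per account (a different schema, same log)."""
    def fold(state, event):
        state[event["account"]] += 1
        return state
    return dict(log.replay(fold, defaultdict(int)))


log = EventLog()
log.append({"type": "deposit", "account": "a", "amount": 100})
log.append({"type": "withdraw", "account": "a", "amount": 30})
log.append({"type": "deposit", "account": "b", "amount": 5})

balances = balances_view(log)
activity = activity_view(log)
```

Because the views are pure functions of the log, more of them can be added (or run on separate readers) without touching the write path. A real system would keep the log in something like Kafka and update views incrementally rather than replaying from the start each time.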

The Databricks blog has a post on some of the operational concerns for Spark's Structured Streaming, particularly related to observability. There are new APIs to get the status, get recent progress (including input and processing rates), and the ability to track progress via callback. The post also has a quick overview of alerting, failure recovery, and updates.

This post describes how to build an OLAP table in Druid from data in Apache Hive. It's the second part in a series on integrating Hive with Druid, and the first one has some more context for when or why this might be a good idea.


The Data Platforms 2017 conference is this week in Phoenix. A press release on the Qubole blog mentions several of the speakers.


Goka is a new project for writing stream processing applications in Go, backed by data in Kafka.

Apache Flink has published Docker images to Docker Hub. The companion documentation has examples of multi-node clusters using Docker Compose and Docker Swarm.

Version 0.13 of Apache Samza has been released. Samza was an early entry into the distributed stream processing space, but it has had a more low-level programming model than other systems. This release adds a new high-level API, support for rolling upgrades, improved failure detection, and more.

Spring Cloud DataFlow is a data processing system that supports both streaming and batch. There are a number of connectors and features, including support for the reactive programming model and new experimental support for Kafka Streams in the 1.2 release. Lots more features to learn about in the release notes.

Version 1.1.0 of Apache CarbonData, the columnar data format, has been released. Major features of the release include a new V3 data format that improves scan performance and a batch sort operation to speed up data loading.

Apache Beam 2.0.0 was released this week. Highlights of the release include support for stateful data processing, support for file systems (with built-in support for HDFS), and a metrics system. The major version also signifies API stability for all future releases on the 2.x branch.


Curated by Datadog



Apache Kafka and the Rise of Real-Time, with Neha Narkhede (Hollywood) - Monday, May 22

Stream Processing with Apache Kafka and Apache Samza (Sunnyvale) - Wednesday, May 24

Unified, Efficient, and Portable Data Processing with Apache Beam (Santa Clara) - Wednesday, May 24

Netflix's Performance Optimization of Recommendation Pipeline Using Spark SQL (Los Gatos) - Thursday, May 25

North Carolina

Building Next-Generation Systems of Record with MapR Streams (Charlotte) - Thursday, May 25


Stream Analytics with SQL on Apache Flink (London) - Tuesday, May 23

DevOps & Data (Glasgow) - Wednesday, May 24

Kubernetes and Kafka for Fun and Profit (London) - Wednesday, May 24

TensorFlow + Kubernetes + Spark + ElasticSearch + Beam + BigQuery + Dataflow (London) - Wednesday, May 24

Spark Strata Highlights! (London) - Thursday, May 25


The Discovery of Kafka (Toulouse) - Tuesday, May 23


"Fast Data in Supply Chain Planning" by Jeroen Soeters (Rotterdam) - Tuesday, May 23


"Fast Data in Supply Chain Planning" by Jeroen Soeters (Halle) - Wednesday, May 24


Big Data Analytics in Real Time (Tel Aviv-Yafo) - Wednesday, May 24

Kafka & Mongo: The Master of Clusters (Tel Aviv-Yafo) - Wednesday, May 24


Kafka Meetup @ Linkedin (Bangalore) - Saturday, May 27

Spark Streaming (Bangalore) - Saturday, May 27
