Data Eng Weekly

Hadoop Weekly Issue #232

10 September 2017

While last week's issue had a bunch of releases, this week is full of great technical content covering Apache Beam, Apache Spark, Apache BookKeeper, Apache Cassandra, Apache Kafka, Heron, AWS Glue, and more. If you can only catch one or two articles this week, the Beam and Kafka at NYTimes articles are especially great reads (but really, it's all great stuff!).


This post provides a great overview of the Apache Beam API. It shows how to use the APIs for stateful stream processing to batch RPC calls that enrich events and to respond to event-time or processing-time timers. The API is straightforward and there are a number of graphics to explain the concepts. Interestingly, the event-time timer can be used to process both historical and real-time data.
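To make the batching idea concrete, here is a minimal, library-free sketch in Python. It does not use the Beam API; `BatchingEnricher`, `fake_rpc`, and the parameter names are hypothetical stand-ins for per-key state, an enrichment RPC, and an event-time timer. Because the timer is driven by the watermark rather than the wall clock, the same logic would apply to replayed historical data and to live data.

```python
class BatchingEnricher:
    """Buffers events and flushes them in batches (hypothetical sketch, not Beam)."""

    def __init__(self, rpc, max_batch_size=3, max_wait=10):
        self.rpc = rpc                      # callable taking a list of events
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait            # event-time units to wait before flushing
        self.buffer = []                    # stands in for per-key state
        self.timer = None                   # event-time deadline, None when unset
        self.output = []

    def process(self, event, event_time):
        if not self.buffer:
            # The first element of a new batch arms the event-time timer.
            self.timer = event_time + self.max_wait
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch_size:
            self._flush()

    def advance_watermark(self, watermark):
        # The timer fires once the watermark reaches the deadline.
        if self.timer is not None and watermark >= self.timer:
            self._flush()

    def _flush(self):
        if self.buffer:
            # One RPC call for the whole batch instead of one per event.
            self.output.extend(self.rpc(self.buffer))
        self.buffer = []
        self.timer = None


def fake_rpc(batch):
    # Placeholder for a real enrichment service.
    return [f"{event}:enriched" for event in batch]


enricher = BatchingEnricher(fake_rpc)
for t, e in enumerate(["a", "b", "c", "d"]):
    enricher.process(e, event_time=t)

# "a", "b", "c" were flushed by size; "d" waits for its timer (3 + 10 = 13).
enricher.advance_watermark(12)
pending = list(enricher.buffer)
enricher.advance_watermark(13)
```

After the final watermark advance, the leftover event is flushed by the timer rather than by the size threshold.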

Streamlio has a two-part introduction to Heron, the stream processing engine. Built with Apache Storm compatibility in mind, Heron has similar semantics but adds a high-level functional API, supports several scheduling modules (including Mesos, YARN, and Kubernetes), offers configurable delivery semantics (at-most-once, at-least-once, effectively-once), brings operational maturity, and more.

This post looks at how the Spark DAGScheduler uses data locality to schedule tasks. When the preferred locality isn't achieved, a task can be scheduled on a different node, but only after the spark.locality.wait time has expired. For time-sensitive streaming applications, this can be an important setting to tweak.
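As a rough illustration of tweaking that setting, the sketch below builds `--conf` arguments for spark-submit. The property names are real Spark settings (`spark.locality.wait` defaults to 3s), but the values and the helper `to_spark_submit_args` are hypothetical; suitable values depend on the workload.

```python
# Real Spark property names; the values are illustrative assumptions only.
locality_settings = {
    # Global wait before falling back to a less-local level (Spark default: 3s).
    "spark.locality.wait": "1s",
    # Optional override for the node-local level only.
    "spark.locality.wait.node": "500ms",
}


def to_spark_submit_args(settings):
    """Render a settings dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_spark_submit_args(locality_settings)))
```

Lower values trade data locality for lower scheduling delay, which is usually the concern in a streaming job.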

This post is a great introduction to Apache Cassandra, including the data model, a comparison to RDBMS, consistency tradeoffs, and several example applications (including source code from Insight Data Science projects).

StreamSets has a walkthrough of running the StreamSets Data Collector using Docker.

The Confluent blog has a great piece from the New York Times about their new "monolog" that stores all content ever published by the newspaper and website. While the monolog is a normalized, single-partition Kafka topic, there is also a denormalized topic which is multi-partition and can be used to build up things like Elasticsearch indices. There's a lot of talk of immutable logs and the kappa architecture, but this is one of the best articles about a non-trivial use case.

This post has a look at Single Message Transforms (SMTs) for Kafka Connect. SMTs are configuration-based, set up entirely in JSON config files, and can do simple operations like extracting values, masking fields, inserting lineage information, and more.
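For a sense of what such a configuration might look like, the sketch below builds the transform portion of a connector config in Python and prints it as JSON. The transform classes (`MaskField$Value`, `InsertField$Value`) and their option names ship with Apache Kafka; the aliases `mask` and `lineage`, the field names, and the helper `with_transforms` are hypothetical. It assumes the connector emits structured record values, since these transforms operate on fields.

```python
import json

# Transform settings to merge into an existing connector config.
# Aliases and field names are made up for illustration.
smt_settings = {
    "transforms": "mask,lineage",
    # Replace the value of a sensitive field with a type-appropriate blank.
    "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.mask.fields": "ssn",
    # Add a static field recording where the record came from.
    "transforms.lineage.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.lineage.static.field": "data_source",
    "transforms.lineage.static.value": "example-connector",
}


def with_transforms(base_config, transforms):
    """Return a copy of base_config with the transform settings added."""
    merged = dict(base_config)
    merged.update(transforms)
    return merged


# A placeholder base config; a real one would carry the connector's own settings.
base = {"name": "example-connector"}
config = with_transforms(base, smt_settings)
print(json.dumps(config, indent=2))
```

Transforms run in the order listed under `transforms`, so here masking would happen before the lineage field is inserted.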

This post has a great collection of books, papers (including seminal ones such as Amazon's Dynamo and Google's Chubby), and industry posts. It also has great tips on how to decide whether or not to read a paper and how to read one once you've chosen to go for it.

This tutorial describes how to use AWS Glue with Amazon Kinesis to implement the lambda architecture (hybrid batch and stream processing). Glue, as you may remember, is built on Spark and stores metadata about data sets in a metastore. Thus, data that lands in S3 can then be queried using other AWS services like Athena.

Streamlio has been busy writing great articles on the technology in their stack. This post has a look at BookKeeper, which is a storage system built atop ZooKeeper primarily for use as a write-ahead log. The article provides an overview of BookKeeper including key concepts and terminology, the APIs for interacting with BookKeeper, and a typical deployment.

This post describes Stocator, which is a high-performance object store connector for Spark. The post also discusses the state of blob store support in Hadoop and describes the limitations of the swift and s3a clients due to inherent consistency tradeoffs in the backing systems.


Cloudera announced second quarter earnings this week as well as the acquisition of Fast Forward Labs.

MapR, which had been expected to go public, has announced that it has taken $56 million in a round led by Lightspeed Venture Partners. MapR also announced strong growth in quarterly billings.

The first batch of speakers for DataEngConf has been announced. The conference takes place on October 30-31 in New York, and the early bird pricing ends on September 22nd.

The Call For Papers for Big Data Tech Warsaw, which takes place in February, closes on October 16th. The conference has four tracks and expects over 500 attendees.


Version 2.2.0 of Apache Bahir extensions for Apache Spark adds connectors for Akka, Apache CouchDB, and Google Cloud Pub/Sub.


Curated by Datadog



Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Tuesday, September 12

Apache Ignite: The In-Memory Hammer in Your Data Science Toolkit (Mountain View) - Wednesday, September 13

Laying Down the SMACK on Your Data Pipelines (San Francisco) - Thursday, September 14


Overview of Apache NiFi (Austin) - Wednesday, September 13

PySpark Workshop by Meghann Agarwal (San Antonio) - Thursday, September 14


Applying Machine Learning to Real-Time Sensor Data on Spark and Kafka (Kansas City) - Tuesday, September 12


Cleveland Big Data Mega Meetup (Cleveland) - Monday, September 11


Divide, Distribute, and Conquer: Stream vs Batch (Malvern) - Wednesday, September 13

IRELAND

Spark and Hadoop in Risk Line of Business at Bank of America (Dublin) - Thursday, September 14


Spark Streaming: Writing a Machine Learning Streaming App with Spark 2.2 (Bristol) - Tuesday, September 12


Kafka Meetup (Neuilly-Sur-Seine) - Wednesday, September 13


Kafka Streams and More (Berlin) - Thursday, September 14


Apache Kafka from Zero to Hero (Tel Aviv-Yafo) - Wednesday, September 13