Data Eng Weekly

Hadoop Weekly Issue #222

03 July 2017

While this week's issue is a day behind schedule, it should be worth the wait. There's lots of great content on stream processing—from Google Cloud Dataflow to Apache Kafka to Twitter's Heron. There are also excellent posts on SIMD in Presto, YARN support for cgroups, and securing Hadoop with LDAP accounts... and much more!


There's lots of news this week on the new exactly-once semantics in Apache Kafka. This month-old post looks at another exactly-once implementation, this one from the Google Cloud Dataflow team. It describes some of the key problems and solutions, including how Dataflow handles non-deterministic user code.

One of the advantages of columnar formats like Apache Parquet and Apache ORC is that CPUs can vectorize operations using SIMD when processing entire chunks of a column. This post from the Prestodb blog gives an overview of SIMD and walks through how the JVM compiles (or fails to compile!) a for-loop into SIMD instructions.
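To make the for-loop point concrete, here is a minimal sketch of two loops over column-like arrays. It is not taken from the linked post, and the class and method names are made up for illustration. The first loop has independent iterations, which is the shape JIT auto-vectorizers generally handle well. The second carries a dependency from one iteration to the next, which typically prevents straightforward vectorization. Whether a given JVM actually emits SIMD instructions depends on its version, flags, and the CPU.

```java
// Illustrative sketch only: names are hypothetical, and whether the JIT
// vectorizes either loop depends on the JVM version, flags, and hardware.
final class VectorizableSum {

    // Counted loop over primitive arrays with no dependency between
    // iterations. Assumes all three arrays have the same length.
    static void addColumns(long[] a, long[] b, long[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Each iteration reads the value written by the previous iteration,
    // so the iterations cannot simply run side by side in SIMD lanes.
    static void prefixSum(long[] values) {
        for (int i = 1; i < values.length; i++) {
            values[i] += values[i - 1];
        }
    }

    public static void main(String[] args) {
        long[] out = new long[4];
        addColumns(new long[] {1, 2, 3, 4}, new long[] {10, 20, 30, 40}, out);
        System.out.println(java.util.Arrays.toString(out));

        long[] running = {1, 2, 3, 4};
        prefixSum(running);
        System.out.println(java.util.Arrays.toString(running));
    }
}
```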

The Cloudera blog has an in-depth look at how to use deeplearning4j with Apache Spark. The majority of the post describes the setup, which includes applying a pre-trained model from the ImageNet competition.

This tutorial describes how to use the conda Python distribution with Apache Spark to take advantage of Python libraries like nltk and numpy.

This post gives a good overview of the trade-offs proposed in the CAP theorem, and it describes how Cockroach achieves CP without making a major sacrifice in availability.

The morning paper has two posts this week on Twitter Heron, the stream processing framework from Twitter. The first covers the process of open-sourcing Heron, including the high-level architecture, modularization, and a comparison with Storm and Spark Streaming. The second is on Dhalion, a system that aims to self-tune a streaming job to match a service level objective, to be auto-scaling, and to be self-healing.

S3Guard is a tool that uses DynamoDB to store metadata about files in S3 to provide a consistent view of data (otherwise jobs over data in S3 can end up processing the wrong set of input files). This post looks at the nuances of the problem it solves and how to enable it in Hortonworks HDP.
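The idea of a consistent view can be sketched with a toy model. This is not S3Guard code and the names are hypothetical: it assumes every write also records the path in a strongly consistent metadata table, and it merges that table with a possibly stale object-store listing so that a job sees recently written files. Deletes and other real-world details are left out.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory model, not the S3Guard implementation. All names are hypothetical.
final class ToyConsistentListing {

    // Merges a possibly stale object-store listing with the paths that were
    // recorded in a consistent metadata table at write time.
    static Set<String> list(Set<String> storeListing, Set<String> metadataPaths) {
        Set<String> merged = new TreeSet<>(storeListing);
        merged.addAll(metadataPaths);
        return merged;
    }

    public static void main(String[] args) {
        // The store's listing has not caught up with the second file yet.
        Set<String> staleListing = new TreeSet<>(Arrays.asList("data/part-0000"));
        Set<String> metadata = new TreeSet<>(Arrays.asList("data/part-0000", "data/part-0001"));
        System.out.println(list(staleListing, metadata));
    }
}
```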

Google Cloud Dataflow has announced a new "service-based" shuffle that moves the shuffle operation out of the worker VMs and into a managed service. This speeds up execution of jobs by as much as 5x and requires less tuning by the developer.

The IBM Hadoop Dev blog has a good post on how Apache YARN leverages Linux cgroups. It discusses soft and hard limits, configuration for vcores, enabling YARN's CPU scheduling, and several common scheduling scenarios.

The major new feature of Apache Kafka is exactly-once semantics that are made possible by an idempotent producer and a new transaction API. A post on the Confluent blog has an overview of the key challenges/failure scenarios, the details of how Kafka implemented the necessary features, the development process that led to the new features, a performance analysis, and more.
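For a feel of what an idempotent producer buys you, here is a toy in-memory model of sequence-number deduplication. It is not Kafka code and every name in it is hypothetical. It assumes each producer has an id and stamps its records with increasing sequence numbers, so a log can recognize a retried send and drop it instead of appending it twice. Transactions, partitions, and failure handling are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model, not Kafka code. All class and method names are hypothetical.
final class ToyPartitionLog {
    private final List<String> records = new ArrayList<>();
    // Highest sequence number appended so far, keyed by producer id.
    private final Map<Long, Integer> lastSequence = new HashMap<>();

    // Appends the record unless this producer already wrote this sequence number.
    // Returns true if the record was appended, false if it was a duplicate retry.
    boolean append(long producerId, int sequence, String value) {
        int last = lastSequence.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // retry of a record that is already in the log
        }
        if (sequence != last + 1) {
            throw new IllegalStateException("gap before sequence " + sequence);
        }
        records.add(value);
        lastSequence.put(producerId, sequence);
        return true;
    }

    // Returns a copy so callers cannot modify the log.
    List<String> records() {
        return new ArrayList<>(records);
    }

    public static void main(String[] args) {
        ToyPartitionLog log = new ToyPartitionLog();
        log.append(7L, 0, "order-created");
        // Simulated retry after a lost acknowledgement: same id and sequence.
        log.append(7L, 0, "order-created");
        log.append(7L, 1, "order-paid");
        System.out.println(log.records());
    }
}
```

In this sketch the retried record ends up in the log once, because the second append with the same producer id and sequence number is dropped.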

The AWS big data blog has an epic tutorial that describes hooking up DynamoDB with Kinesis, Lambda, Athena, and QuickSight. Once Kinesis Streams and Lambda mirror the data to S3, Athena can read it for analysis.

If you're a distributed systems geek or just want to read more about the exactly-once features that were added to Kafka, these are two more great posts about the nuance of "exactly once" and why the accomplishment is so impressive.

This post gives a good overview of how to configure Hadoop Security with LDAP groups. It includes how to use ldapsearch to figure out the various configuration settings as well as how to specify filters and attribute mappings. The post has a few examples of how to create users and more.


At DataWorks Summit, there was an executive panel at the Women in Big Data Luncheon. This post has a quick recap of the event including great advice from several of the panelists.

The schedule for the upcoming Kafka Summit, which takes place in August in SF, has been posted online.

On the heels of DataWorks Summit in San Jose, early bird pricing for the next conference, in Sydney, is about to end. The conference takes place on September 20-21.


The team at Cask has announced a new CDAP Cloud Sandbox on AWS, which is a fully functional CDAP limited to a single node.

kafka-streams-clojure implements the transducers interface. The library is alpha-quality and under development.

This is the official release announcement for Apache Kafka. There are a number of new features, including record headers, per-connector/task classloaders, exactly-once semantics for streams, and a number of improvements.

Version 1.1.0-incubating of Apache Fluo was released. Fluo is an implementation of Google's Percolator for Apache Accumulo. The release improves scalability and the Spark integration, and adds a new API for configuring observers.


Curated by Datadog



Push the Limits of Kafka & Streaming Analytics With Hierarchical Temporal Memory (San Francisco) - Thursday, July 6


Spark Talks: Introduction to Spark Deployment (Austin) - Wednesday, July 5


Data Science User Group Monthly Meetup (Champaign) - Friday, July 7


NLP in q + Hardening Kafka for New Use Cases (Montreal) - Tuesday, July 4

IRELAND

Stream Processing with Apache Kafka & Real-Time Data Integration at Scale (Dublin) - Tuesday, July 4


Inside Core Infra (Amsterdam) - Thursday, July 6


Apache Flink: Stateful Stream Processing + Flink CEP Library (Warsaw) - Monday, July 3


Migrating to Spark 2.0, Part 2 (Bangalore) - Saturday, July 8

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit