Data Eng Weekly

Hadoop Weekly Issue #244

17 December 2017

This week's issue is a big one, both in content (12 technical articles!) and because Apache Hadoop hit version 3.0. Apache Flink, Apache HBase, and Apache Knox also had releases, and there are great posts on Apache Kafka, Heron, Apache BookKeeper, and stream processing at CERN.


Netflix has developed a new Heterogeneous Cluster Allocation strategy to replace Uniform Consistent Hashing in their content distribution clusters. The post describes both strategies. Consistent Hashing is used in many distributed systems, such as Apache Cassandra, and provides advantages when node membership changes (failure, addition, etc.).
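
For readers who haven't run into consistent hashing before, here is a minimal sketch of the general technique. It is not Netflix's implementation: the `HashRing` class, the use of MD5, and the 100 virtual nodes per server are all assumptions made for illustration. The point it demonstrates is the membership advantage mentioned above: when a node leaves, only the keys that node owned are reassigned.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only as a convenient, stable hash (an assumption of this sketch).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # points placed on the ring per node
        self._hashes = []      # sorted ring positions
        self._owners = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owners[h] = node

    def remove(self, node: str) -> None:
        # Drop every ring position that belongs to the departing node.
        self._hashes = [h for h in self._hashes if self._owners[h] != node]
        self._owners = {h: n for h, n in self._owners.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._hashes:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    keys = [f"title-{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.remove("cache-b")
    moved = sum(1 for k in keys if ring.lookup(k) != before[k])
    owned_by_b = sum(1 for k in keys if before[k] == "cache-b")
    print(f"keys moved: {moved}, keys previously on cache-b: {owned_by_b}")
```

With a plain `hash(key) % num_nodes` scheme, removing one node would remap most keys. On the ring, the two counts printed above match, because a key only moves if its owner disappeared.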

Apache Kylin is an OLAP database system for big data. It supports JDBC drivers, which can be used to run queries from Hue, including on an Amazon Elastic MapReduce setup. This post includes the basic steps to get going.

In an interesting take on a distributed system, this author describes data copying and decoupling as the main properties of a distributed system.

The Heron streaming library has a new Streamlet API to support a higher-level programming model. It's a functional API that feels familiar if you've used Scala sequences or Java Streams, and, as demonstrated in the inline code, it's a much more concise mechanism for defining a streaming program.

The Confluent blog has recently had several articles about exactly-once semantics in Apache Kafka. In the latest post in their series, they describe how the Kafka Streams API achieves exactly-once processing.

CERN's Accelerator Logging Service ingests 100s of GB/day. They've implemented a new architecture based on Apache Kafka, Apache HBase, and Apache Spark.

The Streamlio blog has a post that describes Total Order Atomic Broadcast, the architecture of Apache BookKeeper including its primitives and how it recovers from write crashes, and how Apache Pulsar uses BookKeeper.

The Banzai team has announced a new PaaS tool for CI/CD on top of Kubernetes and AWS. This post describes how to use the tool for CI/CD of a Spark application.

You may have heard of Apache Arrow but might not know too much about it given its use as an internal library. This post describes how Pandas and other libraries make use of Arrow to improve data translation and storage footprint in memory.

As I usually say, you should always validate a benchmark with your own use case, rather than trusting what you see online. This example helps to really drive home that point—a small bug in data production of a benchmark driver program caused a major slowdown for Apache Flink in a competitor's analysis.

This post shows how to track Presto queries on a cluster by implementing an Event Listener class to log the contents of queries. The code is available on GitHub, and the article includes instructions for deploying the custom code via Amazon EMR.

This post provides an introduction to the basic parts of Apache Kafka: brokers, partitions, producers, consumers, and more. There are a good number of pictures to help illustrate the concepts, too.


Videos from the recent Apex Big Data World conferences have been posted online. This post also features ten presentations that highlight some of the conference's themes.

Starburst is a new company focussing on Presto—both through development of the engine and by providing enterprise support.

There are a few hours left to submit a proposal for Flink Forward, which takes place in April in San Francisco. The CFP ends at 11:59 PT tonight.

The SysML conference is an upstart taking place Feb 16th in Stanford, California. The call for papers is open until January 5th—submissions should be in or across the machine learning and systems areas (there are several suggested topics on the website).

The Stream Processing Meetup @ LinkedIn has posted videos from their December meetup. Speakers were from LinkedIn, Uber, and Slack and spoke on Samza SQL, Kafka performance, and Samza at Slack.

A new episode of the Data Engineering Podcast covers the Confluent Schema Registry.

Apache Hadoop 3.0.0 was released this week (more details below). Datanami has a look at what's ahead for the next couple of releases after it (including new GPU resource types, Docker support, support for long-running services, FPGA support, and an S3-compatible blob store).


Apache Kudu 1.6.0 was released with a number of new features and improvements (including improvements to startup time, improved locality for Spark integration, performance improvements to the Kudu master, and more). There are also several fixed issues—see the announcement for a full list.

Spark-bam is a new library for reading BAM files, a common file format for genomic data.

Apache Flink 1.4.0 was released. It includes improvements to exactly-once applications, the removal of the transient Hadoop dependency, and improvements to Flink internals.

StreamSets Control Hub is a new product from the StreamSets team to help centralize/visualize end-to-end data workflows and to assist in collaboration.

Apache Hadoop 3.0.0 was released. It adds HDFS erasure coding, a preview of v2 of the YARN Timeline Service, YARN/HDFS Federation improvements, and more.

Apache HBase 1.1.13 is the final release of the 1.1 branch. It includes a number of bug fixes and improvements.

Apache Knox 0.14.0 was released - it's anticipated to be the last before version 1.0. Highlights include NiFi and Livy API proxy support, remote configuration via ZooKeeper, and improvements to WebSocket support.

A new version of the kafka-connect-fs library, which supports loading data out of files via Kafka Connect, was released.


Curated by Datadog



Confluent KSQL on Our First Kafka Meetup in Dallas! (Dallas) - Tuesday, December 19


First Apache Airflow Meetup @ Heineken (Amsterdam) - Thursday, December 21


Real-Time Data-Ingest with Apache NiFi for IoT and Streaming Analytics (Frankfurt) - Monday, December 18


Apache Spark (Shanghai) - Saturday, December 23


Our First Kafka Meetup with Guest Speaker from Confluent (Melbourne) - Tuesday, December 19