Data Eng Weekly

Hadoop Weekly Issue #212

16 April 2017

After a week off, there's a lot of content in this issue. Streaming is once again a hot topic, with articles on Apache Flink, Apache Kafka, and Apache Spark. There are also great posts on Apache Hive's LLAP and Apache HBase's new 'accordion' memory management. In releases, there are quite a few patch releases as well as new releases from ODPi and Hortonworks.


Using PostgreSQL 9.4+'s logical decoding for change data capture (CDC), the team at Simple streams data to Kafka and ultimately to Amazon Redshift. Additionally, applications can do async processing based on data change events by subscribing to the CDC queue in Kafka. The post covers the architecture in depth, how their implementation relates to Bottled Water, and what their plans are for future enhancement.

The Databricks blog has a post demonstrating the Spark Structured Streaming APIs, with an emphasis on performing windowed aggregations on data in Kafka. The post also shows how to write the data out to a few different sinks (including Kafka) after processing.
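
As a rough illustration of what a windowed aggregation computes, here is a minimal sketch in plain Python. It does not use the Spark Structured Streaming API; the event tuples, the 60-second window size, and the function names are all assumptions made for the example.

```python
from collections import Counter

WINDOW_SECONDS = 60  # hypothetical tumbling-window size


def window_start(ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the tumbling window that contains timestamp ts."""
    return ts - (ts % size)


def windowed_counts(events, size: int = WINDOW_SECONDS):
    """Count events per (window start, key) pair."""
    counts = Counter()
    for ts, key in events:
        counts[(window_start(ts, size), key)] += 1
    return dict(counts)


# Hypothetical (epoch seconds, key) records standing in for messages read from Kafka.
events = [(0, "click"), (30, "click"), (59, "view"), (60, "click"), (125, "view")]
counts = windowed_counts(events)
```

Each record lands in exactly one window because the windows do not overlap; a sliding window would instead assign a record to every window that covers its timestamp.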

The Hortonworks blog has a tour of some of the features of Hive's LLAP that improve performance, including bloom filter-based filtering, dynamic text caching (conversion of CSV/JSON data to an in-memory optimized format), SSD caching, decimal vectorization, and SIMD optimizations.
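
For readers unfamiliar with the idea behind bloom filter-based filtering, the following is a minimal sketch in plain Python. It is not Hive's implementation; the class, its parameters, and the sample rows are hypothetical. The filter is built from the keys on the small side of a join and used to discard rows on the large side that cannot match.

```python
import hashlib


class BloomFilter:
    """A small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value: str):
        # Derive num_hashes positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical join keys from the small table.
small_side_keys = ["us", "de", "jp"]
bloom = BloomFilter()
for key in small_side_keys:
    bloom.add(key)

# Hypothetical rows from the large table; rows that fail the check cannot join.
large_side_rows = [("us", 1), ("fr", 2), ("jp", 3), ("br", 4)]
candidates = [row for row in large_side_rows if bloom.might_contain(row[0])]
```

Because a bloom filter can report false positives, `candidates` may still contain rows that do not join; the real join discards those afterwards.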

The Apache Flink team has written about Dynamic Tables, which are a table-based version of a data stream. Unlike an output record for a streaming job, a value in a Dynamic Table can be updated (somewhat like a materialized view in a database). The post has a lot more details, including how Dynamic Tables relate to current APIs.

The Apache HBase team has introduced a new compacting memstore (nicknamed "accordion") that optimizes HBase's memory usage. The compaction algorithm is similar to what is done with HFiles. The first article gives a brief overview of what this means and the types of performance improvements you might see. The second looks at the internals of the implementation and takes a deep dive into the technical details.

The Netflix blog has an overview of some features recently added to their open-source workflow tool, Netflix Conductor. These features include inversion of control (the ability to build up workflows from subworkflows), new event-based triggers/tasks, and new built-in support for JSON Path data transforms.

The Hortonworks blog has an overview of the SparkR architecture and a walkthrough of using SparkR (from an R programmer's perspective) to crunch on datasets in a SparkR DataFrame. The post includes an example of using gapply to apply a user-defined function.

This post describes a new feature in the Apache Spark scheduler to blacklist certain nodes that have unexplained errors. In particular, this solves an issue where a single node with intermittent disk failure could cause an entire job to fail.

The StreamSets Data Collector has a bunch of built-in functionality for transforming data. This post looks at transforming JSON data with the Field Pivoter, Field Flattener, Field Renamer, and Field Splitter.
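
To give a sense of what flattening a nested JSON record means, here is a minimal sketch in plain Python. It is not the StreamSets implementation or its configuration; the `flatten` function, the `.` separator, and the sample record are assumptions for illustration.

```python
def flatten(record: dict, separator: str = ".", prefix: str = "") -> dict:
    """Collapse nested dicts into one level, joining key names with separator."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far as the prefix.
            flat.update(flatten(value, separator, name))
        else:
            # Scalars (and lists, in this sketch) are kept as-is.
            flat[name] = value
    return flat


# Hypothetical nested record.
nested = {"id": 7, "user": {"name": "ada", "address": {"city": "London"}}}
flat = flatten(nested)
```

Here `flat` has the keys `id`, `user.name`, and `user.address.city`.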

This post describes the Apache Kudu read and write paths at a high level, including things like tablet discovery and the inner workings of the memory and disk data stores.

The OpenStreetMap dataset is publicly available on Amazon S3 as an ORCFile, and this post describes how to query it with Amazon Athena. There are a number of example queries inline and in a companion gist.

Pandora is using Kafka as the transport layer for their advertising (and other) data, using Kafka Connect and the Confluent Schema Registry to write data to HDFS. The post describes the architecture and some of the operational concerns, including how they monitor the pipeline and what types of throughput they see when writing data to HDFS.


This post looks at the big data ecosystem through a VC lens. It describes trends in marrying big data with artificial intelligence, IPOs (and signs that there's still plenty of private money too), moving to the cloud, consolidation, and more.

Kafka Summit NYC is just under a month away, and this post describes what to expect in the Streaming Pipeline Track, including talks by folks at Airbnb, Yelp, and Ancestry.

The speaker lineup for DataEngConf, which takes place in San Francisco April 26-28, has been announced.


ODPi 2.1 has been released. It's the first version that is using Apache Bigtop as the reference implementation, and the introductory blog post has more commentary on the release, including the introduction of a "Spark and Fast Data Analytics Special Interest Group."

Hortonworks Data Platform 2.6 has been released. It includes GA support for Apache Hive's LLAP and ACID data merge, security enhancements to Apache Zeppelin, improvements to Spark, and more. The release also adds support for IBM Power Systems; more details can be found in the second article below.

Hortonworks Data Cloud also has a new release with support for HDP 2.6, auto-scaling of cluster compute, node auto-repair, and more.

Apache Hive 1.2.2 was released. The bug-fix release includes over 40 fixes and a small number of improvements/new features.

Apache Apex Malhar 3.7 was released. The new release adds a number of new docs to the library that is central to building an application on the Apache Apex system.

Kafka-utilities is a new open-source project with support for tracking consumer group lag and finding in-sync replicas.
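
Consumer group lag is usually defined per partition as the log end offset minus the group's committed offset. The sketch below shows that arithmetic in plain Python. It does not reflect the Kafka-utilities API; the function name, the offset dictionaries, and the choice to treat a missing commit as offset 0 are assumptions.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: latest offset in the log minus the committed offset."""
    lag = {}
    for partition, end in log_end_offsets.items():
        # Assumption: a partition with no committed offset is read from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


# Hypothetical offsets keyed by partition number.
log_end = {0: 1500, 1: 900, 2: 40}
committed = {0: 1450, 1: 900}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
```

A lag that keeps growing for a partition suggests the consumer assigned to it is not keeping up with producers.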

Apache Accumulo 1.7.3 includes an improvement for Tablet Server performance and fixes a number of bugs. The blog post has many more details on the release, including a link to the full set of resolved issues.

Version 0.11.0 of Apache Twill, an abstraction over YARN for building distributed applications, was released.

Apache Cassandra 3.0.13 was released. The release includes, for the first time, RPM packages and a Yum repository.


Curated by Datadog



Security and Disaster Recovery for Your Hadoop Clusters (San Francisco) - Tuesday, April 18

Building Real-World Big Data & Machine Learning Systems (Santa Clara) - Wednesday, April 19

#SDBigData Meetup #21 (San Diego) - Wednesday, April 19

Overview of NoSQL Databases: MongoDB, HBase, Cassandra, and Jaguar (San Jose) - Wednesday, April 19

Building Faster Streaming Applications with Apache Storm 1.1 (Santa Clara) - Thursday, April 20


Talks from the Four Comma Club (Bellevue) - Tuesday, April 18

Lambda Streaming, Overkill Analytics, BigDL with Spark (Bellevue) - Thursday, April 20


Big Data at Uber (Louisville) - Thursday, April 20


Spark 2.0 pySpark Made Easy (Irving) - Wednesday, April 19

Learn about Hive LLAP (Houston) - Thursday, April 20

Fundamentals of Spark for Cassandra (Austin) - Thursday, April 20

Gluent and Hadoop (Addison) - Thursday, April 20


Apache Spark Workshop (Miami) - Saturday, April 22

IRELAND Inaugural Kafka Meetup (Dublin) - Thursday, April 20


Learning Hadoop the Hard Way (Manchester) - Wednesday, April 19

The Rise of Real-Time, with Neha Narkhede (London) - Wednesday, April 19


First Lisbon Apache Kafka Meetup (Lisbon) - Wednesday, April 19


Distributed Computation: Dask, Spark, or Hadoop? (Paris) - Tuesday, April 18

Big Data & Data Science (Montpellier) - Tuesday, April 18

Apache Kafka and The Rise of Real-Time, with Neha Narkhede (Paris) - Thursday, April 20


Apache Flink Zurich Kickoff (Zurich) - Tuesday, April 18
