Data Eng Weekly

Hadoop Weekly Issue #221

25 June 2017

Several more articles on proprietary tools than I usually cover, but there are really interesting things to note from Google, Qubole, and Amazon Web Services. The new Hortonworks Streaming Analytics Manager looks neat, and there are a couple of articles on Kafka. Finally, the slides and videos from two recent conferences—DataWorks and Berlin Buzzwords—have been posted.


The Google Cloud Big Data blog has a post on common use cases for Cloud Dataflow. While some of the content is Google Cloud-specific, the patterns and pseudo-code presented are largely general purpose, and it's interesting to see how Cloud Dataflow solves various problems.

The Qubole blog describes how they keep the Hive Metastore's statistics up to date. They've implemented a custom MetastoreEventListener to detect new tables, partitions, etc. and kick off Hive ANALYZE commands to compute statistics. There are a few improvements over the naive solution, particularly throttling and batching of commands that involve multiple partitions.
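
To make the throttling idea concrete, here is a minimal sketch in Python. It is not Qubole's implementation: the function name, the event shape, and the `max_statements` cap are all hypothetical, and deduplicating repeated events stands in for whatever batching the post actually uses. Only the ANALYZE TABLE ... COMPUTE STATISTICS statement form comes from Hive itself.

```python
def build_analyze_statements(events, max_statements):
    """Turn metastore events into at most `max_statements` ANALYZE commands.

    events: iterable of (table, partition_spec) pairs, where partition_spec
    is a dict such as {"ds": "2017-06-25"} or None for an unpartitioned table.
    Returns (statements, deferred_events); deferred events wait for the next run.
    """
    # Deduplicate repeated events for the same table/partition, keeping order.
    pending = {}
    for table, spec in events:
        key = (table, tuple(sorted(spec.items())) if spec else None)
        pending.setdefault(key, None)

    statements, deferred = [], []
    for table, spec_items in pending:
        if len(statements) >= max_statements:
            # Throttle: anything over the cap is pushed to a later run.
            deferred.append((table, dict(spec_items) if spec_items else None))
            continue
        if spec_items:
            part = ", ".join("{}='{}'".format(k, v) for k, v in spec_items)
            statements.append(
                "ANALYZE TABLE {} PARTITION ({}) COMPUTE STATISTICS".format(table, part)
            )
        else:
            statements.append("ANALYZE TABLE {} COMPUTE STATISTICS".format(table))
    return statements, deferred
```

Capping the number of statements per run keeps a burst of new partitions from flooding the cluster with statistics jobs, at the cost of statistics lagging a little behind.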

Hortonworks has an in-depth look at the new Streaming Analytics Manager. The post describes the main components (service pools and environments) and shows how to build an application using the Stream Builder canvas. There's integration with the Hortonworks Schema Registry to automatically detect the schema from a Kafka topic, and built-in support for common streaming processors like joins, projections, and aggregations.

Qubole has a jump start on a lot of other vendors when it comes to big data as a service. This post describes one of their cloud-specific differentiators: Container Packing. Based on the YARN fair scheduler, container packing is a mechanism to improve the scale-down capabilities of an auto-scaling cluster to ultimately keep costs down. The post describes the high-level algorithm for container packing, and how to enable it in Qubole.
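
The intuition behind packing can be shown with a small Python sketch. This is a generic "fullest node first" placement rule, not the algorithm from the Qubole post and not YARN code; the `pick_node` function and the node dictionary are hypothetical. Placing new containers on already-busy nodes lets lightly loaded nodes drain, which makes them candidates for removal when the cluster scales down.

```python
def pick_node(nodes, request):
    """Choose a node for a container needing `request` resource units.

    nodes: dict mapping node name -> (used, capacity).
    Returns the name of the fullest node that still has room, or None.
    """
    candidates = [
        (used, name)
        for name, (used, capacity) in nodes.items()
        if used + request <= capacity
    ]
    if not candidates:
        return None
    # Prefer the most-used node so that idle nodes stay idle.
    return max(candidates)[1]


nodes = {"a": (6, 8), "b": (1, 8), "c": (0, 8)}
print(pick_node(nodes, 2))
print(pick_node(nodes, 4))
```

A scheduler that spreads load evenly would instead pick the emptiest node, which keeps every node slightly busy and makes scale-down harder.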

The Cloudera blog has an overview of various strategies for managing offsets when running Apache Spark Streaming jobs based on data in Apache Kafka. The post includes code for saving and loading offsets to Apache HBase, Apache ZooKeeper, and Kafka.
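
The common thread across those storage options is when the offset gets saved. Here is a minimal sketch in plain Python, assuming an in-memory dictionary in place of HBase, ZooKeeper, or Kafka and a list in place of a topic partition. The class and function names are hypothetical and this is not the code from the Cloudera post.

```python
class InMemoryOffsetStore:
    """Stand-in for an external offset store (illustrative only)."""

    def __init__(self):
        self._offsets = {}

    def load(self, topic, partition):
        # Start from offset 0 when nothing has been saved yet.
        return self._offsets.get((topic, partition), 0)

    def save(self, topic, partition, next_offset):
        self._offsets[(topic, partition)] = next_offset


def process_batch(log, store, topic, partition, batch_size, handle):
    """Process one batch of records, then record where to resume."""
    start = store.load(topic, partition)
    batch = log[start:start + batch_size]
    for record in batch:
        handle(record)
    # Save only after the whole batch succeeded. A crash before this line
    # means the batch is reprocessed on restart (at-least-once delivery).
    store.save(topic, partition, start + len(batch))
    return len(batch)
```

Saving offsets after processing gives at-least-once behavior; saving them before processing would give at-most-once instead.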

This post provides an overview of how to use Kafka for streaming ETL. The tutorial uses Kafka Connect for extracting data from a relational database (including a simple transformation), running a Kafka Streams application, and then loading the data into another database (once again) using Kafka Connect. The post has lots of code (which tends to be mostly configuration) and an overview of what each of these pieces is doing.

In this post, SparkR is used to parallelize a Markov Chain Monte Carlo calculation to improve runtime from 48 hours on a single machine to 45 minutes on a 50-node Spark cluster. The post has a few good tips and highlights some gotchas related to SparkR.

The AWS Big Data blog has a post on best practices for Amazon Redshift Spectrum (using Redshift to query data in S3). Among the recommendations are suggestions for when to use Spectrum vs. Athena and how to allocate data between Redshift local storage and S3.

A fix for a Linux CVE is causing some Hadoop daemons to crash on startup. This post describes how to work around the issue.


Roaring Elephant is a bi-weekly podcast about the Hadoop ecosystem. The latest episode covers Apache Zeppelin.

SearchDataManagement has a good analysis of the IBM and Hortonworks deal, including what both companies get out of the agreement.

Videos and slides from the DataWorks Summit and Berlin Buzzwords have been posted online.

Confluent's Log Compaction has coverage of the upcoming Kafka 0.11.0 release, which will have exactly-once semantics via an idempotent producer (among other things). The post also has links to a number of great Kafka-related blogs and presentations.
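
For a rough idea of how an idempotent producer avoids duplicate writes, here is a toy Python model. It assumes the general scheme of a producer ID plus a per-partition sequence number, with the broker dropping anything it has already appended. The `PartitionLog` class is hypothetical and leaves out almost everything a real broker does (batches, epochs, persistence).

```python
class PartitionLog:
    """Toy model of one partition that deduplicates producer retries."""

    def __init__(self):
        self.records = []
        self._last_seq = {}  # producer_id -> last appended sequence number

    def append(self, producer_id, sequence, value):
        """Append `value`; return False if it is a retry of an earlier write."""
        last = self._last_seq.get(producer_id, -1)
        if sequence <= last:
            # Already written: the producer resent after a lost acknowledgement.
            return False
        if sequence != last + 1:
            # A gap means an earlier message went missing; reject the write.
            raise ValueError("out-of-order sequence number")
        self.records.append(value)
        self._last_seq[producer_id] = sequence
        return True
```

With this check in place, a producer can safely retry a send whose acknowledgement was lost, because the second copy is recognized and ignored.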

Confluent is offering a free (behind an email/phone number wall) preview edition of "Kafka: The Definitive Guide."


Cloudera has announced that their Cloudera Altus platform-as-a-service offering is getting support for EC2 spot instances.

Apache Flink 1.3.1 was released. It includes a number of bug fixes, improvements to documentation, and more.

Apache HBase 1.1.11 resolves 20 bugs over the previous patch release. The announcement includes a list of notable fixes.

Apache Pig, which had its last release just over a year ago, just announced version 0.17.0 with support for Pig on Spark. This version of Pig requires a 2.7.x release of Hadoop.


Curated by Datadog



Distributed Deep Learning on Apache Spark w/ BigDL (Palo Alto) - Monday, June 26

Integrating Real-Time Video Data Streams with Spark and Kafka (Culver City) - Thursday, June 29


IoT: Applying Machine Learning to Real-Time Sensor Data on Spark and Kafka (Madison) - Tuesday, June 27

North Carolina

CHS: Microsoft HDI as a Big Data and Interoperability Platform (Charlotte) - Thursday, June 29


Big Data Journey: Getting Up and Running with Apache Spark (Reston) - Monday, June 26

Event-Driven, Fault-Tolerant Microservices Using Kafka (Richmond) - Wednesday, June 28

New York

[QCon Meetup] Survival of the Fittest: Streaming Architectures (New York) - Monday, June 26

[QCon Meetup] Papers We Love w/ John, Matt, Charity, and Gwen (New York) - Monday, June 26


Toronto Apache Spark #21 (Toronto) - Wednesday, June 28


Stateful Stream Processing (London) - Tuesday, June 27


Big Data & Data Science: June Edition (Montpellier) - Thursday, June 29


Scheduling Workloads with Apache Airflow + Running Spark on Google Cloud (Gent) - Thursday, June 29


Rethinking Stream Processing with Apache Kafka (Munich) - Wednesday, June 28


Building Streaming Applications with Kafka (Vienna) - Tuesday, June 27


Apache Nifi for Hortonworks Distribution (Pune) - Saturday, July 1


Data Community Meetup (Colombo 2) - Tuesday, June 27


Data-In-Motion: Recent Advances in Apache Projects for Streaming Data (Canberra) - Thursday, June 29
