Data Eng Weekly

Hadoop Weekly Issue #219

04 June 2017

Lots of great technical posts this week, including several on using Amazon S3 cloud storage and building data systems with Apache Kafka. There's also a post on the Luigi workflow engine, writing Cucumber tests for Spark, and several news/release posts.


The Pivotal blog has a tutorial (with sample code) for building a machine learning pipeline using the Luigi workflow engine. The main functionality in the example is Apache MADlib (incubating), which is executed via PL/pgSQL from the Luigi tasks.
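The tutorial itself uses the luigi package (with MADlib invoked via PL/pgSQL), which isn't reproduced here. As a rough illustration of the workflow model Luigi is built around — tasks declaring requires/output/run, with completed outputs skipped on re-runs — here is a toy stdlib-only sketch; all class and file names are hypothetical:

```python
# Toy illustration of Luigi's task model (requires/output/run) using only
# the standard library. Real Luigi pipelines subclass luigi.Task instead.
import os
import tempfile

class Task:
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def complete(self):
        # A task is done when its output file exists (Luigi's default idea
        # of completeness for file targets).
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

class ExtractFeatures(Task):
    def __init__(self, workdir):
        self.workdir = workdir
    def output(self):
        return os.path.join(self.workdir, "features.csv")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("x,y\n1,2\n")

class TrainModel(Task):
    def __init__(self, workdir):
        self.workdir = workdir
    def requires(self):
        return [ExtractFeatures(self.workdir)]
    def output(self):
        return os.path.join(self.workdir, "model.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("trained\n")

def build(task):
    # Depth-first: run dependencies before the task itself, skipping
    # anything whose output already exists — so re-running is cheap.
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

workdir = tempfile.mkdtemp()
build(TrainModel(workdir))
print(sorted(os.listdir(workdir)))  # both outputs exist after one build
```

Because completeness is checked against outputs on disk, calling build() a second time is a no-op — the same idempotence that makes Luigi pipelines restartable.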

This post describes how to use Apache Apex to write data from Apache Kafka to Apache Kudu. In addition to the basics, the post covers how to implement exactly once semantics, how to handle partial (single column) updates, and some of the operational metrics that are captured as part of the process.
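The post's Apex/Kudu specifics aren't reproduced here, but the general trick behind exactly-once sinks is worth sketching: key each write on the message's Kafka (partition, offset) so that at-least-once redelivery produces no duplicates. A minimal stdlib sketch, using sqlite3 as a stand-in for Kudu (table and payloads are hypothetical):

```python
# Exactly-once effect from at-least-once delivery: make the write
# idempotent by keying it on the Kafka (partition, offset) pair.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    kafka_partition INTEGER,
    kafka_offset    INTEGER,
    payload         TEXT,
    PRIMARY KEY (kafka_partition, kafka_offset))""")

def write_message(partition, offset, payload):
    # INSERT OR IGNORE makes the write idempotent: a replayed message
    # with the same (partition, offset) is silently dropped.
    db.execute("INSERT OR IGNORE INTO events VALUES (?, ?, ?)",
               (partition, offset, payload))

write_message(0, 0, "a")
write_message(0, 1, "b")
write_message(0, 0, "a")  # redelivery after a simulated failure

count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2 rows, not 3
```

Kudu's support for upserts makes this pattern natural there; the same idea also underlies partial (single-column) updates, since an upsert of one column can be replayed safely.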

The Databricks blog has two posts on cloud storage. In the first, they describe a number of advantages of using Amazon S3 over HDFS when running in AWS, including 5x cost savings and higher availability and durability. The second post covers transactional writes to Amazon S3, the absence of which has long been a drawback. While other cloud services added this functionality some time ago (e.g. Amazon EMR introduced EMRFS in 2014), Databricks is adding a new feature that improves on the support available in vanilla Hadoop's S3 implementation.

The MapR blog has an in-depth overview of performance tuning on a real-life application that involves Apache Kafka, Spark Streaming, and Apache Ignite (for caching of RDDs). Improvements include increasing the number of Kafka partitions, fixing an RPC timeout setting, tuning memory of both Spark and Ignite, and modifying the batch interval.

This post makes the case that "Apache Kafka is more disruptive than merely being faster ETL." It highlights several advantages that Kafka brings, including integration between streaming/applications/databases, distributing ETL (rather than a centralized monolith), and scale & reliability.

Datanami has a post describing Pandora's Kafka deployment, which uses Kafka Connect to write Apache Parquet files to a Hadoop cluster. They are making use of the Kafka Schema Registry, and they've written a custom Gradle plugin for migrations. As the post highlights, they've had a positive overall experience despite some issues (e.g. when HDFS is unavailable, the HDFS Sink Connector can corrupt its WAL).
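For context, a Kafka Connect HDFS sink configured for Parquet output with the Schema Registry looks roughly like the fragment below (as submitted to the Connect REST API); the connector name, topic, URLs, and sizes are hypothetical, while the property keys are the Confluent HDFS connector's standard ones:

```json
{
  "name": "hdfs-parquet-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "4",
    "topics": "listener-events",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "10000",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

The AvroConverter is what ties the connector to the Schema Registry: each record's schema is registered and versioned there rather than embedded in every message.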

As a nice follow-up to the Databricks posts on S3 vs. HDFS, this post describes some of the main features and options of S3DistCp for copying data from HDFS to S3. Based on Hadoop's DistCp, S3DistCp is optimized for S3 and offers features like changing file compression.
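For reference, an S3DistCp invocation on an EMR cluster generally looks like the sketch below; the bucket and paths are hypothetical, and the flags shown (--src, --dest, --srcPattern, --outputCodec) are standard S3DistCp options:

```
s3-dist-cp \
  --src hdfs:///logs/2017/06/ \
  --dest s3://example-bucket/logs/2017/06/ \
  --srcPattern '.*\.log' \
  --outputCodec gz
```

Here --outputCodec recompresses files during the copy, one of the S3-specific conveniences plain DistCp lacks.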

This post provides an overview of how to use Apache Spark with Cucumber for automated testing.
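The post's own scenarios aren't reproduced here, but Cucumber tests for a Spark job are driven by Gherkin feature files of this general shape (the feature, table, and step wording below are hypothetical, with step definitions wired to a local Spark session behind the scenes):

```gherkin
Feature: Word count Spark job
  Scenario: Counting words in a small input
    Given a local Spark session
    And an input file containing:
      | line        |
      | hello world |
      | hello spark |
    When the word count job runs
    Then the count for "hello" should be 2
```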


The Confluent Log Compaction post includes a preview of what to expect in the next Kafka release. New features include exactly-once semantics, a new admin API, and improvements to Kafka Connect and operations.

Apache SystemML, which is a machine learning library that's built for scaling out on Apache Spark and Apache Hadoop, has become a top-level project. The press release includes an overview and quotes from some companies that are using it.

Spark Summit is this week in San Francisco. This post has a preview of several of the keynotes and community talks as well as a promo code for a last minute registration.

In just over a week (June 13-15), the DataWorks Summit/Hadoop Summit takes place in San Jose. This press release has a brief overview of what to expect.

Apache Hadoop CVE-2017-7669 is a privilege escalation vulnerability in the Docker feature that was added in Apache Hadoop 2.8.0 (and other alpha releases). There isn't yet a fix for the 2.8.x line, so the mitigation is to disable Docker support.

The team at StreamSets has raised $20 million for product development and expansion in Europe.


Apache Avro 1.8.2 was released. It is a bug fix release (across Java, C++, Python 3, and Ruby), and there are also a small number of improvements.

Apache Flink 1.3.0 was released with several major areas of improvement. Specifically, there are improvements to recovery (and state handling), the DataStream API, the Table API (SQL), and deployment and tooling (including watermark monitoring in the web front-end). More details about the new features can be found in the release announcement.


Curated by Datadog


Pre-Spark Summit Bay Area Apache Spark Meetup (San Francisco) - Monday, June 5

Meet the SMACK Stack Experts AMA Meetup (San Francisco) - Monday, June 5

Spark in Adtech (San Francisco) - Tuesday, June 6

Women in Big Data Luncheon at Spark Summit West (San Francisco) - Wednesday, June 7

Integrated Dataflow Processing with Spark and StreamSets (San Francisco) - Wednesday, June 7

Dr. Elephant Meetup: Spark Summit Edition (San Francisco) - Wednesday, June 7

Just Enough Scala for Spark (San Francisco) - Thursday, June 8

Apache Spark: Cool Magic and ML Futures (San Francisco) - Thursday, June 8


Slim Baltagi Presents: Kafka Streams for Java Enthusiasts (Chicago) - Thursday, June 8

North Carolina

Real-Time Data Processing with NiFi and Kafka (Raleigh) - Thursday, June 8


Data Science, Spark With RapidMiner and Serverless (Paris) - Thursday, June 8

Stream Processing with Apache Flink (Talence) - Thursday, June 8


June Kafka Meetup (Utrecht) - Thursday, June 8


Spark/Flink, Emma, and Kafka Meetup (Berlin) - Wednesday, June 7


KrkDataLink#2: Guerilla Streaming Data Platform (Krakow) - Wednesday, June 7


Big Data Talks and Drinks (Bucharest) - Wednesday, June 7

If you didn't receive this email directly, and you'd like to subscribe to weekly emails, please visit