Data Eng Weekly

Hadoop Weekly Issue #242

03 December 2017

Lots of great articles this week—after skipping last week it's a double issue. A bunch of content on Kafka (both technical posts and a few releases from Confluent), and posts about several different cloud services from AWS, Databricks, Google, and more. MapR 6.0 was announced, and Impala has graduated from the Apache incubator.


This post describes Apache Kafka's transactions in detail: it shows the producer and consumer APIs, explains the transaction coordinator, and traces the data flow of a transaction. It's a good overview of the design as well as how to use transactions in practice (and what the tradeoffs are).
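
As a rough illustration of what consumers see when transactions are in play, here is a toy in-memory model in Python. It is not the Kafka client API: the `Entry` type, the `txn_id` field, and the `read_committed` function are hypothetical names made up for this sketch. It only mimics the behavior of a consumer configured with `isolation.level=read_committed`, which skips records from aborted transactions and does not read past the first record of a transaction that is still open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    txn_id: str                  # hypothetical transaction identifier
    kind: str                    # "data", "commit", or "abort"
    value: Optional[str] = None  # payload for "data" entries only


def read_committed(log: List[Entry]) -> List[str]:
    """Return the values a read_committed-style consumer would see.

    Records from aborted transactions are filtered out, and reading stops
    at the first record of a transaction that has no marker yet.
    """
    # Map each finished transaction to its outcome ("commit" or "abort").
    outcome = {e.txn_id: e.kind for e in log if e.kind in ("commit", "abort")}
    visible = []
    for entry in log:
        if entry.kind != "data":
            continue  # control markers are never handed to the application
        result = outcome.get(entry.txn_id)
        if result is None:
            break  # open transaction: nothing from here on is stable yet
        if result == "commit":
            visible.append(entry.value)
    return visible


log = [
    Entry("t1", "data", "a"),
    Entry("t2", "data", "x"),
    Entry("t1", "data", "b"),
    Entry("t1", "commit"),
    Entry("t2", "abort"),
    Entry("t3", "data", "c"),  # t3 is still open
]
print(read_committed(log))  # ['a', 'b']
```

Once a commit marker for `t3` is appended, the same call also returns `"c"`. The real protocol involves the transaction coordinator and per-partition state that this sketch leaves out.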

Spotify uses Google PubSub as part of their Event Delivery System. This post describes how they scale consumers, which are deployed via Docker and write data to Google Cloud Storage. There are some interesting work-arounds for Docker bugs and how they interact with Puppet/Google Cloud autoscaling.

This post provides a brief overview of Google Cloud Dataprep, which is currently in open beta. Dataprep is an ETL-as-a-Service tool with BigQuery integration.

A good overview of the evolution of data processing frameworks in the past few years, this post looks at the older (e.g. Storm and Samza) as well as newer (e.g. Beam, Spark, and Flink) stream processing frameworks.

The Apache Flink blog has a look at what's ahead in release 1.4 and 1.5, including a new deployment model, improved latency via a reworked network stack, and faster recovery from failures via incremental snapshots.

Qubole has written about a prototype that they've built for running Apache Spark via AWS Lambda. With it, they can scan 1 TB of data for about $1.18 and sort 100 GB of data for about $7.50. Lambda doesn't allow functions to directly communicate with one another, so the Qubole team has written a custom scheduler and state store (described in more detail in the post) for Spark 2.1.0. The code is on GitHub, and there's a roadmap of future work.

The Databricks blog has a recap, with slides and video, from the Women in Big Data Meetup. There are talks on stereotypes in CS, Spark performance (including future work to get better insights into bottlenecks), and using Spark to build a Deep Learning pipeline.

RubiX is a caching file system atop Amazon S3, which offers performance improvements by caching metadata about block locations and other object store details. This post walks through how to use it (source code on GitHub) with Presto, Hive, and Spark.

The AWS blog has a post from NUVIAD about their big data infrastructure for an online ad system. The full post has a lot of interesting details, but some highlights include that Redshift Spectrum with data in Parquet format is sometimes faster than traditional Redshift, it's quite simple to use AWS Lambda and AWS Glue to convert data from CSV to Parquet, and it's important to sort data within a Parquet file by a commonly used key.
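
To see why sort order matters, recall that Parquet files store min/max statistics per row group, and query engines can skip row groups whose range cannot match a filter. The sketch below simulates that idea in plain Python; it does not read or write Parquet, and the function names, the group size, and the key values are all hypothetical.

```python
import random


def row_group_stats(values, size):
    """Split values into fixed-size groups and record (min, max) for each."""
    groups = [values[i:i + size] for i in range(0, len(values), size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, key):
    """Count the groups whose [min, max] range could contain key."""
    return sum(1 for lo, hi in stats if lo <= key <= hi)


random.seed(0)
# 10,000 rows with a hypothetical filter column (say, a campaign id in 0..999).
keys = [random.randrange(1000) for _ in range(10_000)]

unsorted_hits = groups_to_read(row_group_stats(keys, 1000), 500)
sorted_hits = groups_to_read(row_group_stats(sorted(keys), 1000), 500)

print("row groups scanned, unsorted:", unsorted_hits)
print("row groups scanned, sorted:  ", sorted_hits)
```

With unsorted data nearly every group spans the full key range, so a filter on one key has to touch almost all of them. After sorting, the matching rows sit in one or two adjacent groups and the rest can be skipped.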

This post describes how to run Twitter Heron on Google Cloud Platform with Kubernetes.

This post argues that the CQRS and Event Sourcing models are in many ways similar to functional programming (with immutable data and side-effect-free functions). It compares this model with the traditional model (which is referred to here as object-oriented) by walking through the way data flows, failure scenarios, latency, and more.
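
The functional flavor of Event Sourcing can be shown in a few lines: current state is a left fold of a pure function over an immutable event log. This is a minimal Python sketch, not code from the post; the `Event` type, the event kinds, and the account-balance domain are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Tuple


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event types)
    amount: int


def apply(balance: int, event: Event) -> int:
    """Pure function: returns a new state and never mutates its inputs."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    return balance  # unknown events leave the state unchanged


def replay(events: Tuple[Event, ...], initial: int = 0) -> int:
    """Rebuild the current state as a left fold over the event log."""
    return reduce(apply, events, initial)


events = (Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5))
print(replay(events))      # 75
print(replay(events[:2]))  # 70, the state as of the second event
```

Because the log is never modified, replaying a prefix of it gives the state at any earlier point, which is what makes failure recovery and auditing straightforward in this model.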

Streamlio has an overview of Apache Pulsar (incubating) that highlights how it compares to Apache Kafka. There's a table at the end to summarize the extra features in Pulsar.


The Data Engineering Podcast has an interview with Doug Cutting (creator of Hadoop and Avro) and Julien Le Dem (creator of Parquet). The interview covers data serialization formats, including the development effort, trade-offs of formats, and the future of this area.

Apache Impala has graduated to a top-level project. It had spent just short of 2 years in the Apache Incubator after originally being developed as an open-source project at Cloudera.


MapR 6.0 was announced with new encryption/security features, a new change data capture framework, an improved MapR Control System (their management software), and more.

Apache Bigtop 1.2.1 was released—it includes a Docker provisioner, is built on JDK8 and has upgrades of several ecosystem projects.

Apache ZooKeeper 3.4.11 includes a number of bug fixes and improvements.

Confluent has made a number of announcements over the past two weeks. The biggest items are the general availability of their hosted Kafka solution, Confluent Cloud, and version 4.0 of the Confluent Platform. An update to the KSQL developer preview is also out. Confluent Cloud provides a 99.95% uptime SLA, and Confluent Platform 4.0 is based on Apache Kafka 1.0 and has updates to the Confluent Control Center.

Version 4.5.1 of Apache BookKeeper was released this week. It resolves a number of critical bugs.

StreamSets has announced version 3.0 of their StreamSets Data Collector (SDC) and a new Data Collector Edge, which is a lightweight Go binary. The new version of SDC has a number of new features, including ones related to AWS and Google Cloud, MapR, Kafka, Oracle CDC, and more.

Version 1.0.0 of Burrow, the monitoring tool for Apache Kafka, has been released. Since the release was originally announced a bit earlier, a number of new features and bug fixes have been added.


Curated by Datadog



Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Monday, December 4

Vespa Meetup (Sunnyvale) - Monday, December 4

Airflow Meetup @ Airbnb (San Francisco) - Monday, December 4

Building an In-House Elastic Hadoop/Spark Service on Multi-Cloud Environments (San Francisco) - Tuesday, December 5

Kickoff Meeting: Data Engineering Meetup (San Diego) - Wednesday, December 6

Big Data Processing at Spotify: The Road to Scio (Santa Clara) - Wednesday, December 6


Cloudera User Group Featuring Ascension Health and StreamSets (Austin) - Wednesday, December 6


The Founder of Kafka on the Rise of the Streaming Platform (Minneapolis) - Monday, December 4


Get an Update on HDF 3.0 (Madison) - Tuesday, December 5

IRELAND Apache Spark Fundamentals (Dublin) - Saturday, December 9


Apache Kafka: Real-Time Scalable Data Pipelines, by Dan Cook (Bristol) - Monday, December 4

An Evening of Apache Kafka (Manchester) - Wednesday, December 6


Getting to Know Apache Kafka (Oslo) - Tuesday, December 5


Sports Results and Music Videos, Powered by Apache Kafka (Amsterdam) - Thursday, December 7


Our First Kafka Meetup! (Hamburg) - Monday, December 4


Hadoop Meetup @ Microsoft (Bangalore) - Friday, December 8

Deep Dive into Spark Tuning (Bangalore) - Saturday, December 9