Data Eng Weekly

Hadoop Weekly Issue #227

06 August 2017

Short issue this week, but several high-quality posts including a presentation on Spotify's data infrastructure, an in-depth look at using Kafka as a central event store, several releases, and more.


Great presentation on how Spotify uses Google BigQuery and Scio (a Scala wrapper for Apache Beam/Cloud Dataflow) to power their data infrastructure. Scio integrates with BigQuery to stream the results of a query, implements several join optimizations, has a REPL, and has several other advanced features. The presentation also discusses a dataset diff tool for validating pipeline changes and ML models, and Featran, a library for transforming features in a type-safe way. It wraps up with a number of challenges that they've faced along the way.

A walkthrough of change data capture and using events stored in Apache Kafka as the source of truth rather than the relational database. It describes patterns such as writing to the database through the queue, secondary indexing in Elasticsearch, migrating away from legacy data systems, and so-called "memory images" for serving a dataset out of memory rather than via a relational database.
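As a rough illustration of the "memory image" pattern mentioned above, here is a minimal sketch that replays an ordered event log into an in-memory dictionary. It assumes a keyed log where a missing value marks a delete; the plain Python list stands in for a Kafka topic, and the names Event and build_memory_image are hypothetical, not taken from the linked post.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    """One record in the log. A value of None is treated as a delete (assumption)."""
    key: str
    value: Optional[dict]


def build_memory_image(events: Iterable[Event]) -> Dict[str, dict]:
    """Replay events in order; the latest event for each key wins."""
    image: Dict[str, dict] = {}
    for event in events:
        if event.value is None:
            # Delete marker: drop the key if it is present.
            image.pop(event.key, None)
        else:
            image[event.key] = event.value
    return image


# A plain list stands in for the event log here (hypothetical data).
log = [
    Event("user-1", {"name": "Ada"}),
    Event("user-2", {"name": "Grace"}),
    Event("user-1", {"name": "Ada L."}),
    Event("user-2", None),
]

image = build_memory_image(log)
print(image)
```

Because the log is the source of truth, the image can be thrown away and rebuilt at any time by replaying the events from the beginning.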

This post presents ten reasons why building your own deployment tools for Hadoop and Spark is not a good idea. It argues that rolling your own cluster is analogous to building your own car.

The Cloudera blog has an overview of S3Guard, a new feature for Hadoop's S3A file system that uses a metadata store to provide a consistent view of data stored in S3.

BigDL is a deep learning library and GraphX is a graph processing library for Apache Spark. The Qubole blog has a tutorial demonstrating how to train and evaluate a model using BigDL. There's a second post that walks through the GraphX APIs with example code.

The Amazon Big Data blog describes how to enable Hive's LLAP to improve performance of Hive queries.


Version 3.3 of the Confluent Platform has been released. This version includes the exactly-once semantics of Apache Kafka 0.11. The post has a tour of features, labeling them by where the features live (i.e., whether they are open source or enterprise-only).

Apache Curator 4.0.0 was released. Curator is a client library for Apache ZooKeeper, and the new version adds support for TTL Nodes, a strongly typed DSL, and more.

Apache Storm 1.0.4 and 1.1.1 were released. Both releases include a number of bug fixes and improvements.

Apache Flink 1.3.2 was released. It includes a number of bug fixes and improvements. Notably, there is a fix affecting users of the FlinkKafkaConsumer, who may need to take additional care when upgrading.


Curated by Datadog



Data Engineering vs Data Science, DataOps, & Overcoming Back Office Barriers (Westlake Village) - Tuesday, August 8


Hands On: Introduction to the Hadoop Ecosystem (Kansas City) - Tuesday, August 8


Spark: An Intro to Crazy Fast Big Data (Orlando) - Wednesday, August 9

North Carolina

Meet the Experts That Support Apache Kafka (Raleigh) - Wednesday, August 9

New Jersey

Deep Dive into HDF 3.0 (Morris Plains) - Tuesday, August 8

New York

Building a Production Data Pipeline (New York) - Thursday, August 10


Architecting a Recommendation Engine with Scala & Spark (Singapore) - Thursday, August 10


Data-in-Motion: Recent Advances in Apache Projects for Streaming Data (Sydney) - Wednesday, August 9

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit