Data Eng Weekly

Data Eng Weekly Issue #314

16 June 2019

A couple of interesting new tools, online course materials for distributed systems, and several posts on data formats (including a look at the performance of HDFS erasure coding). Also a look at the Apache Flink network stack, a new Jepsen analysis, and how Apache Kafka is used for astronomy. Should be something for everyone!


Kedro is a new data and development workflow framework that implements best practices for data pipelines, with an eye toward productionizing machine learning models. Kedro has an interesting SDK for working with datasets and an integration with PySpark. There are some great docs for the project, including several tutorials.
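
The core idea behind pipeline frameworks like Kedro is composing pure functions as nodes that read and write named datasets. Here's a minimal, hypothetical sketch of that pattern in plain Python — this is not Kedro's actual API, just the shape of the concept:

```python
# A toy node-and-pipeline model: a "node" bundles a function with the
# dataset names it reads and writes; the runner executes nodes against
# a catalog (a dict of named datasets). NOT Kedro's real API.

def make_node(func, inputs, output):
    """Bundle a function with its named inputs and output."""
    return {"func": func, "inputs": inputs, "output": output}

def run_pipeline(nodes, catalog):
    """Run nodes in order, wiring datasets through the catalog."""
    for node in nodes:
        args = [catalog[name] for name in node["inputs"]]
        catalog[node["output"]] = node["func"](*args)
    return catalog

# Two toy processing steps.
def clean(raw):
    return [r.strip().lower() for r in raw]

def count(rows):
    return len(rows)

catalog = {"raw_text": ["  Alpha", "Beta  ", " Gamma "]}
pipeline = [
    make_node(clean, ["raw_text"], "clean_text"),
    make_node(count, ["clean_text"], "row_count"),
]
result = run_pipeline(pipeline, catalog)
print(result["clean_text"])  # ['alpha', 'beta', 'gamma']
print(result["row_count"])   # 3
```

Because nodes only touch named datasets, the same pipeline definition can be rerun, tested, or pointed at different storage backends — which is the productionization angle Kedro is after.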

The Apache Flink blog has a thorough design overview of the Netty- and Akka-based network stack. For Flink 1.5, they implemented credit-based flow control for multiplexed channels. The architecture includes some clever backpressure handling to keep latency low while providing high throughput.
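
To make the flow-control idea concrete, here is a toy illustration of the credit-based scheme (class and method names are illustrative, not Flink internals): the receiver advertises credits, the sender only transmits while it holds credits, and a slow receiver therefore backpressures the sender instead of letting in-flight buffers pile up:

```python
# Toy credit-based flow control: one credit = one buffer the receiver
# can accept. Names are illustrative; this is not Flink's implementation.
from collections import deque

class Receiver:
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.received = []

    def accept(self, buf):
        self.received.append(buf)

    def grant_credit(self):
        # Called when the receiver frees a buffer and can take more data.
        self.credits += 1

class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.backlog = deque()

    def send(self, buf):
        self.backlog.append(buf)
        self.flush()

    def flush(self):
        # Only transmit while the receiver has advertised spare capacity.
        while self.backlog and self.receiver.credits > 0:
            self.receiver.credits -= 1
            self.receiver.accept(self.backlog.popleft())

rx = Receiver(initial_credits=2)
tx = Sender(rx)
for i in range(5):
    tx.send(f"buffer-{i}")

print(rx.received)       # only 2 buffers delivered; credits exhausted
print(len(tx.backlog))   # 3 held back by backpressure

rx.grant_credit()        # receiver frees a buffer
tx.flush()
print(len(rx.received))  # 3
```

The appeal for a multiplexed channel is that one slow logical stream stops consuming credits without blocking the shared TCP connection for everyone else.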

Debezium has started a community newsletter highlighting articles, releases, and Q&A related to their change data capture system. Lots of good stuff to read in here.

The Cloudera blog has a post on HDFS erasure coding, comparing performance on various benchmarks (e.g., terasort and word count) between data stored using replication and erasure coding. They also describe failure recovery, which takes a bit longer for erasure coding. The post also covers several factors that can impact performance, like data locality, block size, and Intel's hardware acceleration.
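
HDFS uses Reed-Solomon codes (e.g., RS(6,3)); a single XOR parity block is the simplest possible analogue and is enough to show the trade-off — much lower storage overhead than replication, but recovery has to read every surviving block:

```python
# Toy single-parity erasure code: one XOR parity block over N data
# blocks tolerates the loss of any one block. A simplified stand-in
# for the Reed-Solomon codes HDFS actually uses.

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_blocks):
    """Compute one parity block over all data blocks."""
    parity = data_blocks[0]
    for block in data_blocks[1:]:
        parity = xor_blocks(parity, block)
    return parity

def recover(surviving_blocks, parity):
    """Rebuild the single missing block by XOR-ing everything else."""
    missing = parity
    for block in surviving_blocks:
        missing = xor_blocks(missing, block)
    return missing

blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(blocks)

# Lose one block. Recovery must read all remaining blocks plus parity,
# which is why rebuilds cost more than re-copying a replica.
lost = blocks[1]
rebuilt = recover([blocks[0], blocks[2]], parity)
print(rebuilt == lost)  # True

# Storage overhead here: 3 data + 1 parity = 1.33x,
# versus 3x for 3-way replication of the same data.
```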

This article describes a really interesting astronomy use case for Apache Kafka. After analyzing images of the night sky, events about changing objects (at a rate of ~1 million per night) are published to Kafka for analysis downstream by researchers.
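
The publish side of such a pipeline might look roughly like this — each detected change becomes a small event keyed by object ID, so all alerts for one object land in the same partition. The event fields are hypothetical, and an in-memory dict stands in for the Kafka topic (a real deployment would use a Kafka producer client):

```python
# Sketch of keyed event publishing. The dict-of-lists is a stand-in
# for a partitioned Kafka topic; field names are made up for
# illustration, not taken from the article.
import json
from collections import defaultdict

topic = defaultdict(list)  # partition index -> list of serialized events
NUM_PARTITIONS = 4

def publish(topic, key, value):
    """Route the event to a partition by key, like a keyed Kafka produce."""
    partition = sum(key.encode()) % NUM_PARTITIONS  # deterministic toy hash
    topic[partition].append(json.dumps(value))

alerts = [
    {"object_id": "obj-001", "ra": 150.1, "dec": 2.2, "mag_change": -0.8},
    {"object_id": "obj-002", "ra": 210.4, "dec": -5.1, "mag_change": 1.3},
]
for alert in alerts:
    publish(topic, alert["object_id"], alert)

total = sum(len(p) for p in topic.values())
print(total)  # 2 events published across partitions
```

At ~1 million events per night, the keyed-partition model is what lets many downstream researchers consume and filter the stream independently.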

Another excellent Jepsen post, this time on TiDB. There are a lot of interesting details about failure scenarios and edge cases, proving once again that distributed systems are complex. I particularly appreciated the discussion of consistency models that describes how TiDB's compare to MySQL's.

PingCAP, the creators of TiDB, have an online self-directed course on writing distributed systems in Go and Rust. They provide interfaces and test cases for a number of coding exercises. Seems like a great way to learn firsthand.

A look at one organization's use of Apache Parquet (to replace JSON) with Presto, which resulted in significant speed-ups. The post includes example code for producing Parquet in Python, as well as some tips for getting the Python library to produce a data file that works well with Presto.

Stein is a Node.js application that presents a RESTful API based on a dataset stored in a Google Sheet. For simple projects, this could be a straightforward way to stand up a prototype.

The SSENSE blog has a look at the trade-offs between the CSV, Parquet, and Avro data formats. The author shares some benchmarking results, which led them to Avro for their use case. And, of possible interest to some folks, all the code samples in this post are in JavaScript.
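
One axis of that trade-off — text versus binary encoding — is easy to see with nothing but the standard library. This is not Parquet or Avro, just a rough reminder that row-per-line text formats pay a size (and parsing) cost that binary, schema-driven formats avoid:

```python
# Rough size comparison: the same numeric rows as CSV text versus
# fixed-width binary records via struct. Illustrative only; real
# Parquet/Avro add compression, schemas, and columnar layout on top.
import csv
import io
import struct

rows = [(i, i * 0.123456789) for i in range(1000)]

# CSV: human-readable and schema-free, but every value is decimal text.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)
csv_bytes = len(buf.getvalue().encode())

# Binary: fixed 12-byte records (int32 + float64), like a schema'd format.
bin_bytes = len(b"".join(struct.pack("<id", i, x) for i, x in rows))

print(csv_bytes, bin_bytes)  # the binary encoding is markedly smaller
```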


Curated by Datadog


Overview of Microsoft Data Platform Offerings (San Francisco) - Monday, June 17

Data Integration with Kafka: What, Why, How (Irvine) - Wednesday, June 19

WeWork's Use of Apache Kafka On-Prem + Rebalance Protocol Inside-Out (San Francisco) - Thursday, June 20


Chick-Fil-A Spark Use Case for Data & Analytics (Roswell) - Thursday, June 20


Implementing Distributed Tracing Like a Boss in Your Apache Kafka Deployments (Reston) - Tuesday, June 18


Spark in Kubernetes (Toronto) - Monday, June 17


5th Data Engineering Meetup (Belo Horizonte) - Tuesday, June 18


Apache Cassandra & Apache Kafka Workshop (London) - Thursday, June 20


From Zero to Hero with Kafka Connect (Oslo) - Tuesday, June 18


ParisDataEng: Stream Data Processing (Paris) - Thursday, June 20


Apache Cassandra & Apache Kafka Workshop (Berlin) - Tuesday, June 18

Introduction to Knative + Kafka on Kubernetes (Berlin) - Tuesday, June 18


Topic Management at Scale (Amsterdam) - Tuesday, June 18


Kafka and Its Friends (Bern) - Thursday, June 20


Getting Started with Spark & Cassandra + Using pySpark with Google Colab (Milano) - Wednesday, June 19


Big Data v 4.0 (Warszawa) - Tuesday, June 18


Kafka Meetup #9 (Bengaluru) - Monday, June 17

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.