Data Eng Weekly

Data Eng Weekly Issue #315

30 June 2019

This week's issue once again covers two weeks' worth of content, with great articles ranging from practical advice (e.g. how to deploy Airflow on Kubernetes) to interesting forward-looking ideas (e.g. the multiverse database). For those actively working on data engineering projects, there should be something useful whether you're implementing logging systems (see Pinterest's Singer), deploying Kafka for change data capture (see Yotpo's post on their CDC pipeline), or building software with Presto, Spark, or Flink (useful articles on each!).


Lots of great papers covered in The Morning Paper these past few weeks—I've picked one of my favorites for this issue. It covers the notion of a multiverse database, one in which each user sees a "parallel universe" containing only the data that they have permission to read. To improve performance, the universes are partially pre-computed (striking a balance between read latency and disk usage).
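
To make the idea concrete, here's a toy Python sketch of the multiverse concept, assuming nothing about the paper's actual implementation—the `Row`, `build_universes`, and `query` names are all illustrative:

```python
# Hypothetical sketch: each user queries a pre-computed "universe"
# containing only the rows they are allowed to read.
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    id: int
    body: str
    readers: frozenset  # user ids allowed to see this row

TABLE = [
    Row(1, "public announcement", frozenset({"alice", "bob"})),
    Row(2, "alice's draft", frozenset({"alice"})),
    Row(3, "bob's salary", frozenset({"bob"})),
]

def build_universes(table):
    """Pre-compute one filtered view per user (trading disk for read latency)."""
    universes = {}
    for row in table:
        for user in row.readers:
            universes.setdefault(user, []).append(row)
    return universes

UNIVERSES = build_universes(TABLE)

def query(user, predicate):
    """Reads never touch the base table, only the user's universe."""
    return [r for r in UNIVERSES.get(user, []) if predicate(r)]
```

The interesting systems work in the paper is in keeping these materialized universes fresh and partial rather than fully pre-computing them as this sketch does.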

Qubole writes about an upcoming Presto feature, Optimized Local Scheduling. By taking advantage of split locality, the new scheduler improves cache reuse and minimizes the amount of data transferred over the network; for some workloads, they see 9x improvements. If you want to read the original design doc, there's a link at the bottom of the page to the GitHub issue that references it.
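
The core idea of locality-aware scheduling can be sketched in a few lines of Python. This is an illustration of the general technique, not Presto's actual scheduler: prefer a worker co-located with the split's data (so its cache can be reused) unless that worker is already overloaded.

```python
# Illustrative locality-aware split scheduler (not Presto's implementation).
# splits: list of (split_id, preferred_hosts); workers: {worker_name: host}.
def schedule(splits, workers, max_pending=10):
    load = {w: 0 for w in workers}       # pending splits per worker
    by_host = {}
    for w, host in workers.items():
        by_host.setdefault(host, []).append(w)
    assignment = {}
    for split_id, preferred_hosts in splits:
        # Workers co-located with the split's data get first priority.
        local = [w for h in preferred_hosts for w in by_host.get(h, [])]
        # Fall back to any worker if the local ones are saturated.
        candidates = [w for w in local if load[w] < max_pending] or list(load)
        target = min(candidates, key=lambda w: load[w])
        load[target] += 1
        assignment[split_id] = target
    return assignment
```

The `max_pending` threshold is the knob that trades locality against load balance: too low and cache reuse suffers, too high and one hot host becomes a straggler.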

Pinterest has open sourced Singer, their logging agent that sends data to Apache Kafka for centralized logging. Singer provides Java and Python libraries for both text and Thrift format data, and it supports logging in Kubernetes via a sidecar. It also includes a heartbeat mechanism and produces audit records for centralized monitoring and alerting.

Yotpo writes about their change data capture pipeline for MySQL, which is built with Debezium, Apache Kafka, Apache Hudi, and their open source ETL framework Metorikku. Hudi looks neat—you write your records, and Hudi takes care of incremental updates based on key and time columns plus a partition. The post also includes a bit on how they monitor the pipeline.
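
The key-plus-time-column upsert semantics can be modeled in plain Python. This is a conceptual analogue, not Hudi's API—the `upsert` function and column names are made up for illustration:

```python
# Plain-Python model of Hudi-style upserts: incoming records are merged
# into the table by record key, keeping whichever row has the latest
# value in the time (precombine) column.
def upsert(table, incoming, key="id", ts="updated_at"):
    merged = {row[key]: row for row in table}
    for row in incoming:
        current = merged.get(row[key])
        # Newer (or equal) timestamp wins; stale CDC events are dropped.
        if current is None or row[ts] >= current[ts]:
            merged[row[key]] = row
    return list(merged.values())
```

This is what makes CDC streams idempotent to replay: re-delivering an old change event can't overwrite a newer row.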

A good tutorial for running Apache Airflow on the Azure Kubernetes Service. It covers both the general Kubernetes configuration (via Helm and the Airflow Kubernetes executor) and some Azure-specific pieces (the container registry, postgres, and Azure File Share). They also use a tool called chaoskube to occasionally restart the Airflow scheduler to work around some bugs.
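
For reference, enabling the Kubernetes executor boils down to a few `airflow.cfg` settings like the following (the tutorial sets these via Helm values rather than editing the file directly; the registry name below is a made-up placeholder):

```
[core]
executor = KubernetesExecutor

[kubernetes]
namespace = airflow
# Image that worker pods are launched from (placeholder registry/repo).
worker_container_repository = myregistry.azurecr.io/airflow
worker_container_tag = latest
```

With the Kubernetes executor, each task instance runs in its own pod, so worker capacity scales with the cluster rather than with a fixed pool of Celery workers.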

This week's coverage of distributed transactions comes from the FaunaDB blog, which covers time-travel anomalies under serializability and one-copy serializability guarantees. These include immortal writes, stale reads, and causal reverse; all three are described with some good diagrams to illustrate what happens. The post wraps up with a matrix of several serializable guarantees and the anomalies that can occur under each.
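
A stale read is the easiest of the three to demonstrate. Here's a toy Python illustration (not taken from the post): a store that serves reads from a snapshot can return data from before a committed write and still be "serializable," because the outcome matches a serial order in which the read simply ran first.

```python
# Toy store: writes commit to `data`, reads are served from a snapshot
# that may lag behind.  The stale read below is still serializable --
# it's equivalent to ordering the read before the second write.
class SnapshotStore:
    def __init__(self):
        self.data = {}      # committed state
        self.snapshot = {}  # read view, refreshed lazily

    def write(self, key, value):
        self.data[key] = value

    def refresh(self):
        self.snapshot = dict(self.data)

    def read(self, key):
        return self.snapshot.get(key)

store = SnapshotStore()
store.write("balance", 100)
store.refresh()
store.write("balance", 50)        # committed, but snapshot not refreshed
stale = store.read("balance")     # returns 100: stale but serializable
```

This is exactly why the post distinguishes plain serializability from stronger guarantees (like strict serializability) that also respect real-time order.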

Apache Kafka 2.3 is out, and the Confluent blog has a post summarizing new features. These include improved monitoring, a maximum log compaction lag (important for GDPR), faster broker startup times, and improvements to Kafka Connect (e.g. a new cooperative rebalancing protocol that improves performance during a reconfiguration) and to Kafka Streams (e.g. an in-memory window store).
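
The maximum compaction lag is a topic-level setting (added via KIP-354). A minimal config sketch for a compacted topic—the topic name and seven-day value here are just examples:

```
# Compacted topic with a bounded compaction lag (Kafka 2.3+, KIP-354).
cleanup.policy=compact
# A record becomes eligible for compaction at most 7 days after it is
# written, so a tombstone reliably purges a user's data within that
# window -- the GDPR-relevant piece.
max.compaction.lag.ms=604800000
```

Before 2.3, compaction timing was driven purely by dirty-ratio heuristics, so there was no upper bound on how long a deleted record could linger in the log.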

A handy collection of tools for monitoring and investigating the state of your Postgres cluster. There are scripts for tasks like finding unused indexes and measuring table bloat (from deleted but not-yet-vacuumed records). The list describes the permissions each script needs, what it returns, and what you should do with that information.

The Apache Flink blog has a post on broadcast state (available since Flink 1.5), which can be used to join a low-throughput stream with a high-throughput one. This enables a number of interesting use cases, such as using the low-throughput stream to propagate dynamic configuration. The post has good diagrams and some example code to demonstrate things.
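
The pattern itself is independent of Flink. Here's a plain-Python analogue (not the Flink API): the low-throughput side updates shared rule state, and every event on the high-throughput side is evaluated against the current rules.

```python
# Broadcast-state pattern, modeled without Flink: rules arrive rarely
# and mutate shared state; events arrive constantly and only read it.
class BroadcastJoin:
    def __init__(self):
        self.rules = {}  # broadcast state: rule_id -> predicate

    def on_rule(self, rule_id, predicate):
        # Broadcast side: dynamic configuration update.
        self.rules[rule_id] = predicate

    def on_event(self, event):
        # Non-broadcast side: read-only access to the broadcast state.
        return [rid for rid, pred in self.rules.items() if pred(event)]

join = BroadcastJoin()
join.on_rule("big", lambda e: e["amount"] > 100)
matches = [join.on_event(e) for e in [{"amount": 50}, {"amount": 500}]]
```

In Flink proper, the runtime replicates the rule state to every parallel instance of the operator, which is what makes the join cheap despite the event stream's volume.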

cleanframes is a new library for Apache Spark that uses Scala implicits to apply a number of data cleansing functions to your Spark DataFrames as they're loaded. There's a two-part post introducing the library and describing how to extend it with custom transformations and more.
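
Conceptually, the library does per-column cleansing at load time, turning unparseable values into nulls instead of failing the whole job. A rough plain-Python analogue (cleanframes itself is Scala, and these function names are invented for illustration):

```python
# Per-column cleansing at load time: bad values become None rather
# than raising, mirroring what cleanframes does for Spark columns.
def clean_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def load_rows(raw_rows, cleaners):
    """cleaners: {column: cleansing function} applied to each raw row."""
    return [{col: fn(row.get(col)) for col, fn in cleaners.items()}
            for row in raw_rows]

rows = load_rows([{"age": "42"}, {"age": "not-a-number"}],
                 {"age": clean_int})
```

The implicit-resolution trick in the real library is what lets you get this behavior by declaring the target schema, rather than wiring up a cleanser per column by hand.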

If you can get past the sensational headline, this article has a lot of good coverage of the evolution of the Apache Hadoop ecosystem over the past several years—e.g. we've only had columnar storage in Hadoop since 2012, and Apache Airflow has become a more active project than Apache Oozie. There's also a set of rebuttals to common arguments against Hadoop/big data systems (e.g. "the dataset can fit in memory" and "just use Postgres").


Curated by Datadog



Apache Beam Meetup 1 @ Criteo (Paris) - Monday, July 1


From Zero to Hero with Kafka Connect (Munich) - Wednesday, July 3

Let's Talk about MQTT in Real Life (Hamburg) - Wednesday, July 3

Apache Flink Meetup @ Amazon (Munich) - Thursday, July 4

Writing Stream Processors in Kotlin (Dusseldorf) - Thursday, July 4


Kafka Hack Night (Bern) - Thursday, July 4


Apache Kafka & New Ways of Stream Processing with KSQL (Wien) - Tuesday, July 2


Demystifying Kafka (Sofia) - Thursday, July 4


Strimzi: Distributed Streaming with Apache Kafka in Kubernetes (Istanbul) - Thursday, July 4


Services, Deployments, and Fun with Kafka War Stories (Ramat Gan) - Monday, July 1

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.