Data Eng Weekly

Data Eng Weekly Issue #303

24 February 2019

Several posts this week on RDBMS performance analysis & optimization, change data capture, and systems built with Kafka. There's also a great post about implementing unified event logging and a new paper on distributed consensus. In releases, Flink, Kafka, Pulsar, and NiFi all announced new versions this week.


Deliveroo writes about how they've designed a Kafka data platform using Protobuf messages. The post describes how they strictly enforce schemas using tests written with the Protobuf Java API and at runtime with a clever configuration of producer authentication.

Citus Data has a post on the pg_stat_statement extension for PostgreSQL, which records metrics about queries and exposes those for monitoring and analysis.

If you've used a relational database long enough, you've probably been there—a query that used to be fast suddenly slows down. After identifying the root cause (the optimizer was choosing the wrong index), the post describes how to perform table optimization and analysis in MySQL without the normal downtime for those operations.

An introduction to and tutorial for running FoundationDB on AWS. There's also some example python code for loading and querying the database.

Another database performance tuning post describes how to force MySQL to use an index at query time. The article has a good explanation of how the database optimizer decides whether or not to use an index... and why it makes the wrong decision in this example.

A collection of notebooks documenting best practices for working with PySpark.

At first blush, SQL might seem powerful for simple queries but limited in more complex computations. This post describes a couple of situations where SQL can outperform a python script by 1) using window functions and 2) parsing and manipulating JSON data (both operations use some advanced SQL features).

A look at how Intercom uses DynamoDB Streams (a change data capture feature of Amazon DynamoDB) with a Lambda function to compute and store JSON PATCH data about historical changes to their data sets.

This relatively easy to read paper on distributed consensus describes a new, generalized solution to distributed consensus using write-once registers. If you've enjoyed the Paxos and related papers, then this one is a must read.

The Debezium blog has a post that discusses the tradeoffs of event log architectures (write to Kafka first vs. change data capture) and proposes the outbox pattern, which softens some of the tradeoffs. Briefly, the idea is to have an outbox table with a generic field in which you can write JSON or other event data as part of the same transaction as your DB operations. This table is mirrored to Kafka using a change data capture tool like debezium.

The Last Pickle has some advice for distributing your keyspace when building a new Apache Cassandra cluster by specifying initial tokens for seed nodes. The post goes into detail on how to calculate optimal configuration and bring up a new cluster with the right settings.

Good description of the components of a unified event logging framework and some practical advice for standing up the main components. The post covers how one company does event validation against a schema, implements schema composition for Avro, and uses heartbeat events to detect data loss.

You may have heard about the recent PostgreSQL fix to their fsync (data sync to disk) implementation that impacts several versions of the database. This post from Percona describes the problem, the fixes in recent updates, and several other details.

This three part post describes an analytics project that consumes data from the Cosmos DB change feed feature using Azure Databricks. Lots of pieces and good details on how to implement something with these tools.

A good introduction to testing Apache Airflow tasks and DAGs using pytest. The post covers things like mocking Airflow, mocking external dependencies (e.g. Postgres), and fixtures for Airflow DAGs. It also covers debugging tests using the python debugger, pdb.

This post walks through building a Kafka Connect source connector for an HTTP API. It is written in Scala, the tradeoffs of which are covered in the post.

AWS has a post on the steps that EMR takes to minimize the impact of lost nodes during a Spark job, which can cause major delays in cases where intermediate data needs to be recalculated. Some of the actions, including using YARN decommissioning & enabling several Spark options, can also be used in other cases where you might need to decommission nodes in a Spark cluster with minimal impact.

This tutorial covers feeding NGINX logs into Apache Kafka (using either kafkacat or the python producer), and using Apache Flume to write data from Kafka to HDFS.

Apache NiFi can infer the schema of data it processes and convert records to Apache Avro. This walkthrough covers the two ways it does that—either at ingestion or while deserializing data.


Data Engineer - Python, Wooga, Berlin

Software Engineer, Value Platform, Nuna, Inc., San Francisco

Data Engineer, Starship Technologies, Tallinn, Estonia


The call for papers for Kafka Summit San Francisco is now open through April 15th. The conference takes place September 30th and Oct 1st. Meanwhile, the agenda for Kafka Summit London, which takes place in May, has been announced.

An overview of common responsibilities that define data engineering and data scientist roles. The post also talks about tools and provides an example of the distinction between the two roles.

The Alibaba blog has an article describing the history of the Flink API, Flink Checkpointing & Recovery, and the Flink Runtime from version 1.0.0 through the 1.6.0 release. Lots of interesting historical context that show the evolution of the Flink project over the last few years.


Apache Flink 1.7.2 was released with over 40 fixes and improvements.

Apache Kafka 2.1.1 is out with over 40 bug fixes and improvements.

Databricks Delta has new features for converting tables to Parquet, time travel, and for merging data. There are more details in the announcing blog post.

The 2.3.0 release of Apache Pulsar is out. It has a number of new features like zstd compression, token baed authentication, and optimizations for the Presto connector. There are also enhancements to Pulsar IO (several new connectors) and Pulsar Functions (including a Kubernetes runtime).

Apache NiFi 1.9.0 was released with improved integration with Apache Kudu, Apache Impala, Google Big Query, and Amazon Web Services. There are over 100 improvements in the release.

Marqeta has open sourced Quantum, an engine for time series aggregation that uses Redis Cache for storing materialized aggregate data. Aggregations are defined using a YAML file, and it has a simple DSL called Q for querying data.


Curated by Datadog ( )


Data Updates in HDFS-like Stores Using Apache Hive, Hudi, or Iceberg (San Jose) - Tuesday, February 26


Kafka's Core Concepts by David Greenfield (Chicago) - Wednesday, February 27

North Carolina

High-Performance Stream-Oriented Processing Systems for IoT (Charlotte) - Tuesday, February 26


Data Processing Platforms (Toronto) - Tuesday, February 26

An Introduction to Streaming Data and Stream Processing with Apache Kafka (Toronto) - Wednesday, February 27


First Data Engineering Meetup (Belo Horizonte) - Tuesday, February 26


Making Stream Processing Scale (London) - Tuesday, February 26


Data Engineers Meetup @ Datadog (Paris) - Tuesday, February 26


Real-Time Fast Data in a Microservice Architecture (Dedemsvaart) - Thursday, February 28


First Apache Kafka Meetup Stuttgart 2019 (Leinfelden-Echterdingen) - Wednesday, February 27

Kafka Meetup (Munich) - Thursday, February 28


Meetup on Kafka & Strimzi (Bern) - Thursday, February 28


Streaming Topic Model with Apache Flink + Using IoT Analytics to Save the Planet (Krakow) - Friday, March 1

DataMass Meetup #2: Kafka for Beginners (Gdansk) - Friday, March 1


Apache Kafka in Production (Tel Aviv-Yafo) - Wednesday, February 27


Apache Kafka and KSQL (Sandton) - Wednesday, February 27


Inaugural Melbourne Data Engineering Meetup (Melbourne) - Wednesday, February 27