Data Eng Weekly

Data Eng Weekly Issue #312

27 May 2019

I ended up taking last week off, so this issue has slightly more articles than the 10 per issue I've been targeting. There's lots of good stuff to catch up on, including several posts on data pipelines and data quality (from Amazon, Stitch Fix, and more), as well as some pertinent posts on SQL debugging, PostgreSQL's vacuum, Kubernetes health checks, and deduplicating events.


This article is a great introduction to Online Event Processing (OLEP), an alternative to transactions for heterogeneous distributed systems. The authors also describe how OLEP can replace OLTP systems, and how you can implement atomicity (but repeatable reads across downstream systems are not currently possible). Lots of good stuff in this well-written, easy-to-consume article about how to build a system with a distributed log.

Pylint-airflow is a new pylint plugin to find (stylistic) errors in your Airflow programs. It has a few checkers for now, covering things like unused XComs and mixed dependency directions.

Apache Flink 1.7 introduced temporal tables, which store a time series. This post looks at the Flink APIs for joining a stream with a temporal table (you must also define a temporal table function) using the canonical example of currency exchange rates.
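
To make the idea concrete, here is a minimal sketch of a temporal table join in plain Python. It is not the Flink API: the names `TemporalTable`, `value_at`, and `convert_orders` are hypothetical, and it assumes integer timestamps and rates held in memory. Each order is joined with the exchange rate that was valid at the order's own time.

```python
from bisect import bisect_right


class TemporalTable:
    """Keeps every version of a value per key, ordered by time (hypothetical helper)."""

    def __init__(self):
        self._times = {}   # key -> sorted list of update times
        self._values = {}  # key -> values aligned with self._times[key]

    def update(self, key, time, value):
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        idx = bisect_right(times, time)  # keep both lists sorted by time
        times.insert(idx, time)
        values.insert(idx, value)

    def value_at(self, key, time):
        """Return the latest value with update time <= time, or None."""
        times = self._times.get(key, [])
        idx = bisect_right(times, time)
        if idx == 0:
            return None  # no version existed yet at this time
        return self._values[key][idx - 1]


def convert_orders(orders, rates):
    """Join each (time, currency, amount) order with the rate valid at its time."""
    for time, currency, amount in orders:
        rate = rates.value_at(currency, time)
        if rate is not None:  # inner-join semantics: drop orders with no rate
            yield time, currency, amount * rate


rates = TemporalTable()
rates.update("EUR", 1, 114)
rates.update("EUR", 5, 116)
orders = [(2, "EUR", 2), (6, "EUR", 1), (0, "EUR", 1)]
converted = list(convert_orders(orders, rates))
```

The order at time 0 is dropped because no rate version existed yet, which is the behavior an inner join against a versioned table would give.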

It hasn't been that long since Google wrote about using deep-learning models to create database indexes. A new paper applies a similar technique to detecting Quality of Service violations in distributed systems. Similar to distributed tracing systems like Zipkin and Jaeger, services and clients are instrumented to capture latencies. This data is then used to train a Deep Neural Network that can predict latency violations.

Amazon has a post describing how they use Deequ, a Spark library, for testing the quality of data sets. Deequ, which is open source, has a lot of built-in constraint verification functions, like evaluating minimum/maximum/null values, testing for correlations, and checking approximate counts (using HLL++).

A look at the new features of Apache Avro 1.9: a new version of Jackson, a switch to the Java 8 date time API, and support for Zstandard. Lots of good stuff for a project that's at the heart of a lot of data platforms.

A good walkthrough of debugging a slow SQL query for a major (>700x) improvement in performance. The author describes isolating the problem, finding a different solution, and verifying correctness.

A look at the notion of a data mesh as the next evolution of data platforms to replace data warehouses and data lakes. These monolithic data repositories suffer from the same problems as a monolithic services architecture (like tight coupling), and the data mesh is meant to be product driven (with product owners for each data product) to improve agility. A common data infrastructure platform, to solve things like security and addressability of data (e.g. an S3 bucket or a Kafka topic), is also an important component. Lots of good ideas worth incorporating into your data platform.

A look at your options for implementing a Kubernetes liveness probe (health check) for Kafka Streams (or any other JVM application that exposes some useful metrics via JMX). The post also describes how to expose metrics for ingestion into Prometheus.

Stitch Fix shares several best practices for writing maintainable data pipelines. These include leveraging SQL, writing maintainable SQL queries by utilizing Common Table Expressions, using a workflow engine, and implementing data quality checks. Also notable: at Stitch Fix, data scientists own their own pipelines end to end (including maintenance after launch).
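
As a rough illustration of two of those practices (not code from the Stitch Fix post), the sketch below uses Python's built-in `sqlite3` module to run a query built from Common Table Expressions, followed by a simple data quality check. The `orders` table, its columns, and the CTE names are all made up for the example.

```python
import sqlite3

# In-memory database with a tiny, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, 30), (2, 5), (3, NULL);
    """
)

# Each CTE names one step, so the final SELECT reads top to bottom
# instead of as nested subqueries.
QUERY = """
WITH valid_orders AS (
    SELECT customer_id, amount
    FROM orders
    WHERE amount IS NOT NULL
),
customer_totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM valid_orders
    GROUP BY customer_id
)
SELECT customer_id, total
FROM customer_totals
ORDER BY customer_id
"""
rows = conn.execute(QUERY).fetchall()

# A simple data quality check: count rows with a missing amount so the
# pipeline can alert or fail when the number is unexpectedly high.
null_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
).fetchone()[0]
```

In a real pipeline the check would typically run as its own task in the workflow engine, so that a failure blocks downstream steps.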

This post has details on PostgreSQL's Vacuum functionality, which is used to reclaim disk space and optimize the database. For large databases, this process can take many days. Using pg_stat_progress_vacuum to measure the status of the vacuum, the authors plot several visualizations of the vacuum progress. With these, they explain what's going on (at a high level) and have several links if you want to dive into even more details.

A common approach to idempotent writes of data is to dedup based on a unique ID. This can be an expensive process, since writes cause reads (or multiple read operations, depending on the database). This post describes how to use bloom filters for deduplication, which in this use case provided major speed improvements as well as pretty impressive cost savings in infrastructure given the improved efficiency.
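
Here is a minimal sketch of the general technique, not the implementation from the post. The `BloomFilter` class, the `dedup_write` function, and the dict standing in for the database are all hypothetical, and the filter size and hash count are arbitrary. The key property is that a Bloom filter never gives a false negative, so an ID it reports as unseen can be written without reading from the store first.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive all bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def dedup_write(event_id, payload, bloom, store):
    """Write payload once per event_id; return True if it was written."""
    # Only a "maybe seen" answer needs the expensive lookup in the store.
    if bloom.might_contain(event_id) and event_id in store:
        return False
    store[event_id] = payload
    bloom.add(event_id)
    return True


store = {}  # stand-in for the real database
bloom = BloomFilter()
first = dedup_write("evt-1", {"n": 1}, bloom, store)
second = dedup_write("evt-1", {"n": 2}, bloom, store)
third = dedup_write("evt-2", {"n": 3}, bloom, store)
```

This only works if the filter has seen every ID already in the store, so a real deployment would need to rebuild or persist the filter across restarts.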


Curated by Datadog



Running Flink Applications in Kinesis Analytics (Seattle) - Thursday, May 30


Kafka on Kubernetes: Does It Really Have to Be “The Hard Way”? (Plano) - Tuesday, May 28


Splice Machine and Hadoop with Clearsense (Jacksonville) - Tuesday, May 28


May Meetup Night (Danvers) - Thursday, May 30


Data & Analytics Meetup (Sao Paulo) - Tuesday, May 28


Tips and Tricks about Apache Kafka in the Cloud for Java Developers (Madrid) - Wednesday, May 29


Apache Kafka: Optimizing Your Deployment (Paris) - Tuesday, May 28

Scio: A Scala DSL for Beam + Spark Best Practices (Paris) - Tuesday, May 28


The 10th Apache Kafka Meetup (Utrecht) - Tuesday, May 28


From Batches to Events (Stuttgart) - Wednesday, May 29


Context Buddy + Streaming Pipelines Using Apache Kafka (Krakow) - Tuesday, May 28


Modern Big Data and Analytics Architecture Patterns on AWS (Tel Aviv-Yafo) - Tuesday, May 28

Women in Big Data Israel: First Meetup! (Tel Aviv-Yafo) - Thursday, May 30


Apache Kafka Is More ACID Than Your Database (Singapore) - Wednesday, May 29

Traveloka: How We Run Cloud-Scale Apache Spark in Production Since 2017 (Singapore) - Wednesday, May 29

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.