Data Eng Weekly

Data Eng Weekly Issue #308

07 April 2019

It was tough choosing from all the great articles from the last two weeks—so much great stuff. This issue covers Lyft's metadata engine, local development with Apache Airflow, optimizing queries, testing Postgres DB performance with production workloads, and more. There's also a great video on idempotence and an article on learning to build distributed systems.


It turns out that there are some well-researched formulas for generating synthetic events (spacing them out over time in a realistic way), and a small tweak makes them credible for real-world simulation (where time of day is also important). This post describes the formulas and has Python code to do the generation, which should be useful for realistic internal test data.
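
To make the idea concrete, here is a minimal sketch and not the linked post's code. It assumes the "well-researched formula" is a Poisson process (exponentially distributed gaps between events) and that the time-of-day tweak can be approximated by thinning against a daily rate curve. The function names and the cosine-shaped `daily_rate` curve are hypothetical choices for illustration.

```python
import math
import random


def poisson_event_times(rate_per_hour, duration_hours, rng):
    """Return sorted event times (in hours) for a constant-rate Poisson process."""
    times = []
    t = 0.0
    while True:
        # Gaps between events in a Poisson process are exponentially distributed.
        t += rng.expovariate(rate_per_hour)
        if t >= duration_hours:
            return times
        times.append(t)


def daily_rate(hour_of_day, base_rate, peak_rate):
    """Hypothetical daily curve: base_rate at midnight, peak_rate at noon."""
    weight = 0.5 * (1 - math.cos(2 * math.pi * hour_of_day / 24))
    return base_rate + (peak_rate - base_rate) * weight


def time_of_day_event_times(base_rate, peak_rate, duration_hours, rng):
    """Thin a peak-rate stream so the event rate follows daily_rate().

    Assumes 0 <= base_rate <= peak_rate and peak_rate > 0.
    """
    kept = []
    for t in poisson_event_times(peak_rate, duration_hours, rng):
        # Keep each candidate with probability rate(t) / peak_rate.
        if rng.random() < daily_rate(t % 24, base_rate, peak_rate) / peak_rate:
            kept.append(t)
    return kept


if __name__ == "__main__":
    events = time_of_day_event_times(2, 20, 48, random.Random(42))
    print(len(events), "events over 48 hours")
```

Seeding the generator keeps the test data reproducible, which tends to matter more for internal fixtures than statistical purity.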

Idempotence is often a desirable feature in distributed systems, because it makes retrying after failures much easier. This video (or the transcript if you'd prefer) covers in detail why idempotence is important, and what some common strategies are for implementing it (using email sends as an example).
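
One common strategy is an idempotency key: the caller attaches a stable key to each logical request, and the receiver ignores repeats. The sketch below is an illustration only and is not taken from the video. `EmailSender`, its in-memory `sent_keys` set, and the `outbox` list standing in for a mail server are all hypothetical.

```python
class EmailSender:
    """Hypothetical sender that remembers which request keys it has handled."""

    def __init__(self):
        self.sent_keys = set()  # assumption: a real system needs durable storage
        self.outbox = []        # stands in for a real mail server

    def send(self, idempotency_key, recipient, body):
        """Send once per key; return True if sent, False if it was a repeat."""
        if idempotency_key in self.sent_keys:
            return False
        self.outbox.append((recipient, body))
        self.sent_keys.add(idempotency_key)
        return True


if __name__ == "__main__":
    sender = EmailSender()
    sender.send("order-42-receipt", "a@example.com", "Thanks for your order")
    # A retry after a timeout reuses the same key and does not send twice.
    sender.send("order-42-receipt", "a@example.com", "Thanks for your order")
    print(len(sender.outbox))
```

In a real service the check and the record would need to be atomic and survive restarts, which this in-memory version does not attempt.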

Apache Arrow 0.13 was released. Included for the first time is the Rust-native query engine, DataFusion. There's a separate post about the DataFusion project, which supports SQL queries of Apache Parquet files and has a new experimental DataFrame-style API.

Lyft's Amundsen is their metadata collection and discovery engine, which indexes data from data stores, dashboards, schemas, streams, and more. The post highlights a number of important things to consider for a system, like the ABCs of metadata (application context, behavior, and change), the discovery vs. curation trade-off, and compliance (e.g. for GDPR). Lots of great inspiration in this post, like the types of details they highlight for each dataset, their search-centric UI, and how they collect feedback.

This post on learning to build distributed systems describes why it's hard to get started with building them and offers several strategies for bootstrapping your knowledge, like reading academic papers and watching videos from practitioners. Lots of great tips inside, many of which revolve around embracing and learning from failure (which is inevitable in distributed systems).

Whirl is a tool for standing up an Apache Airflow environment for testing using Docker containers. While you can sometimes write unit tests for your data pipeline, it's often pretty difficult to do more sophisticated testing. Perhaps Whirl with Minio and similar tools can get you most of the way there.

A great collection of tips for speeding up your PostgreSQL queries, both by optimizing the queries themselves (in several ways) and by adding appropriate indexes.

Datadog writes about how they build a highly reliable pipeline (they have a nice elaboration of what this means) for their batch processing built on Luigi and Apache Spark. The post covers how they optimize pipelines for recoverability by breaking down jobs into smaller chunks. This goes hand in hand with monitoring, including how they track a metric for data latency to detect jobs that are falling behind.
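
As a rough illustration of chunking for recoverability plus a data-latency metric, here is a small sketch. It is not Datadog's implementation and uses neither Luigi nor Spark. The one-hour chunk size, the function names, and the in-memory `completed` set are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def hourly_chunks(start, end):
    """Split [start, end) into (chunk_start, chunk_end) windows of up to one hour."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + timedelta(hours=1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


def run_pending(chunks, completed, process):
    """Run only chunks not yet completed, so a retry skips finished work."""
    for chunk in chunks:
        if chunk in completed:
            continue
        process(chunk)        # may raise; earlier chunks stay marked complete
        completed.add(chunk)


def data_latency(completed, now):
    """Age of the newest fully processed data; it grows when jobs fall behind.

    Raises ValueError if no chunk has completed yet.
    """
    newest = max(chunk_end for _, chunk_end in completed)
    return now - newest


if __name__ == "__main__":
    chunks = hourly_chunks(datetime(2019, 4, 7, 0), datetime(2019, 4, 7, 3))
    completed = set()
    run_pending(chunks, completed, lambda chunk: None)
    print(data_latency(completed, datetime(2019, 4, 7, 5)))
```

Because each chunk is recorded as it finishes, a failure partway through costs at most one chunk of rework, and alerting on `data_latency` catches a pipeline that is running but not keeping up.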

Great story of how one company moved a bunch of data across cloud providers. Lots of details about how they managed to move the backend without downtime, and the types of hardening (such as a kill switch) that went into the data migration service.

Replaying database queries on a non-live DB for load testing or baking seems like a great idea but oftentimes doesn't happen because it's too complicated. This article introduces a new tool, written in Go, to parse logs and replay queries. Seeing how this tool is built makes query-replay less intimidating and provides valuable insight into load testing your own database.

Twitter writes about their investigation into a performance issue with their Redis-based caching system. They dive really deep into Redis' cache invalidation codepath, going as far as to analyze the assembly instruction-by-instruction. It's a good example of how distributed systems and traditional system optimization go hand-in-hand.


Curated by Datadog


#SDBigData Meetup #26 (San Diego) - Tuesday, April 16

Cassandra & Multi-Cloud + Ingest Data from RDBMS to Cassandra w/ StreamSets (Santa Monica) - Thursday, April 18


Real-Time Streaming Analytics (Bellevue) - Tuesday, April 16

Seattle Apache Kafka Meetup (Bellevue) - Thursday, April 18


Unified Next-Gen Data Platform, Presented by DataStax (Coppell) - Wednesday, April 17


Running Kafka on k8s + Monitoring with Prometheus (Tampa) - Wednesday, April 17


3rd Data Engineering Meetup (Belo Horizonte) - Tuesday, April 16

IRELAND Tech Talk at Night #6: Dublin Backend (Dublin) - Monday, April 15


Thorough Introduction to Kafka + Blockchain Insights with Kafka (London) - Wednesday, April 17


Apache Airflow Code Breakfast (Amsterdam) - Friday, April 19


Kafka on Kubernetes + Pipelines with Hadoop (Munich) - Monday, April 15


April Meetup: Druid, Flink, Kafka for Data Analytics (Bucharest) - Thursday, April 18

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.