Data Eng Weekly

Data Eng Weekly Issue #298

21 January 2019

It seems like there must have been a lot of New Year's resolutions to write blog posts, because this week's issue is a huge one. Lots of variety and coverage of less common topics, like workflow engines, Apache Pig/DataFu, and LSMs. In news and releases, there are conference videos (if you need even more content!) and a couple of new open source projects to check out.


As this post describes, to prevent tight coupling with your database system, you shouldn't rely on a DB-generated ID.
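
As a rough illustration of the idea (not code from the linked post), here is a minimal Python sketch in which the application assigns the ID itself. The `Order` class and its fields are hypothetical, and a random UUID is just one possible choice of ID scheme.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Order:
    # Hypothetical entity. The ID is assigned by the application when the
    # object is created, not by an auto-increment column in the database.
    customer: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


order = Order(customer="alice")
# The ID is known before any INSERT is issued, so it can be logged,
# referenced by related records, or published to a queue right away.
print(order.id)
```

Because the ID does not depend on a round trip to the database, the same code works unchanged if the storage layer is swapped out.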

Sendbird writes about how they use HAProxy as a router/health checker for talking to replicated Redis.

Digdag is an open-source workflow engine from Treasure Data. Culture Amp writes about their choice to use it instead of Apache Airflow.

"Task Failed Successfully" is a talk about a new workflow engine that aims to solve the shortcomings (and/or hacky workarounds) of other workflow engines. The system, Prefect, accomplishes this with new primitives to represent complex relationships between tasks, pass data, and react to failure (or non-binary task results)--see the second link for an example. If you like to geek out on workflow systems like I do, then I highly recommend watching the whole talk.

The Banzai Cloud provides a set of open source tools on top of Kubernetes, including a "spotguide" for running Apache Spark. They describe some of its features, including monitoring and log aggregation.

Qubole has a tutorial for building an application in their managed Apache Spark service that interacts with Amazon Managed Streaming for Kafka.

Many distributed systems attempt to be self-healing. In this case, that self-healing causes cascading failures across a number of complex systems: Apache Kafka, Kubernetes, Consul, and Vault. An interesting read of what happened and how they ultimately fixed the problem.

This post demonstrates two extensions for the Confluent schema registry—the first makes certain topics read-only, and the second provides a single page webapp to view the contents of the registry.

In the first DEW post on Apache Pig in some time, PayPal writes about several UDFs they've written for Pig and contributed to the Apache DataFu project. These include macros for diffing objects, deduplicating records, and sampling data by a key.

BBM has a blog series on how they've optimized scaling of Apache Zeppelin and Apache Spark by using autoscaling clusters and Spark's dynamic allocation.

The Confluent blog has a tour of various types of testing, along with available tools, for Kafka streaming applications. The post covers unit testing, integration testing, Avro and Schema Registry testing, and multi-datacenter simulation testing (based on Docker Compose).

A transcript, slides, and video of a talk on how many high-performance data systems (e.g. Apache Cassandra or those using RocksDB) store data on disk using Log-Structured Merge-Trees (LSM). A key feature of LSMs is the ability to write data quickly by avoiding random disk I/O.
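
To make the write path concrete, here is a toy, in-memory Python sketch of the general LSM idea. It is not taken from the talk, and `TinyLSM` and its methods are hypothetical names. Writes go to a mutable buffer, full buffers are flushed as immutable sorted runs (on a real system, one sequential disk write), and reads check the newest data first.

```python
import bisect


class TinyLSM:
    """Toy in-memory model of an LSM tree (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}      # mutable buffer for recent writes
        self.segments = []      # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # A real engine would write this sorted run to disk sequentially.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest segment first, so later writes shadow earlier ones.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:   # oldest to newest
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # buffer is full, so it is flushed as a sorted run
db.put("a", 3)
db.put("c", 4)   # second flush; "a" now lives in two runs
print(db.get("a"), db.get("b"), db.get("z"))
db.compact()
print(db.segments)
```

The trade-off shows up in `get`: a read may have to consult several runs, which is why real engines add compaction, indexes, and Bloom filters.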

By default, Apache Hive and Apache Impala might not be doing what you'd expect when reading columns out of an Apache Parquet file. This post walks through some examples and summarizes how to fix it if you stumble across the same issue.

This post has an introduction to higher-order functions in Apache Spark 2.4.0, which enables a whole new slew of functions that take a lambda function as a parameter. There are some examples for nested arrays of data (e.g. filter, exists, and aggregate).
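
For readers who haven't used these functions yet, here is a plain-Python analogue of what they compute for a single row. It does not use Spark. The comments show what I believe is the corresponding Spark SQL expression, and the `row` data is made up for illustration.

```python
from functools import reduce

row = {"values": [1, 2, 3, 4]}   # hypothetical row with an array column

# Spark SQL: filter(values, x -> x % 2 = 0)
evens = [x for x in row["values"] if x % 2 == 0]

# Spark SQL: exists(values, x -> x > 3)
has_large = any(x > 3 for x in row["values"])

# Spark SQL: aggregate(values, 0, (acc, x) -> acc + x)
total = reduce(lambda acc, x: acc + x, row["values"], 0)

print(evens, has_large, total)
```

In Spark, the lambda is applied to each row's array inside the engine, which avoids exploding the array into rows or writing a UDF.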

A deep dive into the data replication architecture of Elasticsearch, including discussion of how the cluster responds to various failure scenarios and the PacificA distributed consistency protocol it uses.

This post has a collection of eight tips for optimizing performance of Amazon Redshift.

CockroachDB uses RocksDB as its underlying storage engine. This post describes why they chose RocksDB and how they've implemented core functionality. There's a good discussion of why most of their read access to RocksDB is via scans, some considerations with data replication, and how they batch operations to avoid the performance overhead of calling C from Golang.

HomeAway shares several hard-learned anti-patterns for building a streaming system. Their example shortcuts (e.g. hardcoding a partition key and making their header optional) were taken to increase adoption through ease of use, but caused problems down the line.

The author of this post describes three properties of a robust data pipeline—only processing required data, checkpointing, and documentation. It includes motivating examples for each.
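
As one possible reading of the checkpointing property (a sketch of my own, not the author's code), the Python below records the next position to process after each record, so a restarted run skips work that is already done. `run_pipeline`, the JSON checkpoint layout, and the file name are all hypothetical.

```python
import json
import os
import tempfile


def run_pipeline(records, checkpoint_path, process):
    # Resume from the last committed position, if a checkpoint exists.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(records)):
        process(records[i])
        # Write to a temp file and rename it, so a crash never leaves a
        # half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp, checkpoint_path)


processed = []
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
run_pipeline(["a", "b"], path, processed.append)
run_pipeline(["a", "b", "c"], path, processed.append)  # only "c" is new
print(processed)
```

Note that a crash between `process` and the checkpoint write would cause that one record to be processed again on restart, so the processing step should tolerate repeats.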

TLA+ is a tool for evaluating correctness of distributed/concurrent systems. This post provides a brief introduction and shows how to invoke it via a CLI utility rather than using the clunky IDE.

Newsletter Rec

For staying up to date on the latest security news, tools, and best practices, I strongly endorse Security Newsletter. Its weekly emails are much like this newsletter but focused on security. Check out the archives and subscribe!


Stratechery has a great analysis of the economics of open source software, using MongoDB/DocumentDB as a case study.

It turns out that the MySQL protocol allows the server to request arbitrary files from the client, a fact which was recently exploited.

"The Internals of PostgreSQL" is an online book for database admins and system developers. There are 11 chapters.

The videos from PyData DC 2018, which took place last fall, have been published. Lots of content on machine learning, distributed computation, data cleaning, and more.


Mockedstreams is a Scala library for testing Apache Kafka and Kafka Streams applications. Version 3.1 was just released with support for the Scala DSL for topologies.

Apparate is a new open source tool from ShopRunner for bundling libraries via CI/CD for Databricks.

Version 2.6.0 of Apache Kylin, the OLAP service for Hadoop, has been released.

The FoundationDB team has released a new Record Layer, which provides relational database semantics atop FoundationDB. As the blog post mentions, this system powers iCloud's CloudKit. It has a query API and query planner, although there doesn't yet appear to be a SQL interface.

Scylla, the open source database compatible with Apache Cassandra, has released version 3.0. The new version includes materialized views, secondary indexes, hinted handoffs, performance improvements, and more.


Curated by Datadog


CDC and Building a Streaming Analytics Stack with Kafka and Druid (Menlo Park) - Tuesday, January 22


Introducing the Data River: Apache Druid Is the Next Analytics Platform (Broomfield) - Thursday, January 24


Big Data Processing Engine: Which One Do I Use? (Weston) - Thursday, January 24

New York

Unifying Messaging, Queuing, Streaming & Lightweight Compute with Apache Pulsar (New York) - Thursday, January 24


Machine Learning, API, Flask & Python with Kafka (Bogota) - Thursday, January 24


Airflow Meetup @ Google (London) - Wednesday, January 23


IoT Sthlm #30: Scalability & Data (Stockholm) - Tuesday, January 22


Messaging and Connecting Services + HTTP Headers: A Research Expedition (Dusseldorf) - Thursday, January 24

Hadoop: Taming the Elephant with a Whale (Hamburg) - Thursday, January 24


Salting Your Spark to Scale (Herzliya) - Wednesday, January 23