Data Eng Weekly

Data Eng Weekly Issue #309

28 April 2019

Another issue covering two weeks' worth of content (I'll return to the weekly issues again eventually!), and there are great articles on change data capture from Uber, powerful features of Postgres, and auto-scaling Apache Kafka Streams. There are two newly open-sourced projects from Databricks and Google which are worth checking out, too.


The Uber Engineering blog has a look at their change data capture system, called DBEvents, which ships lots of data from MySQL, Cassandra, etc. to HDFS and Kafka. Their post goes into the details of how they bootstrap new data sources, the tools they've built for freshness, quality, and efficiency, and specifics on how they've standardized changelog events across systems. Several of the tools that are mentioned, like StorageTapper for processing the MySQL binlog and Marmaray for data ingestion, are open-source projects from Uber.

This article has a good look at the trade-offs and considerations in designing a distributed database. It describes YugaByteDB's approach to those key decisions, like how to implement distributed transactions and considerations for clocks, as they implemented a system that is based on the Google Spanner architecture.

An Apache Kafka Streams application can dynamically scale its number of workers, taking advantage of the Kafka client's built-in consumer rebalancing. This post looks at how you can link this with Kubernetes' custom metrics + autoscaling support (although the solution works just as well for non-Kubernetes deployments) to scale up and down based on consumer lag. There's a tutorial for implementing this strategy using the Prometheus jmx-exporter and Google Stackdriver.
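
To make the lag-based scaling idea concrete, here's a minimal sketch of the decision an autoscaler might make. It is not the post's implementation: the function name, parameters, and thresholds are hypothetical, and it assumes the usual ceiling-of-ratio rule that the Kubernetes Horizontal Pod Autoscaler applies to an average-value metric, capped at the partition count since extra consumers in a group would sit idle.

```python
import math


def desired_replicas(current_replicas, total_lag, target_lag_per_replica,
                     num_partitions, min_replicas=1):
    """Pick a worker count from consumer lag (hypothetical helper).

    Follows the HPA-style rule
        desired = ceil(current * current_metric / target_metric)
    where current_metric is the average lag per replica, which reduces to
    ceil(total_lag / target_lag_per_replica).
    """
    if current_replicas < 1 or target_lag_per_replica <= 0:
        raise ValueError("replicas must be >= 1 and target lag must be > 0")

    desired = math.ceil(total_lag / target_lag_per_replica)

    # A consumer group cannot use more members than there are partitions,
    # and we never scale below the configured minimum.
    return max(min_replicas, min(desired, num_partitions))


if __name__ == "__main__":
    # 3 workers, 30,000 messages of total lag, aiming for 5,000 per worker.
    print(desired_replicas(3, 30_000, 5_000, num_partitions=12))
```

In a real deployment the lag figure would come from whatever metrics pipeline you already have, and you'd likely want some smoothing so that a brief spike doesn't trigger a rebalance.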

If you had asked me whether SQL is Turing complete, I'm not sure how I'd answer. It turns out it is, and you can do some interesting things once you have that knowledge, like generating fractals. This post describes a bunch of interesting SQL features to try out, like Postgres' generate_series and recursive common table expressions (which can be used for iterative algorithms).
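
If you haven't used recursive common table expressions before, here's a small sketch of the idea. To keep it runnable without a database server, it uses Python's built-in sqlite3 module rather than Postgres (that swap is my assumption, not something from the post); the WITH RECURSIVE syntax shown should carry over to Postgres. The first query emulates what generate_series(1, 10) produces, and the second runs an iterative algorithm (Fibonacci numbers) by carrying state from row to row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A recursive CTE that emulates generate_series(1, 10):
# start from 1 and keep adding rows until n reaches 10.
series_sql = """
WITH RECURSIVE series(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM series WHERE n < 10
)
SELECT n FROM series ORDER BY n
"""
series = [row[0] for row in conn.execute(series_sql)]

# An iterative algorithm: each row carries the loop counter i and the
# current pair (a, b) of consecutive Fibonacci numbers.
fib_sql = """
WITH RECURSIVE fib(i, a, b) AS (
    SELECT 1, 0, 1
    UNION ALL
    SELECT i + 1, b, a + b FROM fib WHERE i < 10
)
SELECT a FROM fib ORDER BY i
"""
fib = [row[0] for row in conn.execute(fib_sql)]

if __name__ == "__main__":
    print(series)
    print(fib)
```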

Databricks has open sourced their Delta platform for Apache Spark that adds ACID semantics, data versioning, schema enforcement and evolution, and more. If you want to quickly try it out, the Delta 0.1.0 release provides a drop-in replacement for saving a Spark dataframe to Delta rather than as Apache Parquet.

Google has open sourced ZetaSQL, which is a SQL frontend (parser, analyzer, and more). It is built to work with various backends, and it comes with lots of table stakes features like approximate algorithms (HLL and friends) and JSON support. It also has some interesting features like native support for constructing protobufs and writing UDFs in JavaScript. As you might expect, gRPC is used to communicate with the server. Given how popular Apache Calcite has been in the JVM ecosystem, it'll be interesting to see if ZetaSQL takes off elsewhere.

Really handy list of less-known but highly useful Postgres features, like pub/sub notifications, table inheritance, and various types of indexes. Even if you're using Postgres on a daily basis, chances are there's something new to try out.

HomeAway describes how they ship PMML model definitions to KSQL (via a user-defined function) to implement real-time model evaluation. If you're not familiar with PMML, it's a cross-language standard for defining model parameters. There's a link to the PMML file for the accompanying demo (the code for the end-to-end demo is on GitHub) if you want to see what a file looks like.

Who doesn't love a simple solution to a distributed systems problem? This post describes how Instagram solved a thundering herd problem occurring during cold start of a cache-heavy service. Briefly, they had the clever idea to cache Promises rather than the actual query responses. If you're not familiar with thundering herds, the post serves as a good introduction.
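
The promise-caching trick is easy to sketch. What follows is not Instagram's code: it's a minimal illustration using Python's asyncio, where a Task plays the role of the Promise, and the PromiseCache class and slow_query function are hypothetical names. The cache stores the in-flight task, so concurrent callers asking for the same key on a cold cache all await one backend query instead of each issuing their own.

```python
import asyncio


class PromiseCache:
    """Cache in-flight tasks instead of finished values (illustrative only)."""

    def __init__(self, fetch):
        self._fetch = fetch   # async function: key -> value
        self._tasks = {}      # key -> asyncio.Task

    def get(self, key):
        # Must be called from inside a running event loop.
        task = self._tasks.get(key)
        if task is None:
            # First caller starts the query; everyone else shares the task.
            task = asyncio.create_task(self._fetch(key))
            self._tasks[key] = task
        return task


async def demo():
    calls = 0

    async def slow_query(key):
        # Stand-in for an expensive backend query.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"value-for-{key}"

    cache = PromiseCache(slow_query)

    async def caller():
        return await cache.get("user:1")

    # Five concurrent callers hit the cold cache at the same time.
    results = await asyncio.gather(*(caller() for _ in range(5)))
    return calls, results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

One thing this sketch leaves out is failure handling: a task that raised an exception stays in the cache, so a real version would want to evict failed entries and expire old ones.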


Curated by Datadog


Apache Kafka Controller with Jun Rao + Kafka at Uber (Palo Alto) - Monday, April 29

Analyzing Streaming Data + Columnar Formatted Data for Analytics (Menlo Park) - Tuesday, April 30

Microservices to Analyze Event Streams + The Art and Science of Capacity Planning (Foster City) - Tuesday, April 30

The Latest in Apache Hive, Spark, Druid, and Impala (Palo Alto) - Wednesday, May 1


Data Engineering and the Data Science Lifecycle (Saint Louis) - Wednesday, May 1


CEO/Founder of Qubole Discusses Big Data in the Cloud: From Facebook to IoT (Chicago) - Wednesday, May 1


CEO/Founder of Qubole Discusses Big Data in the Cloud: From Facebook to IoT (Madison) - Tuesday, April 30


Big Data Processing Engines: Which One Do I Use? (Atlanta) - Tuesday, April 30

New York

Introduction to Data Processing with Apache Spark (Rochester) - Thursday, May 2


Apache Beam Meetup: Beam at Lyft + Datalake Using Beam + Schemas (London) - Wednesday, May 1


Confluent and Visual Meta Kafka Talk (Berlin) - Monday, April 29

Kafka, Microservices, and More from Confluent and Camunda (Munich) - Monday, April 29


Our First Apache Kafka Meetup in Bern (Bern) - Thursday, May 2


Druid, Data Pipelines, and Engineering (Tel Aviv-Yafo) - Tuesday, April 30

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.