Data Eng Weekly

Data Eng Weekly Issue #291

02 December 2018

Lots of coverage of Apache Kafka and change data capture this week, including Kafka at Pinterest, why Twitter is moving to Kafka, and change data capture for MySQL and MongoDB. There are also posts from Criteo on their real-time Apache Flink use case and AirBnB on their Spark Streaming and HBase data load system. In releases, it was a busy week with a few AWS re:Invent announcements, Dask and Apache Samza hitting version 1.0, and the first major release of FoundationDB since it was open sourced. With two weeks' worth of articles to sort through, it ended up being a pretty large issue (even after I cut several good posts).


Introducing the first AI / Machine Learning course with a job guarantee. Springboard's new AI / Machine Learning Career Track is an intensive program that will equip you to transition into a role as a machine learning engineer.

With support from their career services team, they are confident that you will find a job within six months of completing the course. But if you don’t, they'll refund your tuition.

Learn more today:


This presentation has a fantastic overview of Change Data Capture with MySQL. It motivates the use case, describes how Debezium streams changes to Kafka based on the MySQL binlog, and describes how to query the data that it creates. There's also a discussion of building the same for Apache Cassandra, which doesn't have a single binlog to plug into due to its replication model.
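To make the idea of consuming a change stream concrete, here is a minimal sketch that replays change events into an in-memory table. It assumes a heavily simplified version of the Debezium event envelope (only the `op`, `before`, and `after` fields, with `c`/`r`/`u`/`d` operation codes); the function name `apply_changes` and the `id` key column are hypothetical and are not taken from the presentation.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    """Replay simplified change events, in order, into a dict keyed by `key`.

    Assumed event shape (a simplification of the Debezium envelope):
      {"op": "c" | "r" | "u" | "d", "before": row-or-None, "after": row-or-None}
    """
    table: Dict[Any, Row] = {}
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # Create, snapshot read, and update all carry the new row state.
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # A delete carries only the previous row state.
            table.pop(event["before"][key], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table
```

Replaying the events in log order is what makes the result match the source table, which is why a single ordered binlog is such a convenient thing to plug into.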

The dataArtisans blog has a post on the four key steps to making an Apache Flink deployment production ready.

Criteo writes about how they use Apache Flink for real time analytics of data in their multi-data center Kafka cluster. There are some interesting details about how they use watermarks for event time as data is replicated/re-partitioned across clusters.

For real-time event data, AirBnB uses Spark Streaming to send events from Apache Kafka to Apache HBase for deduplication. Typically, the parallelism of Kafka consumers is limited by the number of partitions in a topic, which doesn't scale in the way that AirBnB needs. Taking advantage of the fact that their application doesn't require processing events in order, they have built a new Spark-Kafka connector that can dynamically scale by splitting data inputs within a partition by offset ranges.
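As a rough illustration of splitting a partition by offset ranges, the sketch below cuts one range of offsets into bounded pieces that could each be handed to a separate task. This is not AirBnB's connector; `OffsetRange`, `split_offset_range`, and `max_size` are hypothetical names chosen for the example.

```python
from typing import List, NamedTuple


class OffsetRange(NamedTuple):
    """A half-open range [start, end) of offsets within one partition."""
    topic: str
    partition: int
    start: int
    end: int


def split_offset_range(rng: OffsetRange, max_size: int) -> List[OffsetRange]:
    """Split one partition's offset range into pieces of at most max_size offsets."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    pieces: List[OffsetRange] = []
    start = rng.start
    while start < rng.end:
        # The last piece may be shorter than max_size.
        end = min(start + max_size, rng.end)
        pieces.append(OffsetRange(rng.topic, rng.partition, start, end))
        start = end
    return pieces
```

Because the pieces can be processed concurrently, events from a single partition may finish out of order, which is only acceptable when the application does not depend on ordering.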

This post on the Luigi workflow engine covers how the team at ZOLA Electric deploys Luigi via Docker (and configures the visualizer and task history), replicates data from RDS to Redshift, represents dependencies between SQL scripts via annotations, integrates with Slack for alerting, and more.

YugaByte DB is an interesting new database to watch—it supports the Cassandra, Redis, and Postgres (beta) APIs. In this post, they demonstrate the differences between a NoSQL and SQL database with the Cassandra and Postgres interfaces as an example. For extra comparison, they also do the same with MongoDB.

The Confluent blog describes the high level tasks needed to design, build, and test a multi-datacenter Apache Kafka deployment.

The recent alpha release of Apache Hadoop Ozone, version 0.3.0-alpha, adds S3 API compatibility to Ozone, a blob store built on Hadoop. This post shows how to get started with it.

Pinterest has written about their Kafka deployment on AWS. They describe their architecture and configuration: replicating between clusters in separate regions, deploying brokers within a cluster across availability zones, what types of instances they use (mostly d2.2xlarge's) and the config they use for those instances, and more.

This post describes the benefits of keeping track of SQL queries in git and creating a query library to share them across the members of a team.

Twitter writes about why they're moving to Apache Kafka from their home-grown (based on Apache BookKeeper) system. The main reasons include: better latency and cost savings since Kafka uses the same servers for storage and compute, the Kafka community, and improvements in hardware (specifically faster network and cheaper SSDs).

Sonra has a good overview of the differences between AWS Athena and Snowflake. It covers things like performance (concurrency, caching, query optimizer), security, data loading, file type support, and more.

A good list of PostgreSQL extensions, add-ons, tools, and more. These include PipelineDB for continuous SQL queries, Citus for high availability, Cstore for columnar storage based on ORC, and PostgREST for putting a RESTful API directly atop an existing database.

This article describes how to configure Debezium to build a change data capture pipeline atop MongoDB.

Databricks has an overview of the new Apache Avro features from Apache Spark 2.4, including native support via to_avro and from_avro, performance improvements, and support for Avro logical types (like decimal).

AWS re:Invent was this week, and there was a good presentation on the new features of Amazon EMR. The slides cover options for running a metastore (including Glue and a standalone RDS instance), Jupyter notebooks on EMR, cluster autoscaling, and more.

TLDR Newsletter

Since before I started this newsletter, I've consumed a lot of news via weekly newsletters. I recently discovered TLDR, which is sent out daily. It's a high-quality list of concise summaries and links to stories about what's happened in tech in the past 24 hours. If you feel like you're missing out on news from sites like Twitter, Hacker News, and Reddit (or if you spend too much time checking them each day!), then the TLDR newsletter is a fantastic way to keep on top of things.


The team at Pachyderm raised a round of Series A funding, and Datanami has a good post on how the platform (which is meant as an alternative to Hadoop) differentiates itself. For instance, Pachyderm embraces Docker and Kubernetes and its filesystem includes version history for data and provenance.

HP Enterprise is acquiring BlueData, makers of the EPIC software for deploying big data services via Docker containers.

Dremio's work to add LLVM execution to Apache Arrow (known as the Gandiva Initiative) has been contributed back to the Apache Arrow project.

Amazon announced a public preview of Amazon Managed Streaming for Kafka (MSK). This post is one of the first reviews of the service, and it gives you a good idea of its features (like the version of Kafka) and pieces that are missing.


Databricks Runtime 5.0, which is based on Spark 2.4, is out. There's also a new version of the streaming source for Apache Kafka and Azure Blob Storage file notification.

Esper 8.0.0 was released. Its Streaming SQL engine, which supports Complex Event Processing, now includes a SQL-to-JVM byte code compiler.

Apache Kafka 2.1.0 was released. It adds Java 11 support and zstandard compression, and includes lots of improvements to the streaming API, admin script, and DNS handling. In total there are 28 KIPs and 179 JIRAs in the release.

Qubole has announced release 54, which includes new versions of several open source projects (e.g. Apache Spark and Apache Airflow) and several new features for Microsoft Azure.

Apache Airflow 1.10.1-incubating has been released. New features include the ability to delete a DAG from the web UI, integration with AWS SageMaker, and operators for Google Cloud Functions.

Version 1.0 of Apache Samza was announced. The LinkedIn blog has a post describing Samza's major features, including those new in version 1.0 (such as improved testability and the ability to run Samza outside of YARN). It also talks about some plans for the future.

Amazon Kinesis Data Analytics for Java is a new library for building streaming applications. It's based on Apache Flink.

Version 1.0.0 of Dask, the parallel computing and task scheduling system for Python, has been released.

Apache Flink 1.7.0 adds Scala 2.12 support, exactly-once S3 StreamingFileSink, several new features for streaming SQL, the Kafka 2.0 connector, and much more.

FoundationDB, the distributed database that was open sourced by Apple (after they acquired the company behind it), announced version 6.0.15. It's the first major release since open sourcing, and the major new feature is multi-region support with seamless failover. On the heels of the announcement, the FoundationDB team also announced a new "Document Layer" that provides an API that's compatible with the MongoDB protocol.


Curated by Datadog


Vespa + Bullet (Sunnyvale) - Wednesday, December 5

Centralized Metadata + Build Apps w/o Pipelines + Docker & Kubernetes (Sunnyvale) - Wednesday, December 5

Demystifying AWS Analytics + Building Streaming Analytics Using Kafka & Druid (Santa Clara) - Thursday, December 6


Advanced Analytics with Apache Spark (Katy) - Saturday, December 8


Data Ingestion at Scale (Saint Louis) - Wednesday, December 5


PGH's 1st Kafka Meetup: Fotios Filiacouris from Confluent (Coraopolis) - Thursday, December 6


December Presentation Night (Cambridge) - Tuesday, December 4

Learn How Voya Financial Uses StreamSets (Cambridge) - Thursday, December 6


Apache Kafka: From Messaging to a Streaming Data Platform (Sao Paulo) - Friday, December 7


Data Matters with Jess Anderson (London) - Wednesday, December 5


Streams, Financial Alerts & Self-Service on Apache Kafka (Utrecht) - Thursday, December 6

APIs, Q&A, and Lessons Learned after Using Databricks @ Dynniq (Amsterdam) - Thursday, December 6


SysOps/DevOps Katowice MeetUp #3 (Katowice) - Wednesday, December 5

Data Meetup (Warszawa) - Wednesday, December 5


Let's Talk about Apache Kafka and What's New in Java (Nis) - Thursday, December 6

BOSNIA AND HERZEGOVINA

Apache Kafka: Central Data Pipeline (Sarajevo) - Wednesday, December 5


December Sydney Data Engineering Meetup (Surry Hills) - Wednesday, December 5

The Art and Science of Capacity Planning with Gwen Shapira (Melbourne) - Wednesday, December 5

Software Engineering Sydney December Meetup (Sydney) - Thursday, December 6