Data Eng Weekly

Data Eng Weekly Issue #301

10 February 2019

This week's issue has the usual amount of content on Kafka and streaming data, and it also has several articles on less frequent topics. These include Redshift, some debugging stories (such as with YARN+cgroups), Kubernetes, and loading data into BigQuery using Google Cloud Functions. In releases, there are some interesting new projects from LinkedIn (Kafka Cruise Control Frontend) and FANDOM (Athena Alerter).


This tutorial shows how to build a Redshift query that efficiently joins data from the MaxMind GeoIP database to analyze the location of IPs. The solution includes a neat trick to optimize the join cost by computing a lookup table that enables filtering by IP prefix.
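To make the prefix idea concrete, here is a minimal Python sketch of the general technique, not the tutorial's actual SQL. The sample networks, the labels, and the choice of a 16-bit prefix are all assumptions made for illustration. Each GeoIP range is registered under every prefix it overlaps, so a lookup only has to scan the ranges that share the IP's prefix, much like an equality join on the prefix followed by a range filter.

```python
import ipaddress
from collections import defaultdict

# Hypothetical GeoIP rows: (network in CIDR form, location label).
GEOIP_ROWS = [
    ("10.0.0.0/8", "private-a"),
    ("192.168.1.0/24", "private-b"),
    ("203.0.113.0/24", "doc-range"),
]

PREFIX_BITS = 16  # assumption: bucket IPv4 addresses by their first two octets


def prefix_of(ip_int):
    """Return the leading PREFIX_BITS bits of a 32-bit IPv4 integer."""
    return ip_int >> (32 - PREFIX_BITS)


def build_lookup(rows):
    """Map each prefix to the (start, end, location) ranges that overlap it."""
    lookup = defaultdict(list)
    for cidr, location in rows:
        net = ipaddress.ip_network(cidr)
        start = int(net.network_address)
        end = int(net.broadcast_address)
        # A wide network (such as a /8) spans many prefixes; list it under each.
        for prefix in range(prefix_of(start), prefix_of(end) + 1):
            lookup[prefix].append((start, end, location))
    return lookup


def locate(lookup, ip):
    """Find the location for an IP by scanning only its prefix bucket."""
    ip_int = int(ipaddress.ip_address(ip))
    for start, end, location in lookup.get(prefix_of(ip_int), []):
        if start <= ip_int <= end:
            return location
    return None


lookup = build_lookup(GEOIP_ROWS)
print(locate(lookup, "10.20.30.40"))
print(locate(lookup, "8.8.8.8"))
```

The trade-off is a larger lookup table in exchange for a cheap equality match on the prefix before the range comparison.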

G-Research shares a good debugging story about how YARN's cgroup usage led to Linux kernel issues (and how to fix them!).

The team at Disney Streaming writes about how they've built a solution to auto scale Amazon Kinesis Streams. The tool is built on AWS Lambda and works to predictively scale up/down based on log data.
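As a rough illustration of the sizing decision behind this kind of autoscaling, here is a small Python sketch. It is not the Disney Streaming implementation: the function name, the headroom parameter, and the idea of sizing directly from observed throughput are assumptions for illustration. It relies only on the documented per-shard write limits of 1 MiB/s and 1,000 records/s.

```python
import math

# Documented per-shard write limits for Kinesis Data Streams.
SHARD_BYTES_PER_SEC = 1024 * 1024
SHARD_RECORDS_PER_SEC = 1000


def target_shard_count(bytes_per_sec, records_per_sec, headroom=0.25, min_shards=1):
    """Estimate how many shards cover the given write throughput.

    headroom is a hypothetical safety margin (0.25 means 25% spare capacity).
    """
    by_bytes = bytes_per_sec * (1 + headroom) / SHARD_BYTES_PER_SEC
    by_records = records_per_sec * (1 + headroom) / SHARD_RECORDS_PER_SEC
    # Whichever limit is tighter determines the shard count.
    return max(min_shards, math.ceil(max(by_bytes, by_records)))


# 4 MiB/s of small-volume traffic is bound by bytes, not record count.
print(target_shard_count(4 * 1024 * 1024, 500))
```

A Lambda function could feed a forecast of upcoming throughput into a function like this and then resize the stream, though the details of that step depend on the tool.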

This post describes the components of Apache Kafka that depend on Apache ZooKeeper. It then describes how to replace those pieces with an implementation of the Raft protocol built using the Atomix framework. Code for the implementation is on GitHub.

Event sourcing can enable lots of compelling use cases, but like many architectural designs there are trade-offs. In this post, the author shares his opinions and experience on a number of those, such as startup cost and the complexity of consuming an audit log.

Zenreach writes about how they implemented Kafka with Kafka Streams to process events pertaining to customer data, offloading from Mongo. They share details about their implementation, including advice for testing and a gotcha with co-partitioning data.

This post describes some tricks for loading data into Redshift with Apache Airflow, as well as for querying it efficiently. Examples include running schema migrations as the first step in a workflow and writing a separate workflow to VACUUM the database.

A look at configuring Hive-on-Spark on CDH, which involves some special tuning and workarounds (as only Spark 1.x is supported with this setup).

This post draws out the similarities between Kafka topic and ACL management and RDBMS schema management, and it proposes a migration-like solution for managing those configurations.
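For readers unfamiliar with the migration pattern, the following Python sketch shows the general shape of it under stated assumptions. It is not the post's tool: the topic name, the config values, and the in-memory dictionary standing in for a Kafka cluster are all hypothetical. Each change is a numbered step, and only steps newer than the last applied version are run.

```python
# In-memory stand-in for cluster state: topic name -> config dict (hypothetical).
def create_orders(state):
    state["orders"] = {"partitions": "6", "retention.ms": "604800000"}


def shorten_retention(state):
    state["orders"]["retention.ms"] = "86400000"


# Ordered, numbered changes, in the style of RDBMS schema migrations.
MIGRATIONS = [
    (1, "create orders topic", create_orders),
    (2, "shorten orders retention", shorten_retention),
]


def migrate(state, applied_version, migrations):
    """Apply every migration newer than applied_version, in version order."""
    for version, _description, apply in sorted(migrations, key=lambda m: m[0]):
        if version > applied_version:
            apply(state)
            applied_version = version
    return applied_version


cluster = {}
version = migrate(cluster, 0, MIGRATIONS)
print(version, cluster)
```

In a real setup, each step would call a Kafka admin client instead of mutating a dictionary, and the applied version would be stored somewhere durable.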

Kubernetes provides some great primitives for working with distributed systems. One of them is a PodDisruptionBudget, which can help avoid the type of issue described in this post by ensuring that pods aren't deleted under certain constraints.

Scylla writes about the data migration tool that they've built for migrating data from Apache Cassandra to their DB. The tool has some interesting properties, such as the ability to resume from a checkpoint, preservation of TTL and modification time, and support for simple transformations.

A tutorial for using a Google Cloud Function to load data via Google Cloud Storage into BigQuery. The post covers configuring permissions, managing API key secrets (for capturing data from a 3rd party system), and deployment.

A good overview of the big data landscape on AWS, from storage to data processing to orchestration to reading data. This is a good map of how the tools fit together with a brief introduction to each.

This tutorial shows how to configure Filebeat (a tool from Elastic) to send log data to Kafka. It includes a Docker Compose demo.


Data Engineer - Python, Wooga, Berlin

Software Engineer, Value Platform, Nuna, Inc., San Francisco

Data Engineer, Starship Technologies, Tallinn, Estonia


dataArtisans, which was recently purchased by Alibaba, has renamed itself Ververica. The blog has more about how the new name relates to plans for the future.


Debezium 0.9.0 Final has been announced. The release of the change data capture tool adds a new connector for SQL Server, supports the latest versions of other supported databases and Apache Kafka, and has several other improvements. More about the release on the Debezium blog.

Version 5.2 of Databricks Runtime has been released. New features include an experimental time travel feature (more details in the second post), a fast Apache Parquet importer, and a notebook feature that presents tips and advice inline as part of query execution.

FANDOM has open sourced their Athena Alerter, which is a tool for analyzing Amazon Athena queries via a Lambda function and alerting when costs are high. They have a CloudFormation template for deploying the required components.

LinkedIn has open sourced the Cruise Control Frontend, a UI for managing and applying changes to Kafka clusters that are executed by the Cruise Control project. It includes a number of features that are highlighted in the introductory blog post.

Apache Hadoop 3.1.2 was released. It includes over 300 JIRAs, with improvements to Docker and GPU support on YARN, lots of improvements and bug fixes to YARN, and AliyunOSS improvements.


Curated by Datadog


Leveraging Microservices & Kafka to Scale Developer Productivity (Sunnyvale) - Tuesday, February 12

Maintaining Full Data Lineage + Migration & Change Data Capture with CDAP (Palo Alto) - Wednesday, February 13


Dissolving the Problem: Kafka is more ACID Than Your Database (Denver) - Monday, February 11

Kafka Is More ACID Than Your Database (Denver) - Tuesday, February 12


What Is Apache Kafka + Wayfair's Journey with Apache Kafka (Boston) - Tuesday, February 12


Microservices & KSQL (Hamburg) - Tuesday, February 12

How to Successfully Fail with Apache Kafka + Fraud Detection with KSQL (Berlin) - Wednesday, February 13


Apache Flink @ Teralytics (Zurich) - Wednesday, February 13


Kafka and Stream Processing Meetup at LinkedIn (Bangalore) - Saturday, February 16


Developing Contextual, Event-Driven Applications with KSQL and Kafka (Sydney) - Tuesday, February 12

Using Apache Cassandra and Apache Kafka to Scale Next-Gen Applications (Melbourne) - Wednesday, February 13

Bridging from Middleware to Event-Streaming (Perth) - Friday, February 15