Data Eng Weekly

Data Eng Weekly Issue #305

17 March 2019

Topics this week include OpenTracing, structured logging, scaling Apache Airflow, dbt for ETL, joins in Apache Spark, and the new Quarkus framework. Quite a lot of variety, so hopefully something for everyone!


Sematext has a five-part series on OpenTracing, covering the basics and terminology (like Spans and Baggage) and an overview of two implementations: Jaeger and Zipkin.
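
To make the terminology concrete, here's a toy sketch (plain Python, not the real OpenTracing API) of the two ideas the series covers: spans as named, timed units of work linked parent-to-child, and baggage as key/value pairs that propagate from a span to all of its descendants.

```python
import time
import uuid

class Span:
    """Toy span: a named, timed unit of work (illustrative, not the OpenTracing API)."""
    def __init__(self, operation, parent=None):
        self.operation = operation
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent.span_id if parent else None
        # Baggage propagates: children inherit a copy of the parent's items.
        self.baggage = dict(parent.baggage) if parent else {}
        self.start = time.time()
        self.finish_time = None

    def set_baggage_item(self, key, value):
        self.baggage[key] = value

    def finish(self):
        self.finish_time = time.time()

root = Span("http_request")
root.set_baggage_item("user_id", "42")
child = Span("db_query", parent=root)  # linked to root, inherits its baggage
child.finish()
root.finish()
```

A real tracer (Jaeger or Zipkin) would also record timestamps and ship finished spans to a collector; the series covers how those pieces fit together.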

This post provides an introduction to the features of dbt, a tool for efficiently building and executing ETLs. One of the more exciting features is a mechanism for testing the schemas and data produced by SQL queries.
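
For illustration, dbt's built-in schema tests are declared in a YAML file alongside the models; the model and column names below are hypothetical:

```yaml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` compiles each declared test into a SQL query against the warehouse and fails if any rows violate the assertion.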

Grab writes about the advantages of structured logging (including better root cause analysis, better observability, and better standardization), and the libraries and tools they're using for their structured logging stack. Among them, they have a fixed schema for their structured log records (with a library that can code generate an API for producing the messages).
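
As a minimal sketch of the idea (not Grab's actual stack), a structured logger emits each record as a machine-parseable JSON object with a fixed set of fields, rather than free-form text; the field names here are assumptions for illustration:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with a fixed schema."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Structured context attached by callers via the `extra` kwarg.
            "context": getattr(record, "context", {}),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "logger": "checkout", "message": "order placed", ...}
logger.info("order placed", extra={"context": {"order_id": "o-123"}})
```

A fixed schema like this is what enables the better root cause analysis and observability the post describes: every log line can be indexed and queried by the same fields.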

A good list of design principles for building systems using an event driven architecture. Several principles describe ways to be defensive in producing data and others focus on designing for backwards compatibility.
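
One of the backwards-compatibility principles can be sketched in a few lines: consumers read only the fields they know about and ignore the rest, so producers can add optional fields without breaking old consumers. The event shape here is a hypothetical example:

```python
import json

def handle_order_event(raw):
    """A v1 consumer: picks out known fields, silently ignores unknown ones."""
    event = json.loads(raw)
    return {"order_id": event["order_id"], "amount": event["amount"]}

# A newer producer added an optional "coupon" field; the v1 consumer still works.
v2_event = json.dumps({"order_id": "o-1", "amount": 9.99, "coupon": "SAVE10"})
result = handle_order_event(v2_event)
```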

Astronomer has a discussion of the trade-offs of running a single multi-tenant Airflow deployment vs. running multiple deployments that are divided up by functional area. In their experience, multiple deployments is typically the way to go.

The Apache Flink blog has a post describing how to use Prometheus to monitor Flink jobs.
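
The setup boils down to two config fragments: enabling Flink's Prometheus metrics reporter in flink-conf.yaml, and pointing Prometheus at the exposed port (the port and job name below are illustrative defaults):

```yaml
# flink-conf.yaml: expose Flink metrics on an HTTP endpoint for Prometheus
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249

# prometheus.yml: scrape the endpoint above
scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['localhost:9249']
```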

This tutorial shows how to mirror data from Google BigQuery to Amazon S3 and query it with Amazon Athena. The post breaks the process down into 6 steps, and the last 4 (which cover extracting Apache Avro schemas and creating Apache Hive tables) are useful even if you're working with Avro data in S3 without using BigQuery at all.
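
The Hive-table step looks roughly like the following DDL, using the Avro SerDe with the schema extracted in the earlier steps (the table, bucket, and field names here are hypothetical):

```sql
-- Table and S3 paths are placeholders; the schema literal comes from the
-- extracted .avsc file.
CREATE EXTERNAL TABLE events (
  user_id string,
  event_time bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"events","fields":[{"name":"user_id","type":"string"},{"name":"event_time","type":"long"}]}');
```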

Hue has a neat new feature for visualizing and debugging Impala queries. It shows the execution DAG and a detail pane provides more information about what is happening at each step.

This post describes the various types of join in Apache Spark (e.g. hash join, merge join), when each is used, and some configuration that can be adjusted to optimize join performance.
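
To illustrate why the join strategy matters, here's a toy broadcast hash join in plain Python (not Spark code): build a hash map from the small side, then stream the large side through it in one pass with no sorting. This is the strategy Spark picks when one side fits under `spark.sql.autoBroadcastJoinThreshold`; for two large inputs it falls back to a sort-merge join.

```python
def broadcast_hash_join(large, small, key):
    """Toy inner join: hash the small side, probe with the large side."""
    lookup = {}
    for row in small:                      # build phase: small side only
        lookup.setdefault(row[key], []).append(row)
    out = []
    for row in large:                      # probe phase: one pass, no sort
        for match in lookup.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            out.append(merged)
    return out

orders = [{"user_id": 1, "total": 10}, {"user_id": 2, "total": 5}]
users = [{"user_id": 1, "name": "ada"}]
joined = broadcast_hash_join(orders, users, "user_id")
```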

This article covers how Apache Kafka Connect handles and surfaces errors, including by writing messages to a dead letter queue (optionally with Kafka message headers containing details about the error) for reprocessing.
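
The dead letter queue behavior is driven by a few sink-connector settings (available since Kafka 2.0; the topic name below is a placeholder):

```properties
# Keep running on conversion/transform errors instead of failing the task
errors.tolerance=all
# Route failed records to a dead letter queue topic
errors.deadletterqueue.topic.name=my-connector-dlq
# Attach error details (exception, original topic/partition) as record headers
errors.deadletterqueue.context.headers.enable=true
```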

The Debezium blog has a post on writing applications to consume data from Apache Kafka using Quarkus, a new framework for building Kubernetes-native applications (including, via GraalVM, native binaries with very fast startup times). Quarkus' Kafka integration uses a reactive messaging API from the MicroProfile framework, which the code samples demonstrate.

The Alphahealth team has an overview of validating, evolving, and anonymizing JSON data. The post references a number of tools, including JSON Schema, JSON Schema Validator, JSLT Expressions (for transforming data between two schemas), and more. There's some example code for each of these and discussion of how they deploy the code with Lambda.
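
As a sketch of what schema validation buys you, here's a minimal validator for a tiny subset of JSON Schema (just `required` and `type`); a real deployment would use a full library like the ones the post references, and the field names here are hypothetical:

```python
# Minimal validator for a small JSON Schema subset (required fields + types).
SCHEMA = {
    "required": ["patient_id", "age"],
    "properties": {"patient_id": {"type": "string"}, "age": {"type": "integer"}},
}
TYPES = {"string": str, "integer": int}

def validate(document, schema):
    """Return a list of validation errors (empty list means the doc is valid)."""
    errors = []
    for field in schema["required"]:
        if field not in document:
            errors.append(f"missing required field: {field}")
    for field, rules in schema["properties"].items():
        if field in document and not isinstance(document[field], TYPES[rules["type"]]):
            errors.append(f"wrong type for {field}")
    return errors

assert validate({"patient_id": "p-1", "age": 30}, SCHEMA) == []
```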


Senior Data Engineer (Spark), N26, Berlin

Software Engineer - Data Platform, Fitbit, San Francisco, CA


DataOps Summit is a new conference from the StreamSets team that will take place in September. The Call For Speakers is open until April 5th.


Version 1.9.0 of Apache Kudu was released. It includes location awareness, a number of new CLI tools, Docker scripts, and improved testing utilities, along with various optimizations, improvements, and fixes.

YugaByte DB, the distributed SQL system, released version 1.2.0. It has several new features for SQL and transaction support, new datatypes, and more.

Astronomer announced version 0.8.0 of their platform, which includes Apache Airflow 1.10.2 and a number of updates to their Kubernetes-based platform.


Curated by Datadog


Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Wednesday, March 20

Druid & Kafka: Swimming in the Data River, Or, When “Streaming Analytics” Isn’t (San Francisco) - Wednesday, March 20


Deploying Kafka Streams Applications with Docker and Kubernetes (Portland) - Wednesday, March 20


Confluent and Nordstrom Discuss Events, Logs, and Microservices! (Seattle) - Tuesday, March 19

IRELAND Data Engineering Workshop (Dublin) - Saturday, March 23


How to Contribute to Spark 3.0? With Holden Karau (Barcelona) - Tuesday, March 19


Kafka, the Power of Events and Unbounded Data (Paris) - Tuesday, March 19

Kafka Night (Toulouse) - Tuesday, March 19


A Spring with Stream Processing and Apache Flink (Utrecht) - Wednesday, March 20

Apache Kafka, KSQL, Demos & (Amsterdam) - Thursday, March 21


Flink @ Workday + Image Processing with Flink (Munich) - Thursday, March 21

Open Source Infrastructure (Berlin) - Thursday, March 21


SQL Over Hadoop: Designing, Developing, and Supporting (Wrocław) - Tuesday, March 19


Kafka and Stream Processing at Gojek (Bengaluru) - Saturday, March 23


Apache Kafka on AWS & Serverless: State of the Union (Johannesburg) - Monday, March 18

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.