Data Eng Weekly

Data Eng Weekly Issue #307

31 March 2019

Over the years, the topics covered in this newsletter have grown from the Apache Hadoop ecosystem to data engineering and distributed systems to various adjacent topics. I've followed the content where it's interested me, and the emails have gotten quite long trying to encompass all the relevant news, technical posts, and releases. So I'm trying something new with this week's issue—I'm slimming down the newsletter and choosing the best articles across all of those sections. Going forward, expect to see about 10 links each week. For everyone who has sent an article my way—thanks! Please continue doing so, but don't take it personally if I don't end up including something you pass along.


Amazon Web Services published two papers in the past couple of years about their hosted database, Amazon Aurora. The Morning Paper summarized some of the key topics covered in those papers, including how Aurora decouples its storage tier, how the design considers Availability Zones for quorum writes, the log system maintained by the storage tier, and how all of these achieve fault tolerance and high throughput. While Aurora might be a black box to an AWS user, it's really interesting to see how the folks at AWS design a distributed system to thrive in AWS.

There's a common tension in many organizations between standardization and autonomy. At Netflix, they have lots of teams using lots of different tools, which means there isn't much standardization to take advantage of when building a data lineage system. Netflix writes about how they pull metadata from many different sources (including Lipstick to gather information about Apache Pig scripts, the Meson scheduler API, and Amazon S3 access logs) to reconstruct lineage and enrichment data about their data pipeline. The metadata ultimately lands in a graph database, which is fronted by a REST API.

Cadence is a workflow and orchestration tool with a Golang SDK. This tutorial shows how to author your first workflow with Cadence. There are some minor changes, like using the SDK's routines and channels rather than the go-native versions. Cadence uses Cassandra for storage, has a web UI for visualizing workflows, and has a server/worker architecture that integrates well with Kubernetes and similar tools.

Version 0.9.3 of Debezium, the change data capture tool for databases, was released. While it's a minor release, it includes an implementation of an EventRouter for the outbox pattern. If you aren't familiar with the outbox pattern, there's a link to a blog post about it in the release notes—it's a clever design pattern for publishing events about your system that seems quite promising.
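The core idea of the outbox pattern is small enough to sketch: write the business row and an event row to an "outbox" table in the same database transaction, so an event exists if and only if the write committed, and let a CDC tool relay the outbox rows to a broker. This toy uses an in-memory SQLite database; the table shapes and event names are made up for illustration, not Debezium's actual schema.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY,   -- event id
    aggregate_id TEXT,     -- which order the event is about
    event_type TEXT,
    payload TEXT           -- serialized event body
)""")

def place_order(order_id, total):
    # Business row and event row in ONE transaction: the event is
    # published if and only if the order actually committed.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "total": total})),
        )

place_order("o-1", 42.0)
# In production, Debezium would tail the outbox table (via the
# database's log) and relay rows to Kafka; here we just read it back.
events = conn.execute("SELECT event_type, payload FROM outbox").fetchall()
print(events)
```

Because the relay reads committed rows from the database log, consumers never see an event for a transaction that rolled back—which is the dual-write problem the pattern exists to avoid.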

Distributed tracing tools offer great support for request tracing in HTTP or RPC services. But you may not realize they can also be used for tracing outside of the request path for systems like Apache Kafka. This post describes how to use Kafka's interceptors to add tracing data to Kafka Connect, KSQL, and other Kafka producers/consumers.
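Kafka's Java clients expose `ProducerInterceptor`/`ConsumerInterceptor` hooks for exactly this kind of instrumentation. Here's a Python toy that mimics the producer-side `onSend` step—stamping trace metadata onto each record's headers before it leaves the client. The stub producer and the `trace-id` header name are assumptions for illustration, not the API of any real tracing library.

```python
import uuid

class StubProducer:
    """Stand-in for a real Kafka producer (hypothetical; records are
    collected in memory instead of being sent to a broker)."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value, headers=None):
        self.sent.append((topic, value, headers or []))

class TracingInterceptor:
    """Mimics ProducerInterceptor.onSend(): attach trace metadata to
    every outgoing record, without the application code changing."""
    def on_send(self, topic, value, headers):
        headers = list(headers) + [
            ("trace-id", uuid.uuid4().hex.encode()),  # header name is an assumption
        ]
        return topic, value, headers

class InterceptingProducer:
    """Wraps a producer so the interceptor runs on every send."""
    def __init__(self, producer, interceptor):
        self.producer = producer
        self.interceptor = interceptor

    def send(self, topic, value, headers=()):
        topic, value, headers = self.interceptor.on_send(topic, value, headers)
        self.producer.send(topic, value, headers=headers)

producer = InterceptingProducer(StubProducer(), TracingInterceptor())
producer.send("clicks", b"page=/home")
```

Since interceptors are configured on the client rather than written into application code, the same mechanism lights up Kafka Connect and KSQL—you only change client configuration.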

Apache Kafka 2.2 was released this week. There are a lot of new features in the release, which are summarized in this video/screencast. I'm not normally a video person, but I really enjoyed this one—it's a great length at under 15 minutes.

Dream11 (a fantasy sports website in India) tells the story of scaling their real-time fraud engine. They started with a relational database, and they introduced change-data-capture using Debezium to send data to Apache Cassandra. Their post describes their requirements (e.g. they have large bursts during the World Cup and other sporting events) and what other tools they use and have evaluated.

CDNs are easy to take for granted these days—if you're not familiar (or need a refresher) then this is a great intro to how they work. This implementation uses Route53's latency-based routing (which sends a request to the closest region), Kubernetes' ExternalDNS, and Terraform to spin up Kubernetes clusters in multiple AWS regions.
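The routing decision at the heart of the setup is simple to model: given measured latencies from a client's resolver to each region's endpoint, answer the DNS query with the fastest region. The region names and latency figures below are invented for illustration; Route53 maintains its own latency measurements rather than taking them from you.

```python
# Measured round-trip latencies (ms) from one client to each region's
# endpoint -- made-up numbers for illustration.
LATENCIES_MS = {
    "us-east-1": 85,
    "eu-west-1": 20,
    "ap-southeast-2": 240,
}

def route(latencies):
    """Pick the region with the lowest observed latency, as
    latency-based routing does per resolver."""
    return min(latencies, key=latencies.get)

print(route(LATENCIES_MS))  # -> eu-west-1
```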

I've often heard talk of mirroring data from a production database to a lower environment for testing, but it's often hard to follow through on. This post describes how to use Google Cloud Functions with Cloud Scheduler to run a script that mirrors data across BigQuery datasets, which makes that goal achievable.
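The shape of such a mirroring job is easy to sketch: enumerate the tables in the source dataset and replace each one in the destination. To keep this runnable without GCP credentials, datasets here are plain dicts; in the real setup the loop body would be the BigQuery client's table-copy calls, and the dataset/table names below are made up.

```python
# Toy model: "datasets" maps dataset name -> {table name -> rows}.
datasets = {
    "prod":    {"events": [1, 2, 3], "users": ["a", "b"]},
    "staging": {"events": [99]},  # stale copy from a previous run
}

def mirror(src, dst):
    """Replace every table in dst with a fresh copy of src's tables,
    as a scheduled mirroring job would."""
    datasets[dst] = {name: list(rows) for name, rows in datasets[src].items()}

mirror("prod", "staging")
print(datasets["staging"])
```

Running the job on a schedule (the post uses Cloud Scheduler triggering a Cloud Function) keeps the lower environment from drifting too far behind production.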


Curated by Datadog


Apache Druid Bay Area Meetup @ Unity Technologies (San Francisco) - Tuesday, April 2


Exactly Once Processing with Flink and Pravega (Seattle) - Thursday, April 4


Building Zero Data Loss Pipelines with Apache Kafka (Tempe) - Wednesday, April 3


Big Data Tools: DataBricks (Oklahoma City) - Thursday, April 4


Seasons of Streaming (Saint Louis) - Wednesday, April 3


Trends Affecting System Architectures: Streaming Data, Machine Learning, Etc. (Chicago) - Tuesday, April 2

Operationalizing Kafka in the Cloud, Benchmarks, and SpotHero's Uses (Chicago) - Thursday, April 4

Apache Kafka and Kubernetes Day (New York) - Wednesday, April 3

Is Apache Kafka and IBM Event Streams Right for Your Event-Streaming Needs? (New York) - Wednesday, April 3

Stream Processing Beyond Streaming with Apache Flink, by Co-Creator of Flink (New York) - Thursday, April 4


Introduction to Apache Spark Hands-On (Toronto) - Tuesday, April 2


Apache Beam Meetup 6: Detecting Fraud @ Revolut & Tech Deep Dive (London) - Wednesday, April 3


Benchmarking Stream Processing Frameworks and Parquet Optimizations (Amsterdam) - Wednesday, April 3


Development of Event-Driven Systems in Apache Kafka (Frankfurt) - Tuesday, April 2

Data Reliability Challenges with Apache Spark (Munich) - Wednesday, April 3


Spark Structured Streaming + HBase Peers Part 2: CDC on Apache Kafka (Bologna) - Tuesday, April 2


Microservices, Kafka Streams, and KafkaEsque (Wien) - Tuesday, April 2


Zonky Kafka Meetup (Praha) - Tuesday, April 2


AWS Big Data Architecture Lessons Learned 1.2 (Tel Aviv-Yafo) - Monday, April 1

Kafka Streams Applications (Tel Aviv-Yafo) - Wednesday, April 3


Sydney Data Engineering Meetup (Sydney) - Thursday, April 4

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.