Data Eng Weekly

Data Eng Weekly Issue #286

21 October 2018

Several great architecture posts this week covering Apache Hadoop Ozone, Pravega, Alibaba's distributed file system, FaunaDB and Apache Pulsar. There are also interesting posts on Uber's data platform, Wallaroo, Apache Airflow, and more. In news, there's a CFP, an upcoming conference, videos from StrangeLoop, and keynote videos from Kafka Summit. Lots of great stuff to read/watch!


This presentation provides a good introduction to Apache Airflow (it's features, terminology, and more) and Embulk, a tool for bulk loading data between data sources.

This post describes how the various tiers of storage/cache work in an Apache Pulsar system and how common scenarios (writes, catch up reads, etc) interact with the cache.

The Hortonworks blog has a post about Apache Hadoop Ozone. Lots of interesting pieces to the architecture, including the usage of RocksDB for storage, Apache Ratis (which is a Java implementation of RAFT), and more. The Ozone architecture breaks significantly from HDFS—in fact, there's a proposal in the works to replace HDFS' block storage layer with that from Ozone.

Azure CosmosDB has a MongoDB API, which can be leveraged alongside the MongoDB Kafka Connect plugin to implement change data capture into Apache Kafka.

This tutorial shows how to provision a Wallaroo cluster using Pulumi and Ansible. With this setup, they show good speedups in a CSV workload across 4, 8, and 16 servers.

In this post, there's a great list of features, with explanation, that are important in a production ready workflow engine. Many of the descriptions include anecdotes and examples that were hard-won by running production systems. If you're ever inclined to write your own workflow engine, I'd suggest reading this one to get a good idea of what you're in for.

Pravega is a streaming storage engine with similar features to Apache Kafka and Apache Pulsar. This post looks at the internals of the system—the controller that is responsible for high-level cluster operations, the segment store which supports two tiers of storage, and more.

Uber writes about the evolution of their data infrastructure from one based on only Vertica to one based on Hadoop and Vertica to one based one that supports incremental inserts, updates, & deletes and incorporates Kafka to improve latency and performance. They've built an ingestion service to consolidate the logic of importing changelog data and maintaining incremental and "latest" views of a table.

Pangu is Alibaba's distributed file system. This post provides a high-level introduction to its architecture and design goals, which include compatibility with the Hadoop FileSystem API.

Indeed has written a series on their metrics and insights system. The first describes Imhotep, the open-source system that's the core of the Indeed data platform. Subsequent posts cover how they use Imhotep in their workflow, and there's an example application based on analyzing the ASF JIRA dataset. They even have a demo site where you can try out some queries on the JIRA data.

If you've worked with a modern streaming system, you've inevitable heard of event time and watermarks. This post from the data Artisan's team is an easy-to-understand explanation of these important concepts.

The FaunaDB transaction protocol is much different than most distributed databases. It avoids many of the challenges in distributed consensus (such as having high precision clocks) by batching transactions. This post has a great overview of the protocol, including many illustrations that help with understanding how the system behaves.

Common Table Expressions really help with readability of SQL queries, and they're supported by lots of databases. If you're not familiar, here's a good intro.

These two blog posts present a deep dive into Apache Pulsar. The first covers architecture (including how BookKeeper, ZooKeeper and Pulsar components fit together), Pulsar's read and write semantics, and how it compares to Apache Kafka and RabbitMQ in common failure scenarios. The second post induces failure in Pulsar to verify correctness of reads and writes during fail-over.


Post a job to the Data Eng Weekly job board for $99.


This post introduces the concept of the "negative data engineering," which is the defensive code and rules that a data engineer ends up spending lots of time on.

MongoDB, which had been licensed under the AGPLv3 license, has been relicensed to make it more difficult for cloud vendors to offer a MongoDB service without open sourcing their changes.

ZDNet has an interview with Confluent CEO Jay Kreps on the mainstream adoption of Apache Kafka, the Hortonworks and Cloudera merger, and more.

The Call for Papers for Big Data Technology Warsaw Summit, which takes place in Warsaw next February, has been extended. Submissions are now accepted through October 25th.

Data Eng Conf takes place November 8th and 9th in NYC. Tickets are on sale now and go up in price later this week.

Videos from Strange Loop 2018 have been posted online. There are lots of interesting talks on distributed systems, stream processing, data pipelines, optimizing Spark, and more.

Last week was Kafka Summit SF. Keynotes from Martin Kleppmann of the University of Cambridge, Jay Kreps of Confluent, and Chris D'Agostino of Capital One have been posted. There's also a panel with folks from Microsoft and Slack.


HUE 4.3 is released. It includes lots of UX improvements to the SQL editor, improved dashboard layouts, and more.

Apache Beam 2.7.0 is out with new Kudo, Amazon SNS, and Amazon SQS integrations, experimental support for Python on local Flink, and more.

A maintenance release of librdkafka, the C/C++ library for Apache Kafka, was announced this week. It includes a number of bug fixes and enhancements.

PostgreSQL 11 was released earlier this week. This Postgres blog has an overview of the improvements and new features in th enew version.


Curated by Datadog ( )


Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Tuesday, October 23


StreamSets at Cargill and phData (Hopkins) - Tuesday, October 23


Druid: Operational Analytics for Event Data (Sandy Springs) - Tuesday, October 23


Using Apache NiFi to Integrate Data into a NoSQL Multi-Model Database (Arlington) - Wednesday, October 24


Information Session II: Introduction to Hadoop Ecosystem (Toronto) - Saturday, October 27


ETL and DW 3.0 with Azure Databricks Delta (Brasilia) - Wednesday, October 24


Clickstream Processing at the Financial Times (London) - Thursday, October 25


How-To Datalake and Spark (Helsinki) - Tuesday, October 23

Apache Kafka Meetup @ Paf (Helsinki) - Thursday, October 25


Paris Data Eng (Paris) - Tuesday, October 23

Apache Kafka and Streams Messaging Manager (Paris) - Thursday, October 25


Zeebe Meets Confluent: Taming Event-Driven Architectures (Berlin) - Monday, October 22


Big Data on Kubernetes + Apache Beam: What Do I Gain? (Warszawa) - Thursday, October 25


Kafka Streams and Monitoring (Zagreb) - Thursday, October 25