Data Eng Weekly

Data Eng Weekly Issue #275

29 July 2018

Lots of great content this week on topics from high-level architectural patterns for microservices, data platforms, and distributed systems to looks at systems internals (e.g. Kafka authorization and Qubole's spot instance framework) to sharing lessons from building out data platforms in adtech and for a museum. There's also news and announcements about several conferences, releases, and new projects.


Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at or visit to learn more.


This post goes through examples of each of the eight fallacies of distributed systems and provides some common architectural patterns for each.

Debezium, the change data capture system, uses the database's own log files (e.g. the transaction or bin log). This has a number of advantages over polling the database for changes—five of those advantages are highlighted in this post.
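To make the polling-versus-log distinction concrete, here is a minimal sketch in plain Python. It does not use Debezium or any real database; the `log` list, the order keys, and `apply_log` are hypothetical stand-ins. It only illustrates the general point that a poller sees the table's current state, while a log reader sees every change in commit order.

```python
# Hypothetical change log: (operation, key, value), in commit order.
log = [
    ("insert", "order-1", "NEW"),
    ("update", "order-1", "PAID"),
    ("update", "order-1", "SHIPPED"),
    ("insert", "order-2", "NEW"),
    ("delete", "order-2", None),
]


def apply_log(log):
    """Replay the log to get the table state a query would return."""
    table = {}
    for op, key, value in log:
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = value
    return table


# A poller that runs after all five writes only sees the final state:
# the intermediate updates and the deleted row are invisible to it.
seen_by_polling = apply_log(log)

# A log reader sees each event, including the delete.
seen_by_log_reader = list(log)
```

In this toy run the poller ends up with a single row for `order-1`, while the log reader has all five events to work with.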

Here's a description of the bug in Kafka’s authorization layer underlying a recently disclosed vulnerability. There's also a look at the patching/disclosure process and what other work has been done to verify the authorization implementations in the project.

As more and more systems support common formats and data stored in cloud object stores, opportunities open up to reduce the number of copies of data. This post provides some useful terminology to talk about the flavors of data architecture and their tradeoffs.

A look at the data architecture and data engineering at Unruly, which is a company in the adtech space. Unruly is using both AWS and Google BigQuery with Apache Airflow to drive workflows. The post talks about the evolution of their platform, such as their move from bash to Python and their future option of replacing BigQuery with Athena to simplify the architecture.

This post describes how to use the explode and collect_list operators in Spark to substitute the values in an array based on a lookup table.
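The explode / look up / collect pattern can be sketched in plain Python. This is not Spark code and not the post's implementation; the `rows` and `lookup` data and the function name are hypothetical, and the three steps only mimic what `explode`, a left join, and `collect_list` would do on a DataFrame.

```python
from collections import defaultdict

# Hypothetical input: each row has an id and an array of codes.
rows = [
    {"id": 1, "codes": ["a", "b"]},
    {"id": 2, "codes": ["c", "x"]},
]
# Hypothetical lookup table mapping a code to its replacement value.
lookup = {"a": "apple", "b": "banana", "c": "cherry"}


def substitute_array_values(rows, lookup):
    # Step 1 (like explode): one record per array element.
    exploded = [(row["id"], code) for row in rows for code in row["codes"]]

    # Step 2 (like a left join on the lookup table): keep the original
    # value when the lookup has no match.
    joined = [(row_id, lookup.get(code, code)) for row_id, code in exploded]

    # Step 3 (like collect_list grouped by id): gather values back into arrays.
    grouped = defaultdict(list)
    for row_id, value in joined:
        grouped[row_id].append(value)
    return [{"id": row_id, "codes": values} for row_id, values in grouped.items()]


result = substitute_array_values(rows, lookup)
```

Two caveats if you port this to Spark: `explode` drops rows whose array is null or empty, and `collect_list` does not guarantee element order after a shuffle.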

Qubole writes about the components they've implemented to optimize performance in the face of spot instance termination in AWS. They've added functionality to respond to spot termination events to decommission the instance housing HDFS DataNodes (to ensure no new blocks are written) and proactively cancel Spark executors rather than waiting for them to time out. These features, coupled with their other modifications to work with spot instances, improve performance and reliability while keeping costs low.

This article is about the data platform of a museum, which collects a number of data points through an interactive pen that visitors use. The data is processed using Logstash and Elasticsearch, and they've built a data warehouse using Amazon Redshift. It's an interesting look at a really different application than your normal internet company data platform.

This post recaps a talk from the recent Dataworks Summit about upgrading Hadoop 2 to Hadoop 3. It covers the main changes and gotchas that have to be considered when upgrading a cluster.

Cloudera has the second part in their series on implementing Avro data serialization atop Apache Kafka. This post builds a simple in-memory store as well as the first parts of a schema registry that stores its schemas in a Kafka topic.
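The idea of a registry whose source of truth is a log can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cloudera's code: a plain list stands in for the Kafka topic, and the class name, id scheme, and `user_schema` are hypothetical.

```python
import json


class LogBackedSchemaRegistry:
    """Toy schema registry whose source of truth is an append-only log.

    The log is a plain Python list standing in for a Kafka topic.
    """

    def __init__(self, log):
        self.log = log      # stand-in for the Kafka topic
        self.by_id = {}     # in-memory store: schema id -> schema
        self.ids = {}       # canonical schema JSON -> schema id
        for record in log:  # rebuild state by replaying the log
            self._apply(json.loads(record))

    def _apply(self, entry):
        canonical = json.dumps(entry["schema"], sort_keys=True)
        self.by_id[entry["id"]] = entry["schema"]
        self.ids[canonical] = entry["id"]

    def register(self, schema):
        canonical = json.dumps(schema, sort_keys=True)
        if canonical in self.ids:  # same schema, same id
            return self.ids[canonical]
        entry = {"id": len(self.by_id) + 1, "schema": schema}
        self.log.append(json.dumps(entry))  # write to the log first
        self._apply(entry)
        return entry["id"]

    def get(self, schema_id):
        return self.by_id[schema_id]


topic = []
registry = LogBackedSchemaRegistry(topic)
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}
schema_id = registry.register(user_schema)

# A second instance recovers the same state by replaying the topic.
replica = LogBackedSchemaRegistry(topic)
```

Because every registration is appended before it is applied, any instance that replays the log from the beginning ends up with the same id-to-schema mapping.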

A great overview of microservices communication patterns and their trade-offs. These include: service-service communication, API Gateway, client-side aggregation, and event bus. There's also some advice for how to get off the ground with a new project.


Unravel released a step-by-step video on how to make Kafka and Spark streaming applications fast and reliable. Watch it now.


Presentation slides from the recent Presto Summit have been published. They include both interesting use-case talks and talks on project internals.

DataEngConf NYC takes place in November. The CFP is open through early September, and they’re looking for 15- or 30-minute talk proposals across data engineering and data science.

At Google Cloud Next, Google and several partners announced Knative, which is a new project to bring functions-as-a-service (a la AWS Lambda) to Kubernetes.

Several talks from Kafka Summit have been announced. The conference takes place in San Francisco in October, and early bird pricing ends in just a few days on August 1st.


The Data Eng Weekly board has several jobs:

Submit your own job at


Astronomer has released version 0.3.0 of their managed Apache Airflow product. The release runs on Kubernetes to support major clouds and on-prem deployments, improves high availability, and has improvements to the command line utility.

If you’re using Gradle, here’s a new plugin for the Confluent schema registry.

Apache HBase 2.1.0 was released. The new version includes over 240 changes, including improvements to replication and an experimental rolling upgrade capability to migrate from HBase 1.x to 2.x.

Kafka Graphs is a new graph processing framework for Apache Kafka. This post describes the basics of the implementation, which is built using the vertex-centric, Pregel model.
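For readers new to the vertex-centric model, here is a minimal single-process sketch in Python. It is not the Kafka Graphs API; the function name and the example edges are hypothetical. Each vertex holds a value, reads the messages sent to it in the previous superstep, and messages its neighbors only when its value changes. The computation stops when no vertex is active. The example labels each vertex with the smallest vertex id in its connected component.

```python
def pregel_min_label(edges, max_supersteps=50):
    """Label each vertex with the smallest vertex id in its component."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Vertex state: every vertex starts with its own id as its label.
    value = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(value[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        active = False
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            smallest = min(messages)
            if smallest < value[v]:
                # The label improved: update it and notify neighbors.
                value[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
                active = True
        if not active:
            break  # no vertex changed, so the computation has converged
        inbox = outbox
    return value


labels = pregel_min_label([(1, 2), (2, 3), (4, 5)])
```

A distributed implementation would partition the vertices and ship the messages between workers, but the per-vertex logic keeps this same shape.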

Jaquet is a new open source project for consuming JSON data from Apache Kafka and storing it as Apache Parquet in Amazon S3. This post talks through the motivation and design of the system.

Version 1.2 of Reaper, the repair tool for Apache Cassandra, has been released. It includes performance improvements, a new authentication layer, and several other usability improvements.

Apache Accumulo 1.9.2 is out with critical bug fixes in recovery and concurrency. It's recommended for all users of Accumulo 1.8 and 1.9.

Debezium 0.8.1 and 0.9.0.Alpha1 were released. The 0.9.0 release will add connectors for SQL Server and Oracle. 0.8.1 has some bug fixes.

Confluent has released Ansible playbooks for deploying Confluent Platform via systemd and for enabling encryption and authentication (including generating certificates for SASL).




Curated by Datadog



Productionalizing Spark Streaming Applications (Tempe) - Thursday, August 2


Apache Spark Concepts Primer (Saint Louis) - Wednesday, August 1


Moving and Syncing Big Data for Ecommerce (Jacksonville) - Tuesday, July 31


Data Streaming, Application Processing, ML on a Hybrid-Cloud Platform (West Des Moines) - Thursday, August 2

IRELAND

Talks on Kafka and All Things Tech (Dublin) - Thursday, August 2


Embrace the Anarchy: Apache Kafka's Role in Modern Data Architectures (Munich) - Wednesday, August 1

Real-Time Big Data Processing with Spark Streaming and Kafka (Frankfurt) - Thursday, August 2


Data Engineering Talks (Singapore) - Wednesday, August 1


July Data Engineering Meetup (Sydney) - Monday, July 30

5 Steps to Build Streaming Systems, with Neha Narkhede (Sydney) - Tuesday, July 31