Data Eng Weekly

Data Eng Weekly Issue #292

09 December 2018

Lots of great content this week covering a variety of topics, like Apache Pulsar, Amazon Redshift, Apache Spark, TimescaleDB, and distributed consensus in FaunaDB. There are several posts about Apache Kafka, covering its architecture, Kafka Streams, and Kafka at PayPal. In releases, there are new versions of several projects, and there's a new open-source project from LinkedIn that makes it easy to author User Defined Functions that execute efficiently in multiple execution engines.


Introducing the first AI / Machine Learning course with a job guarantee. Springboard's new AI / Machine Learning Career Track is an intensive program that will equip you to transition into a role as a machine learning engineer.

With support from their career services team, they're confident that you will find a job within six months of completing the course. But if you don’t, they'll refund your tuition.

Learn more today:


This post provides an introduction to horizontal scaling, and when it's worth doing (given the complexity it adds).

The Alibaba Cloud blog has a thorough overview of Apache Kafka, covering how it relates to traditional message queues, its scaling characteristics, and its core components like brokers, topics, and partitions. The post has a number of good diagrams and pictures to illustrate these concepts.
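As a rough illustration of how keys, topics, and partitions relate, here is a minimal sketch of mapping record keys to partitions by hashing. It is a simplification and not Kafka's implementation: Kafka's default Java partitioner uses a murmur2 hash for keyed records, whereas this sketch uses CRC32 from the Python standard library. The names `partition_for` and `assign`, and the partition count, are hypothetical.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 stands in for Kafka's real hash function (an assumption made
    # for illustration); the idea of "hash the key, take the remainder"
    # is the same.
    return zlib.crc32(key) % num_partitions


def assign(keys, num_partitions):
    """Group keys by the partition they would land on."""
    partitions = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return partitions
```

Because the mapping depends only on the key and the partition count, records with the same key always land on the same partition, which is what preserves per-key ordering.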

This post is a good introduction to the new security features of Apache Kafka 2.0. There are examples, with a description of tradeoffs, of a couple of ways to configure topic-level permissions. It also discusses how to do so with KSQL and Kafka Streams.

Netflix has a post describing their approach to replicating data between cache servers in multiple availability zones. Interestingly, their architecture started with Apache Kafka for replication but moved to an S3+SQS architecture for the reasons they enumerate.

Udemy has a good post on their experience with scaling Amazon Redshift and improving its performance. They cover some of the key Redshift concepts, like key distribution (and some tips for it), how to monitor tables for best practices, and how they analyzed and resolved issues with query queuing.

TimescaleDB is an interesting extension to Postgres that adds optimized support for time series data. In this post, ShiftLeft writes about why they chose TimescaleDB, how they downsample and query downsampled data, how they monitor with Prometheus, and more.
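For readers new to downsampling, the idea can be sketched in a few lines: group raw points into fixed-width time buckets and keep one aggregate per bucket. This is plain Python for illustration only, not ShiftLeft's approach; in TimescaleDB the bucketing would typically be done in SQL (for example with its `time_bucket` function). The function name `downsample` and the choice of the mean as the aggregate are assumptions.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    `points` is an iterable of (unix_timestamp, value) pairs.
    Returns a list of (bucket_start, mean_value) tuples sorted by time.
    """
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    buckets = defaultdict(list)
    for ts, value in points:
        # Round the timestamp down to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

For example, `downsample([(0, 1.0), (30, 3.0), (60, 5.0)], 60)` returns `[(0, 2.0), (60, 5.0)]`: the first two points share the bucket starting at 0, and the third starts a new bucket.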

A good introduction to the Calvin protocol, which is used by FaunaDB to provide distributed consistency with high availability. The post describes the CAP theorem, the evolution of distributed databases, Google Spanner, and a bit more.

If you're in an organization using Windows, this post covers centralizing Exchange Server logs using the Amazon Kinesis Agent for Windows.

Zhaopin, which operates a large job board in China, writes about how they rolled out Apache Pulsar to replace RabbitMQ and Apache Kafka. They describe the features of Pulsar and the reasoning that led to their choice.

Hortonworks recently added support for Kafka Streams to their platform. They've published a comparison between Kafka Streams and the other streaming platforms they support: Apache Storm and Apache Spark Structured Streaming.

QuickBooks has two posts about their data platform, in which they describe their tool for automating data discovery and detecting schema changes, and how they track data lineage. The implementation details are sparse, but these tools might act as good inspiration (there are some very compelling screenshots!) if you're building something similar.

A good collection of hard-earned tips and tricks for Apache Spark. These include suggestions for monitoring with Graphite, analyzing GC logs with GCeasy, using JDBC sources and sinks, and some app-level optimizations.

PayPal has over 7PB of data in Apache Kafka, so they've got some good experience with building fast data products. This presentation covers their architecture, like how they use change data capture and why you should use Apache Avro for your data in Kafka.

A collection of exercises to learn Apache Kafka Streams. There's a test suite to verify that your solution is correct.

The Hortonworks blog has a post describing how to build a Kafka Streams-based application to capture and analyze sensor data from a trucking fleet. The post describes the several Kafka Streams microservices (and provides working Java code) to analyze, join, and window the various data streams.


The Calls for Papers for Kafka Summit New York and Kafka Summit London in 2019 are now open. The deadline for both is December 20th.

This article has a pretty good comparison of how responsibilities break down between data engineers and data scientists.

The Call for Proposals for DataEngConf SF, which takes place in April, is also open through January 11th.


Post a job to the Data Eng Weekly job board for $99.


Transport is a recently open-sourced (from LinkedIn) API and framework for writing User Defined Functions that work across a number of projects (e.g. Hive, Presto, and Spark). The introductory post from the authors describes how they've architected the project for high performance by automatically generating platform-specific wrappers.

Wallaroo 0.6.0 is out with a new Wallaroo API. The release notes have a fantastic overview of what's changed in the new release of their streaming platform and how to upgrade.

DataStax Enterprise 6.7 was released. The release includes a new Kafka Connector, backups to blob storage, an enhanced Spark Connector for analytics, and more.

Apache HBase 2.0.3 is out with stability and bug fixes.

Version 2.5.2 of Apache Kylin, the OLAP interface for the Hadoop ecosystem, is out.

Apache CouchDB released version 2.3. The new version has improved security, better performance, bug fixes, and several operational improvements.

Apache Impala 3.1 is out. It includes some major new features, like support for ORC files.

A calculator, in OpenDocument spreadsheet format, for modeling Kafka clusters (including instance types/costs for various cloud providers).


Curated by Datadog


Bay Area Flink Meetup @ Mix (Mountain View) - Tuesday, December 11


The Present State of Operational Analytics and Apache Druid (Atlanta) - Thursday, December 13


Scalability with Apache Kafka & Overcome the Challenges of Being a Woman in Tech (London) - Tuesday, December 11


Using Apache Kafka from Go by Javier Sanz (Madrid) - Tuesday, December 11


Apache Kafka (Brest) - Monday, December 10

First Apache Kafka Lyon Meetup! (Lyon) - Monday, December 10


Kafka at Klarrio (Antwerpen) - Tuesday, December 11


ATM Fraud Detection with Apache Kafka and KSQL (Frankfurt) - Monday, December 10


Event Driven Microservices with Kafka & Evolution of the Data Pipeline in Agoda (Singapore) - Tuesday, December 11

Good Company Series ft. Shopee: Data Engineering & Analytics Sharing (Singapore) - Thursday, December 13