Data Eng Weekly

Data Eng Weekly Issue #306

24 March 2019

Lots of coverage of Apache Spark this week as well as articles on Apache Kudu, Postgres, Apache Kafka Streams and more. And in releases, MR3 had a new release, kaf is an interesting new CLI tool for Kafka, and Prefect has open sourced the core library of their workflow engine.


This post describes several aspects of running the BigQuery Kafka Connect plugin, such as rate limiting, how the connector handles deletes, and handling data deduplication (or lack thereof).

The Apache Kudu blog has an overview of the KuduTestHarness for testing JVM applications that use Kudu.

The pgDash blog has a post on the configurations for CPU, memory, network, and more that can be tweaked to vertically scale your Postgres deployment on beefier hardware.

A look at using Mode, a hosted service for Python and R Notebooks, with data in Amazon Redshift. The post also describes the evolution of BI workflows at most companies and how to use Fivetran to ETL data into Redshift.

Databricks Delta supports the SQL MERGE command, which can be used to update records in a Databricks Data Lake. Their post covers items like deleting records for GDPR compliance and applying updates based on change data capture.

The Qubole blog has a post on the Kinesis Connector for Spark Structured Streaming, which covers the connector's architecture and has an example of a streaming job.

This post provides an extensive overview of Apache Spark's windowing functions for things like ranks within an ordered window, lag & lead to compare to previous/next/other values in the window, and more. The post provides code to illustrate each of these by executing against a sample data set.
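As a rough illustration of what rank, lag, and lead compute, here is a plain-Python sketch. It does not use Spark and is not code from the linked post. The function name `add_window_columns`, the column names, and the sample rows are all made up for this example. The rank follows the usual SQL `RANK()` convention, where ties share a rank and a gap follows.

```python
from itertools import groupby
from operator import itemgetter


def add_window_columns(rows, partition_key, order_key):
    """Add rank, lag, and lead columns per partition (hypothetical helper)."""
    out = []
    # Sort by partition, then by the ordering column inside each partition.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        group = list(group)
        rank = 0
        for i, row in enumerate(group):
            # Ties keep the previous rank; a new value jumps to its position.
            if i == 0 or row[order_key] != group[i - 1][order_key]:
                rank = i + 1
            out.append({
                **row,
                "rank": rank,
                # lag: value from the previous row in the window, if any.
                "lag": group[i - 1][order_key] if i > 0 else None,
                # lead: value from the next row in the window, if any.
                "lead": group[i + 1][order_key] if i + 1 < len(group) else None,
            })
    return out


sample = [
    {"dept": "a", "salary": 20},
    {"dept": "b", "salary": 5},
    {"dept": "a", "salary": 10},
    {"dept": "a", "salary": 20},
]
for row in add_window_columns(sample, "dept", "salary"):
    print(row)
```

In Spark itself, the same idea is expressed with a window specification that is partitioned by one column and ordered by another.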

The Confluent blog looks at the new Suppress operation in Apache Kafka Streams, which can be used to simplify use cases like alerting. Suppress supports both a time delay and waiting until a particular window has closed to trigger an alert. The post also covers some considerations related to in-memory buffering.

When running an EMR cluster, you have the option of storing data in S3 or in HDFS on the cluster. This article describes a number of options for copying data between S3 and HDFS, and it shows how (based on a couple of simple Presto queries) querying data in HDFS can be much faster.

This article covers many of the main performance tuning parameters of an Apache Spark job, such as dynamic allocation, parallelism, and speculative execution.
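For orientation, the three knobs named above map to Spark configuration properties along these lines. This is a sketch with illustrative values, not recommendations from the article. The helper `to_submit_args` is a made-up name for this example, and suitable settings depend on the cluster and the workload.

```python
# Illustrative values only; tune for your own cluster and job.
spark_conf = {
    # Dynamic allocation: let Spark grow and shrink the executor count.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    # The external shuffle service is commonly enabled with dynamic allocation.
    "spark.shuffle.service.enabled": "true",
    # Parallelism: default partition counts for RDD and SQL shuffles.
    "spark.default.parallelism": "200",
    "spark.sql.shuffle.partitions": "200",
    # Speculative execution: relaunch slow-running tasks on other executors.
    "spark.speculation": "true",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit style --conf arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(spark_conf)))
```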

An intro to the new EXCEPT ALL and INTERSECT ALL SQL operations in Apache Spark 2.4.0.
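Unlike plain EXCEPT and INTERSECT, the ALL variants keep duplicate rows, which is easy to picture with multisets. The sketch below mimics those semantics in plain Python with `collections.Counter`. It is not Spark code and is not taken from the linked post. The function names and the sample data are invented for illustration.

```python
from collections import Counter


def except_all(left, right):
    """Rows of left minus rows of right, respecting duplicate counts."""
    # Counter subtraction drops entries whose count falls to zero or below.
    return sorted((Counter(left) - Counter(right)).elements())


def intersect_all(left, right):
    """Rows common to both sides, each kept min(count_left, count_right) times."""
    return sorted((Counter(left) & Counter(right)).elements())


left = [1, 1, 1, 2, 3]
right = [1, 2, 2, 4]

print(except_all(left, right))     # [1, 1, 3]
print(intersect_all(left, right))  # [1, 2]
```

With the non-ALL operators, duplicates would be removed, so the same inputs would give `[3]` for EXCEPT and `[1, 2]` for INTERSECT.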

This post looks at building a MapReduce framework from scratch on Kubernetes. The system is written in Go and uses HTTP for transport. While far from a production system, it's interesting to see what building MapReduce from first principles in Kubernetes might look like.


Senior Data Engineer (Spark), N26, Berlin

Software Engineer - Data Platform, Fitbit, San Francisco, CA

News is joining the Linux Foundation. The project enumerates a set of principles (one example is "Recognize and mitigate bias in ourselves and in the data we use.") and has courseware on GitHub.

The International Conference on Extending Database Technology is this week, and the papers have been posted online. Lots of interesting content, including papers on Spark SQL and KSQL.


A new version of MR3, the execution engine for Apache Hive, has been released. The project has posted some performance comparisons, too.

Kaf is a command line utility for Apache Kafka written in Golang. The authors credit kubectl and docker for inspiration.

Apache NiFi 1.9.1 is out. It's a maintenance release with improvements to SFTP, JSON record readers, and more.

Version 2.6.1 of Apache Kylin, the Distributed Analytics Engine, is out. There are over 25 issues included in the release.

Version 0.6.0 of Apache NiFi MiNiFi was released. Among the features are support for natively written Python processors and a new structured logging library.

Prefect has open sourced their workflow engine core library. Much more about the features of the library and the platform in their introductory post.


Curated by Datadog



Bay Area Apache Flink Meetup (San Francisco) - Monday, March 25


DesertPy: Calling Native C Code and Using Kafka (Scottsdale) - Wednesday, March 27

New York

Two Sigma Open Source Meetup (New York) - Monday, March 25

An Introduction to Streaming Data and Stream Processing with Apache Kafka (Webster) - Wednesday, March 27


Toronto Apache Spark 2.0 (Toronto) - Wednesday, March 27


ETL in Azure Made Easy with Data Factory Data Flow (Bristol) - Tuesday, March 26

"Everything Data" Launch: Exploring Data Engineering (Belfast) - Tuesday, March 26


Apache Kafka: Tips from the Trenches, or How to Fail Successfully (Madrid) - Tuesday, March 26


Kafka @ Accor & Gekko: Real-Time Issues and Big Data (Paris) - Tuesday, March 26

Beyond Brokers: A Tour of the Kafka Environment (Villeurbanne) - Thursday, March 28


Kafka Is Not Just an Author (Hamburg) - Thursday, March 28

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.