Data Eng Weekly

Data Eng Weekly Issue #260

15 April 2018

A couple of great debugging stories, tips for testing a distributed system in the face of failure, details on the Presto cost-based optimizer, and more great technical posts. There are also several good posts on the future of the big data industry and research communities.


Are you looking to elevate your data culture Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.


TiDB, which is a new distributed database with MySQL compatibility, has written about how they test their system by injecting faults. They describe a number of fault injection tools, and their Kubernetes-based platform for automatically running these tests.

The Uber engineering team has written a post describing their work over the past years to scale HDFS for large data volumes and number of queries. A lot of the scalability challenges are with the NameNode, and they have solutions like physically partitioning clusters (but presenting a unified view via ViewFS) based on usage characteristics, reducing the number of small files, tuning the NameNode garbage collector, and a new read-replica NameNode (called an Observer NameNode).

Using the TPC benchmark, this first post compares Presto with the new cost-based optimizer vs. without it (there are some impressive speedups—half fo queries got a 2-5x speedup). The second post describes how the optimizer works and why this results in significant speedups. It's a good technical article, going into join re-ordering, projections, and optimization considerations.

This post has a great overview of monitoring Apache Kafka consumer lag—the use cases and SLOs at ZipRecruiter, how to use the Kafka CLI tools along with a simple parsing script to get this information, and several example graphs that capture interesting consumer lag events.

Channable has written about debugging a performance issue with a long-running Apache Spark application. The post goes into lots of interesting details on JVM internals (e.g. custom class loaders and garbage collection of classes), Spark internals (e.g. how the driver cleans up broadcast data), and their metrics and monitoring strategy that enabled hunting down these bugs and verifying their fixes.

The Hortonworks blog has an analysis, based on code changes and JIRA tickets, of contributions to Hadoop 3.1.0.

If you're using Kubernetes, here's a good list of best practices to secure your cluster. It draws from a bunch of other documentation, and also links out to several other resources.

This repo has a good collection of links to articles, videos, and presentations on distributed consensus algorithms (e.g. Paxos, Raft, Zab).

This is a fun pair of articles about executing C functions as UDFs in BigQuery by compiling the code to web assembly/javascript. Performance starts out pretty slow, but batching the data before UDF call results in some massive speedups. Maybe not the most practical, but a clever solution nonetheless.

While this article is aimed mostly at the DB research community, it has a great overview of accomplishments from the past 10 years (including the rise of the Hadoop ecosystem and research on distributed databases) as well as upcoming research areas (like learned indexes) that might make their way into production systems sooner than later.

The first of these two posts surveys the options for working with immutable event data in Apache Kafka in the face of the GDPR. Of these options, it proposes a solution for implementing the encryption-based solution. The second article has a number of details on implementing this method, which is built using Kafka streams and a Kafka topic containing encryption keys.

Replacing Amazon Redshift with Amazon Athena can provide massive cost savings if performance is still good enough. In this post, the author describes several improvements—including improved file formats and batching queries—to further reduce cost in a use case with relatively small data sets.

This post describes how to perform some heuristic-based anomaly detecting from syslog data using a KSQL program and how to use the Python Kafka connector to send resulting alerts to Slack.

I always like a good debug story—this one is about debugging performance issues in Impala caused by a change in the query pattern.


The ARCHITECHT Show is a podcast that folks might find interesting. This episode talks with Stephan Ewan, Co-Founder and CTO of dataArtisans, about Apache Flink, stream processing, and more.

Designing Event-Driven Systems is a new eBook, which is available for free behind an email/phone form from Confluent.

This post recaps the Snowflake "Unite the Data Nation" conference. There are some good observations on changes to data architectures (separating compute from storage and IoT adoption) as well as a list of several Snowflake features.

I was the guest on the Data Engineering Podcast this week, where we discussed the process of writing this newsletter and some of the data engineering problems I've worked on.

Cloudera held its industry analysts event last week, and this post highlights five questions that are facing Cloudera (and likely several other vendors in the same space). Everything from the the volatility of the stock market to questions about the future of the company (and in some ways, the broader big data ecosystem).


Are you looking to elevate your data culture Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.


AWS Glue now supports Apache Spark 2.2.1, and Amazon EMR is now supporting Spark 2.3.0, Apache HBase 1.4.2, and Presto 0.149.

Apache BookKeeper 4.6.2 is out with fixes to Java 9/10 support, performance improvements, and other fixes.

Version 1.6.0 of Apache NiFi was released last week with a number of new and improved processors (HBase, Druid, Mongo, Influx) and UI/UX improvements for multi tenancy and managing permissions.

MapR has released an update to support Apache Drill 1.13. Among the new features is support for SQL on MapR streams. The post also highlights some of the features of the release, including several performance enhancements.


Curated by Datadog ( )



Apache REEF (Bellevue) - Wednesday, April 18


Elasticsearch for Apache Hadoop Use Cases and Configurations (Denver) - Thursday, April 19


Data Science Project Using Spark, Spark Streaming, ML, & Kafka with Talend (Houston) - Thursday, April 19


Children’s Healthcare of Atlanta: NextGen Sequencing Using Hadoop, Spark, & Kudu (Atlanta) - Thursday, April 19

North Carolina

Hadoop and Enterprise Workflow: Getting Your App Implemented for Maximum Value (Raleigh) - Wednesday, April 18


Huge Steps for Apache NiFi: Talks on NiFi in Finance and Powerful New Features (Laurel) - Tuesday, April 17


LeedsDevops Meetup (Leeds) - Tuesday, April 17

Data Processing and SecOps at Scale (London) - Wednesday, April 18


Future of ETL Isn't What It Used to Be, with Gwen Shapira (Paris) - Tuesday, April 17


Apache Flink Meetup Berlin x DataWorks Summit (Berlin) - Tuesday, April 17

Tim Berglund: Stream Processing with KSQL (Cologne) - Wednesday, April 18

Confluent and Payback Talk about Kafka, KSQL, and Spark (Munich) - Wednesday, April 18


Kafka Beijing Meetup (Beijing) - Saturday, April 21