Data Eng Weekly

Data Eng Weekly Issue #259

08 April 2018

Lots of great content this week: stream processing with Apache Kafka, consistent hashing strategies, several posts on data infra architecture, and two posts related to Apache Hive performance. Flink Forward is this week and DataEngConf is next week (see News for a discount code). In releases, Apache Hadoop 3.1.0 is out, Apache Hive released a new version to fix some security issues, and ElastiCluster is a cluster provisioning tool that recently gained support for Hadoop and Spark.


Are you looking to elevate your data culture? Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.


This post describes one company's journey with BigQuery: the advantages it provides in performance and usability, but also how that ease of use can be a double-edged sword, because it's easy to forget to optimize when it's so easy to throw more compute (and money) at a query.

While it's been around for a while, I only recently came across the WSO2 streaming SQL system. In addition to analyzing data from Kafka, the open-source engine can take in data from lots of other sources (like HTTP, MQTT, JMS, and email). This post provides a good introduction to its SQL syntax, including how to use it for lots of common operations.

If you want to test out an unreleased or custom build of Spark, this post walks through the build and deploy steps for running on a YARN cluster or Kubernetes. The tutorial is tailored to Google Cloud Platform, but most of the steps are broadly applicable.

This post is a great overview of consistent hashing algorithms, which are a common building block in a distributed system. The post starts off with the classic ring-based consistent hashing mechanism and goes into more recent inventions like Jump Hash, Multi-Probe Consistent Hashing, Rendezvous Hashing, and Maglev Hash (three of the four are from Google).
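As a rough illustration of the classic ring-based mechanism mentioned above, here is a minimal Python sketch. It is not taken from the linked post: the `HashRing` class, the `replicas` parameter (virtual nodes per server), and the use of MD5 as the hash function are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Assumption: any well-mixed hash works; MD5 is used here only for convenience.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical ring-based consistent hash with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._points = []  # sorted list of (position on ring, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Place `replicas` virtual points for this node around the ring.
        for i in range(self.replicas):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def get_node(self, key: str) -> str:
        if not self._points:
            raise ValueError("ring is empty")
        # Walk clockwise to the first point at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["a", "b", "c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The property worth noticing is that every key in `moved` now maps to the new node `d`; keys never shuffle between the existing nodes, which is what distinguishes this from a plain `hash(key) % n` scheme.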

This post summarizes a recent presentation on building and deploying ML models at They deploy docker containers using Kubernetes (which has some experimental support for GPUs) and, for serving, pull the pre-trained models from HDFS.

The Teads data analytics team writes about their data infrastructure, which is spread across Google Cloud Platform for BigQuery and Dataflow and Amazon Web Services for Kafka and Redshift. They describe a number of key architectural decisions that they made for ingesting, storing, and querying data in BigQuery as well as some limitations. Lots of good tips if you're considering a similar architecture (including some gotchas related to pricing to keep an eye on).

MR3 is a new execution engine for Apache Hadoop. It claims to have feature parity with Tez and improved performance. This two-part series describes the performance gains when running Hive on MR3 vs. Tez.

Hortonworks has analyzed Hive LLAP runtime when data is stored in HDFS or Amazon S3 (tiered vs. decoupled storage strategies) for cloud workflows, and they found that there's a small fixed cost overhead of storing data in S3 rather than HDFS (at least for their test scenario).

A quick overview and performance evaluation of Spark RDDs and DataFrames. The performance eval uses the Wikipedia dataset and some simple aggregations both with Spark and PySpark.

Using the Syslog Apache Kafka Connect plugin, you can get syslog data into Kafka in Avro format for analysis. This post shows how to analyze that data with KSQL once you've collected it.

Great overview of the stream and table concepts in stream processing. The post has lots of animations to help illustrate the relationship between the two, and it has some examples in Scala and KSQL for building a table from a stream.
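The stream/table relationship can be sketched in a few lines of plain Python. This is not code from the linked post (which uses Scala and KSQL); the function names are hypothetical, and treating a `None` value as a delete marker is an assumption borrowed from how tombstones are commonly handled in compacted changelogs.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Event = Tuple[str, Optional[int]]


def table_from_stream(events: Iterable[Event]) -> Dict[str, int]:
    """Fold a changelog stream into a table holding the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)  # assumption: None acts as a tombstone (delete)
        else:
            table[key] = value  # later events overwrite earlier ones
    return table


def running_counts(keys: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Aggregate a stream into a count table, emitting every table update as a new stream."""
    counts: Dict[str, int] = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


changelog = [("alice", 1), ("bob", 5), ("alice", 3), ("bob", None)]
latest = table_from_stream(changelog)

updates = list(running_counts(["a", "b", "a"]))
rebuilt = table_from_stream(updates)
```

Replaying `changelog` leaves only the latest value for `alice`, while `updates` shows the reverse direction: a table's changes form a stream that can be replayed to rebuild the table.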

This tutorial walks through running Apache Kafka on Microsoft HDInsight and hooking it up to a Databricks Spark cluster for stream processing.


Flink Forward is this week in San Francisco. The data Artisans team has been previewing the sessions on their blog over the past few weeks; here's part 6 of their coverage.

DataEngConf, as mentioned last week, is in just over a week in San Francisco. The conference is offering 25% off for readers of Data Eng Weekly, using the code DEWEEKLY.

The InfoQ podcast has an interview with Uber's Danny Yuan on their streaming systems, which include Apache Kafka, Apache Flink, Apache HDFS, and more. They process around 1 million messages per second, which feeds into OLAP systems, ML models, and more.

RedMonk has analyzed the state of purpose-built time series database engines as well as some general-purpose databases that are often used for time series data. They look at popularity on GitHub and Stack Overflow and discuss the large amount of fragmentation we're currently seeing.




HUE 4.2 is out. The release focuses on cloud, analytics, and supportability. More details on those improvements in the release notes.

Apache Hadoop 3.1.0 was released with the caveat that it's not yet considered production-ready. Major enhancements include updates to S3 support/performance and new features of the capacity scheduler. YARN also gets GPU and FPGA features as well as support for long-running services.

Apache Hive 2.3.3 is out. It contains security fixes related to 3 CVEs, one of which has been around as far back as Hive 0.6.0.

Scio, the Scala library for Apache Beam and Google Cloud Dataflow, has released version 0.5.2 with new JavaConvertors implicits, custom validation for BigQuery, and more.

ElastiCluster, a tool for provisioning compute clusters, has added support for Hadoop and Spark clusters by way of BigTop. It's built with Ansible, has great documentation, and is licensed GPLv3.


Curated by Datadog



Airflow Meetup @ WePay (Redwood City) - Wednesday, April 11


Cleveland Red Hat User Group Meeting (Cleveland) - Tuesday, April 10

New York

High-Performance Data Analytics & Visualizations for Volume, Variety, Velocity (New York) - Tuesday, April 10


Implementing Microservices with Domain Events + Replacing MirrorMaker (Vancouver) - Tuesday, April 10


GCP Meetup at Google (Stockholm) - Tuesday, April 10


Apache Kafka Meetup @ Zalando (Helsinki) - Wednesday, April 11


Creating a Big Data Architecture with Apache Spark (Barcelona) - Thursday, April 12


Collaborative Music with Kafka! (Bordeaux) - Thursday, April 12


Lugano Tech Talks (Lugano) - Tuesday, April 10


Deep Learning on Hadoop (Prague) - Thursday, April 12


Event-Driven Architecture with Kafka Streams (Katowice) - Friday, April 13

RUSSIA

Introduction to Hadoop and Spark (Moscow) - Thursday, April 12


Real-Time Data Pipelines with Apache Kafka & Java Futurity (Bangalore) - Saturday, April 14


Sydney Data Engineering Meetup (Surry Hills) - Thursday, April 12


ZA HUG #4 (Johannesburg) - Thursday, April 12