Data Eng Weekly

Data Eng Weekly Issue #281

16 September 2018

New open source projects from Facebook, LinkedIn, Two Sigma, and Oath this week. Several great posts about company's data experiences—the Netflix Keystone platform, Hike's experiences with BigQuery, Clio's experience sharding a production database, nextgen timeseries database at Pinterest, optimizing Redshift at Plaid, and more. And based on some of the news out of Strata, it sounds like Hadoop is really getting ready to ride the Kubernetes wave.


Crunch Data Engineering and Analytics Conf (Oct 29-31, Budapest) is offering you a $60 discount with code WeeklyCrunch on Regular and Late Bird tickets. Google, Slack, Apache Arrow… full lineup below.


Azure Data Factory is a tool for visually designing and running ETLs between various systems (it has a bunch of connectors). This tutorial demonstrates setting up a job to load data from blob storage to a SQL database.

Hike shares their experiences in moving from a Hive-based ad hoc analytics system to Google BigQuery. They saw good speedups, especially after making use of clustered tables. They detail their tooling and why they enabled require_partition_filter as a guard rail. Overall, they're seeing 50x speedups and half the cost.

Clio recently went through the process of sharding their online MySQL database, and they've documented the details of the transition. Among these, they applied a regex to detect which operations contained joins and transactions that might be problematic. Lots of practical advice if you're facing something similar.

Autotrader has a good walkthrough of setting up Spark to send logs to Logstash using the logstash-gelf library.

The Plaid technology blog has a great overview of how they analyzed Redshift performance of the queries powering their BI dashboards in Periscope and what changes and improvements they made. These included some well known patterns like VACUUM/ANALYZE and also adding jobs to their Airflow workflow to precompute some rollups.

Keystone is Netflix's platform for real-time stream processing for analytics. It's built on Apache Kafka and Apache Flink (in addition to a number of Netflix tools). This overview shows just how big the challenges are for building a multi-tenant tool at their scale—all the various flavors of stream processing are needed. The post then describes how they've built the system to meet those requirements and to be self-service with good operational characteristics.

"Streams and Tables: Two Sides of the Same Coin" formalizes some of the key concepts in Kafka Streams. It describes the trade-offs related to processing and event time, and walks through the Kafka implementation as a case study.

Heap analytics has a fascinating debug story about how unexpected ClassLoader behavior led to problems in their Flink jobs. The post includes a lot of great JVM debugging tools to add to your tool belt (javap, BTrace, and -verbose:class).

This presentation discusses the evolution of the Hadoop ecosystem, and it argues that we are currently in a state of "deconstructed database." That is, there are a number of components—storage, query model, data exchange, etc. that have evolved and can often be swapped in and out. The slides close with some predictions about the future.

Pinterest writes about their OpenTSDB replacement, Goku, which is wire-compatible and written in C++. The post talks about the architecture of the system (data replication and disk-based storage are forthcoming) and describes its performance characteristics. Pinterest is known to be big users of Apache HBase, so it's notable that they had enough practical challenges running a large cluster for OpenTSDB to motivate building a non-HBase replacement.

This KSQL tutorial shows how to build a streaming application to track music stream events to build an all-time (and last 30 seconds) play count.

The AWS blog has published a sample Complex Event Processing application built on Apache Flink and EMR. It's built to detect brush fires based on sensor data.


Mount Sinai School of Medicine is hiring Data Engineers; come work on cool research and important applied problems in NYC's largest healthcare system! to apply


Hortonworks has announced the Open Hybrid Architecture Initiative, which is a project to improve hybrid architecture of Hortonworks' products. There are three phases—containerization, separating compute and storage, and Kubernetes / OpenShift integration (in partnership with RedHat and IBM). Along with a glimpse at future plans from Cloudera (second post) who is looking at moving to Kubernetes (MapR has supported K8s for a while), we're liking seeing the beginning of the end of YARN for workflow management.

Videos and slides from Flink Forward Berlin, which took place two weeks ago, have been posted.

Strata NYC was this week. This article has coverage of a number of announcements and themes from the conference.


Have you checked out the Data Eng Weekly job board yet? New job this week:

Linux Big Data Engineer, G-Research, London:

Post a job for $99.


Flint is a new open source library for time series data from Two Sigma. It provides primitives for manipulating entire time series, such as joining, windowing, and resampling. This blog post gives an overview of the API, which is Python.

Yahoo/Oath have open-sourced the Oak library that implements a hybrid on-heap/off-heap concurrent ordered map for the JVM. It has great impressive scaling and memory improvements over similar implementations, and it is already being integrated into Druid, which is a system that makes heavy use of this type of data structure.

Hortonworks Data Analytics Studio is now generally available. DAS is a web application that provides things like auto-complete of Hive queries, recommendations to improve query performance, and much more. Tgere are some good screenshots and gifs in the announcement.

BlueData has announced support for Google Cloud Platform and Microsoft Azure in their Big-data-as-a-Service platform.

Version 2.4.1 of Apache Kylin, the OLAP engine for big data systems, was released. It includes 22 bug fixes and improvements.

LinkedIn has open sourced TonY, their TensorFlow on YARN library. It includes fault tolerance via checkpointing to HDFS, leverages Hadoop's GPU scheduling and isolation, and has forthcoming support for TensorBoard, which is a diagnostic tool for TensorFlow.

Facebook has open sourced LogDevice, its distributed log system (with many similarities to Apache Kafka / Pulsar). It's written in C++ and there's a new website with docs on the architecture, configuration, running locally, and more.


Crunch Data Engineering and Analytics Conf (Oct 29-31, Budapest) is offering you a $60 discount with code WeeklyCrunch on Regular and Late Bird tickets. Google, Slack, Apache Arrow… full lineup below.

Mount Sinai School of Medicine is hiring Data Engineers; come work on cool research and important applied problems in NYC's largest healthcare system! to apply



Scott McMahon on Fast and Easy Stream Processing (San Diego) - Tuesday, September 18

Bay Area Apache Spark Meetup @ Adobe (San Jose) - Wednesday, September 19

Beachbody’s Cloud Data Lake Architecture + Harnessing Data Quality... (Santa Monica) - Thursday, September 20

Gwen Shapira: Peeking Behind the Curtains of Serverless Frameworks (San Francisco) - Thursday, September 20

New York

Event-Driven Micro Apps with CQRS & Kafka + Building Microservices (New York) - Monday, September 17


September Presentation Night Hosted by Quantum Black (Boston) - Wednesday, September 19


South West Ruby: Kafka Special (Bristol) - Tuesday, September 18

First Apache Airflow London Meetup (Victoria) - Thursday, September 20


Apache Spark Streaming + Kafka: An Integration Story (Barcelona) - Monday, September 17

Modular Spark: "Spark & AI Summit" + DataEngConf Preview with Albert Franzi (Barcelona) - Thursday, September 20


Big Data Stacks and Kafka Magic (Utrecht) - Tuesday, September 18


Apache Kafka Makes Everything Better? Lecture + Coding Session (Leipzig) - Thursday, September 20


Blending Event Stream Processing with Machine Learning Using the Kafka Ecosystem (Milan) - Wednesday, September 19


Introduction to Azure Databricks (Brisbane) - Thursday, September 20