Data Eng Weekly

Data Eng Weekly Issue #269

17 June 2018

Companies have shared lots of great posts this week: Pandora's web UI for Kafka, metadata management at Netflix, GraphQL at Airbnb, robust data pipelines at Dataxu, and fronting Kafka at GO-JEK. There's also coverage of the new YARN scheduler for long-running applications, a high-performance single-server stream processing engine, and a recap of the recent Spark + AI Summit.


Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at, or visit to learn more.


Airbnb has written about their experiences implementing GraphQL as an API gateway atop Apache Thrift services. The post has a good mix of technical (their architecture, including Thrift/GraphQL translators) and non-technical (how to frame the conversation and seek compromise) topics.

Originally in Chinese, this post analyzes a recent exploit of unsecured Apache Hadoop YARN clusters that was used for cryptocurrency mining. It also outlines how to secure a cluster with publicly accessible endpoints.

Amazon DynamoDB has a change data capture feature called DynamoDB Streams. It integrates easily with AWS Lambda for real-time processing. This article explains how to use these features to compute real-time aggregates. There's a good discussion of how to tune the system for correctness, for error handling, and to increase throughput.
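As a rough illustration of the idea (not the article's code), here is a minimal Python sketch of the aggregation step a Lambda function might run over a batch of stream records. The record layout (Records, dynamodb, NewImage, OldImage, and the S/N attribute wrappers) follows the DynamoDB Streams event format; the attribute names "category" and "amount" and the function name are hypothetical, and the sketch assumes the stream is configured to include both new and old images.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate_stream_records(event, totals=None):
    """Fold one batch of DynamoDB Streams records into per-category totals.

    Assumes each item has a string attribute "category" and a numeric
    attribute "amount" (both hypothetical names for this sketch).
    """
    totals = defaultdict(Decimal) if totals is None else totals
    for record in event.get("Records", []):
        change = record.get("dynamodb", {})
        new_image = change.get("NewImage")  # present for INSERT and MODIFY
        old_image = change.get("OldImage")  # present for MODIFY and REMOVE
        if new_image is not None:
            totals[new_image["category"]["S"]] += Decimal(new_image["amount"]["N"])
        if old_image is not None:
            totals[old_image["category"]["S"]] -= Decimal(old_image["amount"]["N"])
    return totals


def image(category, amount):
    """Build an item image in the DynamoDB attribute-value encoding."""
    return {"category": {"S": category}, "amount": {"N": amount}}


sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"NewImage": image("books", "12.50")}},
        {"eventName": "MODIFY", "dynamodb": {"OldImage": image("books", "12.50"),
                                             "NewImage": image("books", "10.00")}},
        {"eventName": "INSERT", "dynamodb": {"NewImage": image("games", "5")}},
        {"eventName": "INSERT", "dynamodb": {"NewImage": image("games", "7")}},
        {"eventName": "REMOVE", "dynamodb": {"OldImage": image("games", "5")}},
    ]
}

totals = aggregate_stream_records(sample_event)
```

The sketch does not deduplicate: if Lambda retries a failed batch, the same records would be counted twice, which is one reason the correctness and error-handling tuning discussed in the article matters.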

It can be a challenge to share large research and government data sets (think atmospheric or satellite data). To make this type of data accessible, this post proposes that organizations "Place your Big Data in cloud object storage in a self-describing, cloud-optimized format." It goes into more detail about the challenges (and some solutions) unique to adopting that practice for these types of data.

Dataxu shares their solution to data synchronization—handing off data from one step in the pipeline to the next. Rather than relying on file system paths, they have a centralized "file feed" protocol that provides a number of benefits.

This post compares SABER, a single-server stream processing engine, to Apache Flink and Apache Spark. With modest hardware (20 cores, 32GB RAM), SABER outperforms a 5-node cluster of each. In some ways, this post is reminiscent of the "CLI tools are 235x faster than Hadoop" thread from a few years back.

Qubole has a post about their new query optimizer feature that estimates the total amount of memory needed for a Presto query. There are details on the design and correctness results from the TPC-DS benchmark.

Many organizations design microservices so that they each use their own data store to avoid the drawbacks of a multitenant database system. This post describes how Kafka as an event store is an interesting alternative architecture.

The Morning Paper has coverage of the Medea scheduler, which implements scheduling for long-running applications atop Apache Hadoop YARN. Medea offers constraints like anti-affinity (to keep HBase region servers on separate nodes), global optimizations, and more. The authors compare it to other schedulers like Hadoop YARN's previous scheduler and a Java version of the Kubernetes scheduling algorithm. Medea is in use at Microsoft and is part of the Apache Hadoop 3.1.0 release (YARN-6592).
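To make the anti-affinity constraint concrete, here is a toy Python sketch of a greedy placement that never puts two containers with the same tag on one node. It is only an illustration of the constraint, not Medea's algorithm or the YARN API; the function name, tags, and node names are all hypothetical.

```python
def place_with_anti_affinity(containers, nodes):
    """Greedily assign (name, tag) containers to nodes.

    Anti-affinity rule for this sketch: a node may hold at most one
    container per tag. Raises ValueError when no node satisfies the rule.
    """
    tags_on_node = {node: set() for node in nodes}
    placement = {}
    for name, tag in containers:
        for node in nodes:
            if tag not in tags_on_node[node]:
                tags_on_node[node].add(tag)
                placement[name] = node
                break
        else:
            # Every node already hosts a container with this tag.
            raise ValueError(f"no node satisfies anti-affinity for {name!r}")
    return placement


placement = place_with_anti_affinity(
    [("rs-1", "hbase-rs"), ("rs-2", "hbase-rs"), ("web-1", "web")],
    ["node-1", "node-2"],
)
```

A greedy pass like this considers one container at a time; a scheduler that optimizes globally can weigh all pending placements together.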

The GO-JEK team uses a fronting REST service for ingesting data into Kafka. That service in turn writes data to a fronting Kafka cluster, or it fails over to Redis if Kafka is down. This post explains more about the motivation and architecture.
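The failover behavior can be sketched in a few lines of Python. This is an assumption-laden illustration rather than GO-JEK's implementation: the class name is hypothetical, and the two sinks are plain callables standing in for real Kafka and Redis clients, which raise their own exception types.

```python
class FailoverWriter:
    """Write to a primary sink, falling back to a secondary sink on failure."""

    def __init__(self, primary, fallback):
        self.primary = primary    # callable standing in for a Kafka producer
        self.fallback = fallback  # callable standing in for a Redis client

    def write(self, message):
        try:
            self.primary(message)
            return "primary"
        except ConnectionError:
            # Primary is unreachable: keep the message in the fallback store.
            self.fallback(message)
            return "fallback"


def unavailable(message):
    """Simulate a sink that is down."""
    raise ConnectionError("primary sink is down")


kafka_messages, redis_messages = [], []
healthy = FailoverWriter(kafka_messages.append, redis_messages.append)
degraded = FailoverWriter(unavailable, redis_messages.append)

first = healthy.write("ride-requested")
second = degraded.write("ride-completed")
```

Messages parked in the fallback store still need to reach Kafka eventually; how that replay happens is outside this sketch.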

The Apache Hadoop YARN Service Framework makes it quite easy to deploy a long-lived application to Hadoop via a single Yarnfile definition. The Hortonworks blog has a brief overview of what it takes to migrate Apache Hive LLAP from Apache Slider to use the YARN Service Framework.

This post introduces Metacat, Netflix's tool for data discovery, programmatic dataset metadata access, and more. It is a proxy to other backends (such as the Hive metastore), and it provides advanced features via an Elasticsearch index. Metacat is open sourced on GitHub.

Data Eng Jobs

Trovit (Barcelona) and Comcast (Philadelphia) are both hiring engineers. Check out their posts and add your own!


This post has a great overview of the main themes from the recent Spark + AI Summit as well as brief recaps of a few presentations.

This list of distributed systems papers has been updated with some new content from the past 4 years. If you're interested in learning the fundamentals of distributed system theory, it's a great place to start.

Dremio has announced a new initiative to bring LLVM support to Apache Arrow. They are targeting up to 100x speedups.

Dataworks Summit is this week in San Jose. Here's a preview of some of the talks.




Pandora has open sourced KBrowse, a web UI and search tool for Apache Kafka. This post walks through how they use KBrowse at Pandora to debug issues with new content.

Apache Crail 1.0-incubating was released. Crail is a distributed storage engine that's optimized for high-performance networking and storage with hooks for data processing frameworks.

Apache Phoenix 4.14 was released. It adds support for HBase 1.4 and several CDH versions (in addition to many previous ones), resolves lots of bugs, supports GRANT and REVOKE, and more.

Version 0.3 of the Kafka Security Manager is out. It adds a gRPC/REST gateway service, a read-only mode, and support for Confluent 1.1.0.


Curated by Datadog



Apache NiFi @ DataWorks Summit (San Jose) - Monday, June 18

Apache Ambari 2.7 and Beyond Updates (San Jose) - Monday, June 18

Deep Dive into Apache Metron and Big Data Security (San Jose) - Monday, June 18

All Things Spark: Machine Learning, Atlas Integration, ORC & Hive EDW Updates (San Jose) - Monday, June 18

Kafka and Microservices: Insights from Uber and Confluent (Mountain View) - Tuesday, June 19

Birds of a Feather Sessions @ DataWorks Summit (San Jose) - Wednesday, June 20

New York

Stream Processing Double Presentation (New York) - Thursday, June 21


Intro to Spark Training (Framingham) - Saturday, June 23


Software Engineering with Spark (Herzogenaurach) - Monday, June 18

Let's Talk about Azure Databricks & Apache Spark! (Karlsruhe) - Tuesday, June 19


PyData Cyprus Meetup #4 (Limassol) - Thursday, June 21


Kafka and Stream Processing Meetup at LinkedIn (Bangalore) - Saturday, June 23


Sydney Data Engineering Meetup (Sydney) - Wednesday, June 20