Data Eng Weekly

Data Eng Weekly Issue #272

08 July 2018

Lots of interesting topics this week, including Flink at Netflix, Facebook's embedded machine learning library, a look inside the core Spark codebase, and how Goodreads mirrors data from DynamoDB for analysis by Amazon Athena.


Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download Dremio or visit its website to learn more.


Spiral is Facebook's embedded real-time machine learning library. It can be run in an embedded mode, which enables a few interesting use cases, like building a model to determine which entries to invalidate in a cache based on the parameters of an update.

Have you been bitten by running a job (or an entire workflow) with bad input? This post outlines a reusable Airflow operator that executes SQL queries to sanity-check data before running the next job.
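The idea of a SQL sanity check can be sketched without Airflow at all. The snippet below is a minimal illustration, not the operator from the post: it assumes each check is a SQL query that counts "bad" rows and passes when that count is zero. The table, column names, and the helpers `run_sanity_checks` and `demo` are hypothetical, and SQLite stands in for whatever database the workflow uses.

```python
import sqlite3


def run_sanity_checks(conn, checks):
    """Run each SQL check and return the names of the checks that failed.

    Assumption: every query returns a single count of offending rows,
    and a check passes only when that count is 0.
    """
    failed = []
    for name, sql in checks.items():
        bad_rows = conn.execute(sql).fetchone()[0]
        if bad_rows != 0:
            failed.append(name)
    return failed


def demo():
    # Hypothetical input table with one NULL user_id and one negative amount.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(1, 10, 5.0), (2, None, 7.5), (3, 12, -1.0)],
    )
    checks = {
        "no_null_user_id": "SELECT COUNT(*) FROM events WHERE user_id IS NULL",
        "no_negative_amount": "SELECT COUNT(*) FROM events WHERE amount < 0",
        "no_duplicate_ids": (
            "SELECT COUNT(*) FROM "
            "(SELECT id FROM events GROUP BY id HAVING COUNT(*) > 1)"
        ),
    }
    return run_sanity_checks(conn, checks)
```

In a workflow, a non-empty result would fail the task so that downstream jobs never run on the bad input.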

Consistency in distributed systems can be really hard to understand. This map provides a good overview of the various models. Each model links to a page where you can read a whole lot more about its semantics.

Databases 101 aims to provide a good overview of databases, and to some extent, distributed systems. It covers a lot of ground, like types of databases, consistency models, and ETL. While it's aimed at a non-technical audience, it's a great reference even if you know this stuff pretty well.

A common data architecture problem, especially when integrating with legacy systems, is how to handle out-of-order events in Kafka. This post describes a solution in which events are cached in Redis.
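One general way to handle out-of-order events is a reorder buffer: hold early arrivals until the gap before them is filled, then release everything in sequence. The sketch below is not the post's design. It assumes each event carries a per-key sequence number starting at 0, and it uses plain dictionaries where a real deployment might use Redis. The class `ReorderBuffer` and its method names are hypothetical.

```python
class ReorderBuffer:
    """Releases events per key in sequence order, buffering early arrivals."""

    def __init__(self):
        # key -> {seq: event}; an in-memory stand-in for a Redis cache.
        self._pending = {}
        # key -> next sequence number that may be released.
        self._next_seq = {}

    def accept(self, key, seq, event):
        """Buffer one event and return the events that are now releasable."""
        expected = self._next_seq.get(key, 0)
        if seq < expected:
            # Already released: treat as a duplicate and drop it.
            return []
        pending = self._pending.setdefault(key, {})
        pending[seq] = event
        released = []
        # Drain the buffer for as long as the next expected event is present.
        while expected in pending:
            released.append(pending.pop(expected))
            expected += 1
        self._next_seq[key] = expected
        return released
```

A production version would also need a timeout or eviction policy for gaps that never fill, which this sketch leaves out.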

The Databricks blog has a post describing a streaming application built with AWS Lambda, Amazon Kinesis, and the Databricks platform (including Databricks Delta).

The Morning Paper covered two data systems that aim to provide strong security and privacy guarantees. The first, EnclaveDB, implements a secure database that can be run in an untrusted environment (there are some constraints, like all queries must be determined ahead of time). The second, Oblix, implements a search index over encrypted data. There's a lot to each of these, and they're probably a good peek at some techniques that might soon get commercial adoption.

Interesting use case of converting XML to RDF at relatively large scale (six billion RDF tuples) using Apache Beam (running on Google Cloud Dataflow) and coordinated by Apache NiFi.

A great deep dive into the Spark codebase (including testing and build tools) by way of adding a new UDF to Spark SQL.

Netflix's stream processing system processes over 4 trillion events per day (36GB/sec) using thousands of servers. To enable teams within the organization, they've built self-service infrastructure on Apache Flink and Apache Kafka. Much of this presentation centers on why and how they use Flink, including the scalability challenges they've faced (e.g. in checkpointing), fine-grained recovery, and how they implement reprocessing from scratch.

A look at how Goodreads transfers data from DynamoDB to S3 (in Parquet format) for querying with Amazon Athena. Since DynamoDB is optimized for transactional queries, this enables some analytics use cases that are awkward to run directly against DynamoDB.

Pandas on Ray is a project to implement a distributed processing backend for Pandas, the popular Python data analysis framework. This post describes the state of the implementation and provides some benchmarks showing its performance on a multi-core machine.


Etsy is hiring Senior Data Engineers in New York.

Netflix is hiring a Senior Data Engineer, Playback in Los Gatos.

Job postings on the Data Eng Weekly job board are now just $99. Submit a job to reach your peers looking for something new!


Presto Summit is coming up in just over a week, on July 16th. It's free and takes place at Facebook.

Apache Cassandra is 10 years old, and Datastax has an infographic that highlights some key milestones in the project's history.

California has enacted a new law, the California Consumer Privacy Act of 2018, that goes into effect in 2020. It has a lot of similarities to GDPR—including rights to know and delete your data.



Azure HDInsight has added updates to Apache Spark, Apache Kafka, ML Services, and more.

MapR 6.1 was recently released. This post describes some of the main features, including changes to MapR-DB (their database) and MapR-ES (their Kafka-compatible pub/sub system).

The Apache Kafka team announced maintenance releases of 0.10 and 0.11. Both releases include a number of bug fixes and improvements.


Curated by Datadog



Fast and Easy Stream Processing (Irvine) - Wednesday, July 11


Seattle Apache Kafka Meetup (Bellevue) - Thursday, July 12


Big Data at the Speed of Business with the Elastic Stack (Denver) - Wednesday, July 11


Optimizing and Indexing Streams with Kafka for Low-Latency Queries (Austin) - Tuesday, July 10


Event-Driven Architectures in Node.js Using Kafka (Detroit) - Wednesday, July 11


Free Big Data Open Class: Getting Started with Apache Spark on AWS (Toronto) - Saturday, July 14


Intro to Big Data: Apache Kafka (Guatemala) - Tuesday, July 10


Big Data Bristol Evening Session (Bristol) - Wednesday, July 11


JavaScript Everywhere + Kafka: Bringing Unicorn Technology to the Enterprise (Bucharest) - Wednesday, July 11

Moscow Apache Ignite Meetup #3 (Moscow) - Tuesday, July 10


ML & Analytics Optimized by Cloudera Altus (Melbourne) - Wednesday, July 11