Data Eng Weekly

Data Eng Weekly Issue #255

11 March 2018

The Strata Data Conference was this week, and there's coverage of a few enterprise releases made there. It was also a popular week for open source releases, with Apache Kafka, Apache Flink, and several other projects announcing new versions. There are also several tutorials, a fantastic article on the data engineering space, and great technical deep dives on Kafka at Cloudflare, Instagram's RocksDB storage layer for Apache Cassandra, and Jepsen testing for Aerospike.


At Foursquare, we understand where millions of phones go every day. Our tech and map are changing the landscape of social, travel, mobile. We’re hiring data engineers, platform engineers, tech leads, full stack web engineers, +++. Join us!


Cloudflare has a great post on their work to optimize a Kafka cluster by enabling compression. The post describes the motivation, the work they did to get the Go client to work efficiently, and their experience with different compression libraries. In the end, they were able to decrease network and storage usage by 4.5x.
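
To get a feel for why compressing batches of similar messages pays off, here is a minimal sketch. It is not taken from the Cloudflare post: the message schema is made up, and the standard-library codecs (zlib, gzip, lzma) are stand-ins chosen only to keep the example free of dependencies. Actual ratios depend entirely on your data.

```python
import gzip
import json
import lzma
import zlib


def make_batch(n=500):
    """Build a batch of synthetic JSON log messages (hypothetical schema)."""
    return [
        json.dumps(
            {
                "host": f"edge-{i % 10}",
                "status": 200,
                "path": "/index.html",
                "bytes": 1024 + i,
            }
        ).encode("utf-8")
        for i in range(n)
    ]


def compressed_sizes(messages):
    """Return the byte size of the whole batch, raw and under each codec."""
    raw = b"\n".join(messages)
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw)),
        "gzip": len(gzip.compress(raw)),
        "lzma": len(lzma.compress(raw)),
    }


def ratios(sizes):
    """Convert sizes into 'times smaller than raw' for each codec."""
    return {name: sizes["raw"] / size for name, size in sizes.items() if name != "raw"}


if __name__ == "__main__":
    for name, ratio in ratios(compressed_sizes(make_batch())).items():
        print(f"{name}: {ratio:.1f}x smaller")
```

Compressing the batch as a whole, rather than message by message, is what lets the codec exploit the repeated field names and values across messages.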

This article provides a fantastic survey of the data engineering space, from tools (such as Hadoop and friends) to the roles and responsibilities of a data engineer, online databases and the CAP theorem, and predictions for the future. There's also a good list of common terminology (such as data mart, data lake, OLAP/OLTP) with definitions.

Starting with the high-level concept of event sourcing, this post goes into a few architectural options and then describes how to implement event sourcing with Kafka Streams. There are some good tips, like how to configure standby replicas and how to implement various types of delivery guarantees with Kafka.
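
For readers new to event sourcing, the core idea can be sketched in a few lines of plain Python. This is not the post's Kafka Streams implementation: the `Event` type, the account example, and the `apply_event` and `replay` helpers are all hypothetical names used for illustration. State is never stored directly; it is derived by replaying an append-only log of events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable fact in the log (hypothetical schema)."""
    account: str
    kind: str  # "deposited" or "withdrawn"
    amount: int


def apply_event(balances, event):
    """Return the new state after one event, leaving the old state untouched."""
    if event.kind == "deposited":
        delta = event.amount
    elif event.kind == "withdrawn":
        delta = -event.amount
    else:
        raise ValueError(f"unknown event kind: {event.kind}")
    current = balances.get(event.account, 0)
    return {**balances, event.account: current + delta}


def replay(events):
    """Rebuild the current state by folding over the event log in order."""
    state = {}
    for event in events:
        state = apply_event(state, event)
    return state


if __name__ == "__main__":
    log = [
        Event("alice", "deposited", 100),
        Event("bob", "deposited", 20),
        Event("alice", "withdrawn", 30),
    ]
    print(replay(log))
```

Because the log is the source of truth, replaying a prefix of it reconstructs the state at any earlier point in time.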

An overview of getting Apache Spark and all of its dependencies (e.g. Java 8) installed on a Mac.

This post shows how to package Python code and deploy/run it via an Oozie workflow.

Instagram has been working on implementing a RocksDB storage layer for Apache Cassandra. Compared to the Java implementation, the RocksDB engine reduces the amount of intermediate/garbage data that Cassandra generates. This, in turn, leads to much better tail latency because less time is spent on garbage collection. The post has some good insight into challenges in the implementation, which has been open sourced.

For simple ETL, real-time aggregation, event routing, and similar use cases, Apache Pulsar is adding Pulsar Functions. Inspired by AWS Lambda and Google Cloud Functions, Pulsar Functions use a simple API and the Pulsar cluster for deployment. The post covers the design goals, deployment mechanism, runtime guarantees, and more.

The Kubernetes blog has a look at Apache Spark 2.3's Kubernetes integration. It has some details on the implementation, how to get started, and what some of the plans are for the future of the integration.

This post has some good tips for working with EMR and Spark, like how and why to use EMRFS and advice for sizing a cluster.

A short walkthrough of taking Apache Spark 2.3's new Kubernetes support for a spin. This post has a few additional details, like how to build a Docker image with your custom code, that aren't found in the Kubernetes post above.

There's a new Jepsen post out on Aerospike. For those who aren't familiar, Jepsen is a framework for verifying correctness of distributed systems in the face of failure. The post goes into the details of the Aerospike architecture (such as its gossip and replication system), performance in the face of network partitions, node failure, and clock skew, and makes some recommendations about how to best configure Aerospike.

BoulderDB is a custom distributed database built with RocksDB. MakeMyTrip uses it with Spark Streaming (for data ingestion) and Akka (for serving data out). The post goes into the details of how they've implemented scaling and clustering, how they make use of the lambda architecture, and the scalability of the system.


Big Data Day LA 2018 has opened up the call for speakers. The event takes place in August, and the speaker submissions are open through June 15th.

The Technologist’s Hippocratic Oath is "an optional oath for building ethically considered experiences." If you want to avoid ethically murky areas, the oath is full of good lines.


This week, Cloudera released a new version of its cloud service Altus, MapR announced new Kubernetes support via the Kubernetes Volume Driver, and AtScale announced a new version of its BI Platform. ZDNet has more coverage of these announcements.

Apache Kylin, the OLAP system for big data, has released version 2.3.0. It includes over 260 resolved issues. New features include support for Redshift & SQL Server, and a new metric framework.

Version 0.5.0 of the Apache Hivemall (incubating) project has been released. Hivemall provides UDFs for machine learning on Hive/Spark/Pig.

Kafka Security Manager is a new project for managing Kafka ACLs via an external source of truth, like a configuration file. It also provides notifications for integration with tools like Slack.

Apache Kafka 1.0.1 is out with 49 fixed issues since the 1.0.0 release.

StreamSets Data Protector is a new tool that can be used to obfuscate or remove sensitive (e.g. PII) data before ingestion.

Hortonworks announced the release of Cloudbreak 2.4, their system for running HDP in the cloud. New features include a new CLI tool, support for configuring Kerberos, and support for custom images.

Databricks has added an exciting new feature to make it easier to deploy machine learning models from Apache Spark—the ability to export models for scoring and predictions in non-Spark systems.

Confluent Platform 4.1 is out. The major new feature is the general availability of KSQL, which also had a 0.5 release. Since 0.4, the KSQL team has been focussed on quality and stability improvements.

Apache Flink 1.4.2 is out with a bunch of bug fixes and improvements.

BABAR is a new tool from Criteo for profiling YARN applications. Using an agent, it collects system and JVM-level metrics. There is a processor to output a number of different graphs, including a flame graph of JVM-level function execution.




Curated by Datadog



Introduction to Spark (Denver) - Wednesday, March 14


Building a Streaming Data Platform at HomeAway (Austin) - Tuesday, March 13

New Jersey

Spark Structured Streaming: Hands-On Session, Part 1 (Hamilton) - Thursday, March 15

New York

DAG and the Third Generation of Big Data Stream Processing (New York) - Tuesday, March 13


Big Data and Machine Learning (London) - Tuesday, March 13

Join Us for Our First Kafka Meetup in Leeds (Leeds) - Wednesday, March 14

Building a Real-Time Complex Event Processing Platform with Apache Flink (Manchester) - Wednesday, March 14

Streamy Wednesday: Eventsourcing from Back to Front(end) (London) - Wednesday, March 14


Presto: SQL-on-Anything (Stockholm) - Tuesday, March 13

Experience at Ooyala and Klarna: Apache Kafka (Stockholm) - Thursday, March 15


Pipeline ETL + Kubernetes with Google Cloud (Madrid) - Wednesday, March 14

Integrating Apache Flink with Real-Time NoSQL (Madrid) - Thursday, March 15


Stream Processing: Apache Flink, Kafka Streams (Toulouse) - Friday, March 16


When Not to Use Apache Spark? Data Pipelines with vert.x + RxJava (Berlin) - Tuesday, March 13

4 Talks about Distributed Databases (Munich) - Thursday, March 15


GDPR and Big Data + Spark on Azure Demo (Bucharest) - Tuesday, March 13


Real-Time Sentiment Analysis with NiFi and Zeppelin (Sydney) - Tuesday, March 13