Data Eng Weekly

Data Eng Weekly Issue #256

18 March 2018

Lots of technical posts on Apache Kafka and stream processing this week. There are also great articles on Apache Airflow at Pandora, Apache Spark, Presto, and Apache Hive for data warehousing. Apache Flink had a release, and the call for proposals for Strata NY ends in a couple of days.


Cut HBase storage needs by 50%; elevate job throughput by 260%. ScaleFlux Computational Storage cohesively optimizes your Hadoop HBase infrastructure. Significant performance leaps on Spark, RocksDB, and Aerospike as well. Now shipping - see more:


Configuring nodes in a Hadoop/Spark cluster with non-JVM dependencies has always required jumping through some hoops. If you're trying to figure out the right hoops to jump through for running PySpark with pandas and other python libraries, this post outlines a solution and provides some sample code.
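A common approach along these lines is to pack up a Python environment and ship it alongside the job. A minimal sketch of building such a spark-submit invocation (the archive name, script name, and environment alias are hypothetical placeholders, not from the post):

```python
# Sketch: distributing a packed Python environment (e.g. conda/virtualenv)
# with a PySpark job on YARN. "pandas_env.zip" and "job.py" are placeholders.

def build_spark_submit(env_archive: str, script: str) -> list:
    """Build a spark-submit command that ships a packed environment to
    executors and points PySpark at the bundled interpreter."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Ship the packed environment; the '#env' suffix exposes it as ./env
        "--archives", f"{env_archive}#env",
        # Use the shipped interpreter on the application master and executors
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python",
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=./env/bin/python",
        script,
    ]

cmd = build_spark_submit("pandas_env.zip", "job.py")
print(" ".join(cmd))
```

The key idea is that non-JVM dependencies like pandas travel with the job rather than being pre-installed on every node.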

Streaming ETL provides a number of advantages over batch—and the tools are getting better and better every month. This post describes many of the advantages, including improved latency and the ability to more easily send data to multiple downstream systems.

This article summarizes a number of design decisions that Netflix has made for their Keystone stream processing system. It's full of tips for operating Kafka at scale, including the practice of having a separate "fronting" Kafka cluster to scale producers and consumers separately, how Netflix sizes clusters and topics, and how they handle cluster upgrades.

GO-JEK, an Indonesian tech startup, has written about the tech stack they use to process 6 billion events/day. Built on Apache Kafka, Apache Flink, Kubernetes, Chef, and InfluxDB (to name a few), the infrastructure supports 18 product lines. The post also talks about how they've implemented monitoring and resiliency tests.

An initial look at migrating from Apache Impala to Presto, this post describes some of the architectural differences (e.g. metadata caching, statistics) and Presto's support for Apache Parquet files.

The monitoring system Prometheus has a Java agent to expose JMX metrics over HTTP. Banzai Cloud has a post that details how to configure it with Apache Kafka. They've also updated their post on integrating Spark and Prometheus—there's now a Spark package to pull in the Prometheus metrics sink.

A good overview of the key components of Apache Kafka security—TLS, certificates, SASL, and ACLs. There are also a number of links to dig deeper into each of the topics.
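As a rough illustration of how those pieces fit together on the client side, a consumer or producer configured for SASL authentication over TLS uses properties along these lines (the keys are standard Kafka client configuration; the paths, username, and passwords are placeholders):

```python
# Sketch of Kafka client security properties. The property names are the
# standard Kafka client config keys; file paths and credentials below are
# illustrative placeholders only.
client_props = {
    # Encrypt traffic with TLS and authenticate via SASL
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-256",
    "sasl.jaas.config": (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        'username="alice" password="alice-secret";'
    ),
    # Truststore holding the CA certificate that signed the broker certs
    "ssl.truststore.location": "/etc/kafka/client.truststore.jks",
    "ssl.truststore.password": "changeit",
}

for key, value in client_props.items():
    print(f"{key}={value}")
```

ACLs are then enforced broker-side against the authenticated principal (here, `alice`).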

The Databricks blog has an overview of stream-stream joins, a new feature of Apache Spark 2.3. They demonstrate the various concepts using the example of joining streams of ad impressions and clicks.
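The mechanics behind that example can be sketched in a few lines: buffer impression state, match clicks that arrive within the join window, and use a watermark to evict state that can no longer join. This is a toy illustration of the concept, not Spark's implementation:

```python
# Toy sketch of a windowed stream-stream join (the idea behind the
# impression/click example): buffer impressions, join clicks that arrive
# within MAX_DELAY seconds, and evict state older than the watermark.

MAX_DELAY = 10  # join window in seconds (illustrative)

impressions = {}  # ad_id -> impression timestamp (buffered state)
joined = []

def on_impression(ad_id, ts):
    impressions[ad_id] = ts

def on_click(ad_id, ts):
    imp_ts = impressions.get(ad_id)
    # Join only if the click falls inside the allowed window
    if imp_ts is not None and 0 <= ts - imp_ts <= MAX_DELAY:
        joined.append((ad_id, imp_ts, ts))

def advance_watermark(watermark_ts):
    # State cleanup: impressions too old to ever join are dropped
    for ad_id in [a for a, t in impressions.items() if t < watermark_ts - MAX_DELAY]:
        del impressions[ad_id]

on_impression("ad-1", 100)
on_impression("ad-2", 101)
on_click("ad-1", 105)      # within the window -> joined
on_click("ad-2", 150)      # too late -> dropped
advance_watermark(150)
print(joined)              # [('ad-1', 100, 105)]
print(impressions)         # old state evicted -> {}
```

The watermark is what keeps the buffered state bounded, which is the crux of making stream-stream joins practical.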

New Relic has posted the first two parts in their series on Apache Kafka. Part one covers how they use Kafka for streaming applications, to snapshot the state of microservices, and for distributing work using the disruptor pattern. Part two has a good overview of Kafka partitioning and describes how New Relic orchestrates Kafka with Marathon.
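The core of Kafka partitioning is that a record's key deterministically picks its partition, which preserves per-key ordering. A toy stand-in (Kafka's Java client actually hashes keys with murmur2; crc32 here is just an available substitute):

```python
import zlib

# Toy illustration of Kafka-style key partitioning: records with the same
# key always land in the same partition, preserving per-key ordering.
# (Kafka's default partitioner uses murmur2; crc32 is a stand-in here.)

NUM_PARTITIONS = 6

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    return zlib.crc32(key) % num_partitions

# Same key -> same partition, every time
assert partition_for(b"user-42") == partition_for(b"user-42")

placement = {k: partition_for(k) for k in [b"user-1", b"user-2", b"user-3"]}
print(placement)
```

This is also why changing the partition count of a topic reshuffles where keys land, a recurring operational gotcha.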

Pandora has written about their migration to Apache Airflow for workflow scheduling and how they operate it within the company. They provision Airflow instances on demand using Ansible, have integrated with Supervisor, Consul, Prometheus, and Grafana for monitoring, and make use of Airflow's LDAP authentication. The post also describes some of the issues that they have run into with it, some of which led to them maintaining their own fork.

Apache Flink made a lot of progress towards their next release in February, including changes to the deployment model (which will improve deployments on Docker/Kubernetes), a new SQL CLI, task-local state recovery to speed up restores, and support for rescaling jobs.

A good intro to Spark checkpointing, how it compares to caching, the trade-offs of enabling checkpointing, and when you may want to use it.

This post provides an overview of the optimizations that Apache Kafka implements to achieve high throughput.
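One of those optimizations is producer-side batching: messages are buffered and sent as a batch once either a size threshold (`batch.size`) or a time threshold (`linger.ms`) is reached, amortizing per-request overhead. A toy buffer illustrating the idea (the parameter names mirror Kafka's, but this is not the real client):

```python
import time

# Toy sketch of producer-side batching, one of the throughput techniques
# Kafka uses. Messages flush as a batch on a size threshold (batch.size)
# or a time threshold (linger.ms). Illustration only, not the real client.

class BatchingBuffer:
    def __init__(self, batch_size=4, linger_s=0.05, send=print):
        self.batch_size = batch_size
        self.linger_s = linger_s
        self.send = send          # callback standing in for a network send
        self.buf = []
        self.first_append = None

    def append(self, msg):
        if not self.buf:
            self.first_append = time.monotonic()
        self.buf.append(msg)
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.buf) >= self.batch_size
        lingered = self.buf and time.monotonic() - self.first_append >= self.linger_s
        if full or lingered:
            self.send(list(self.buf))  # one request carries many messages
            self.buf.clear()

batches = []
buf = BatchingBuffer(batch_size=3, linger_s=60, send=batches.append)
for i in range(7):
    buf.append(f"msg-{i}")
print(batches)   # two full batches of 3; msg-6 is still waiting in the buffer
```

Other techniques the post covers, like sequential log writes and zero-copy transfer, live below this layer.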

A thorough overview of the options (and their tradeoffs) for streaming data from an RDBMS into Apache Kafka.
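The simplest of those options is query-based polling with an incrementing key, the approach JDBC-source connectors take: each poll fetches only rows past the last offset seen. A sketch using SQLite as the stand-in database (table and column names are illustrative; log-based CDC tools avoid this polling entirely by reading the database's change log):

```python
import sqlite3

# Sketch of incremental polling from an RDBMS, the approach taken by
# JDBC-source connectors: track the highest id seen and fetch only newer
# rows on each poll. SQLite stands in for the source database.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",)])

last_id = 0

def poll(conn):
    """Fetch rows added since the last poll; a real pipeline would
    produce each row to a Kafka topic here."""
    global last_id
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    if rows:
        last_id = rows[-1][0]   # advance the stored offset
    return rows

print(poll(conn))   # [(1, 'a'), (2, 'b')]
conn.execute("INSERT INTO orders (item) VALUES ('c')")
print(poll(conn))   # only the new row: [(3, 'c')]
```

The trade-offs the post walks through (missed updates and deletes, polling latency, load on the source) all follow from this query-shaped view of the data.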

DataXu has the latest in a series on lessons learned in data warehousing. In this post, they talk about how to enable various ACID semantics through deliberate usage of metadata stored in the Hive Metastore. For example, you can end a long, multi-table transaction by repointing the tables to new datasets with Hive's alter table ... set location command. The post also talks about best practices with Spark SQL and for schema evolution.
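The repointing trick can be pictured as a metastore that maps table names to data locations, with a commit being nothing more than a swap of those pointers. A toy sketch (the table names and paths are hypothetical):

```python
# Toy sketch of the metadata-repointing idea: the metastore maps each table
# to a data location, and a multi-table "transaction" commits by swapping
# those pointers (as with Hive's ALTER TABLE ... SET LOCATION). Readers
# before the swap see the old datasets; readers after it see the new ones.

metastore = {
    "clicks": "s3://warehouse/clicks/v1",
    "impressions": "s3://warehouse/impressions/v1",
}

def commit(new_locations):
    """Commit by repointing tables at freshly written datasets.
    Writing the new data happened out of band; the commit itself
    is only this metadata update."""
    metastore.update(new_locations)

commit({
    "clicks": "s3://warehouse/clicks/v2",
    "impressions": "s3://warehouse/impressions/v2",
})
print(metastore["clicks"])  # s3://warehouse/clicks/v2
```

Because readers resolve locations through the metastore, the old datasets can be left in place until no query depends on them.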

The Landoop blog has a couple of guest posts from WalmartLabs on integrating Apache Kafka with Apache Cassandra. It focuses on sending data from Cassandra to Kafka using Kafka Connect. The first part describes the basics of the connector and the second part describes optimizations.


There's a week left to apply to the Insight Data Engineering Fellows program. It's a 7-week training fellowship in Boston, New York, or Silicon Valley.

Strata goes to NYC in September, and the Call for Proposals is open until end of day Tuesday.


Apache Flink 1.3.3 is out. It includes a few bug fixes (including one for checkpoints) and an improvement.

Airflow Plugins is maintaining a list of plugins related to Apache Airflow.




Curated by Datadog



Big Data, San Francisco v 4.0 (San Francisco) - Tuesday, March 20

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Wednesday, March 21

Data Operations: A New Discipline for Data in Motion (Sunnyvale) - Wednesday, March 21

How to Use Java 8 Streams to Access Existing Data with Ultra-Low Latency (Santa Clara) - Wednesday, March 21


Cleveland Big Data Meetup (Mayfield Village) - Monday, March 19


How to Share State across Multiple Spark Jobs Using Apache Ignite (Roswell) - Tuesday, March 20


High-Speed Data Visualization: Kafka Meets Elasticsearch (Ashburn) - Wednesday, March 21

Kafka: Streaming Data Query Using KSQL (Richmond) - Thursday, March 22


Microservices and Kafka with Node.js (Medellin) - Thursday, March 22


Node.js with Kafka and GraphQL (London) - Tuesday, March 20


Schibsted Data Journey (Barcelona) - Thursday, March 22


Building Event-Driven Microservices with Apache Kafka and Kafka Streams (Paris) - Wednesday, March 21


Apache Kafka by Alex Piermatteo & KSQL by Kai Waehner (Munich) - Wednesday, March 21


"Classic" ETL Approach in Hadoop Ecosystem (Prague) - Thursday, March 22


Elasticsearch Best Practises for Performance and Scale (Oslo) - Thursday, March 22


The Move from Micro-Service to Event-Sourcing Architecture (Ramat Gan) - Wednesday, March 21