Data Eng Weekly

Data Eng Weekly Issue #250

04 February 2018

One last quick reminder that Hadoop Weekly is now Data Eng Weekly—new name but the same great content.

In this week's issue, there are some great technical posts, including ones on change data capture at MyHeritage, support for long running services in YARN, Salesforce's Platform Events, and the Kubernetes scheduler for Spark. There are also several new releases, including a new version of Hortonworks DataFlow and Confluent's KSQL.


Hello Fresh: Change the way people eat forever. Work with our data technology to deliver healthy meals to millions of customers, with a cutting-edge tech stack (Hadoop, Kafka, Impala, pyspark, AWS, Airflow) and time for personal and engineering development. Click the link for more info on becoming a Data Engineer at Hello Fresh.


MyHeritage has written about how they enrich database change events captured using the Maxwell Change Data Capture tool with request details. Making use of a blackhole MySQL table, they log request ids alongside query thread ids, which can later be used to link the request and CDC data together in a streaming application.
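
As a rough illustration of that linking step, here is a minimal sketch in plain Python. It assumes each change event carries a `thread_id` field and that rows written to the blackhole table show up in the stream as ordinary events. The table name `request_log` and the field names are hypothetical, not taken from the MyHeritage post.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical name of the blackhole table that receives request-id marker rows.
REQUEST_LOG_TABLE = "request_log"


def enrich_with_request_ids(events: Iterable[dict]) -> Iterator[dict]:
    """Attach the most recent request id seen on the same thread to each change event."""
    request_by_thread: Dict[int, str] = {}
    for event in events:
        thread_id = event["thread_id"]
        if event["table"] == REQUEST_LOG_TABLE:
            # Marker row: remember which request this thread is currently serving.
            request_by_thread[thread_id] = event["data"]["request_id"]
            continue
        # Real change event: emit a copy with the request id added (None if unknown).
        yield {**event, "request_id": request_by_thread.get(thread_id)}


events = [
    {"table": "request_log", "thread_id": 7, "data": {"request_id": "req-1"}},
    {"table": "users", "thread_id": 7, "data": {"id": 42}},
    {"table": "users", "thread_id": 9, "data": {"id": 43}},
]
enriched = list(enrich_with_request_ids(events))
```

A real streaming application would keep this per-thread state in a state store and expire old entries. The in-memory dictionary here only shows the join logic.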

This post provides practical tips for writing testable, configurable, and deployable Spark applications. For example, by separating IO from main logic, you can easily write unit tests and set up a CI build.
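
The IO-separation tip can be sketched without a Spark dependency. The plain-Python example below is an assumption-laden illustration, not code from the post: `count_by_key` and `run_job` are hypothetical names, and in a real Spark job the pure function would take and return DataFrames instead of lists and dicts.

```python
from typing import Callable, Iterable, List


def count_by_key(rows: Iterable[dict], key: str) -> dict:
    """Pure logic: no file, network, or cluster access, so it is easy to unit test."""
    counts: dict = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


def run_job(read: Callable[[], Iterable[dict]], write: Callable[[dict], None], key: str) -> None:
    """Thin IO shell: reader and writer are injected, so tests can pass in-memory fakes."""
    write(count_by_key(read(), key))


# In a test, in-memory stand-ins replace the real source and sink.
captured: List[dict] = []
run_job(
    read=lambda: [{"country": "DE"}, {"country": "US"}, {"country": "DE"}],
    write=captured.append,
    key="country",
)
```

Because the logic never touches storage directly, the same function can run in a CI build with no cluster available.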

YARN is adding support for long-lived services in an upcoming release. While Hadoop is well behind Kubernetes when it comes to container deployment, YARN does have some compelling use cases such as Hive and its LLAP. There are a lot more details in the post, but, in short, YARN's new service framework supports Docker and deploying apps via a RESTful JSON API.

Qubole has built a tool called git-patch for tracking and incorporating changes from an upstream open source project. Teams (like ones I've been on before) that end up forking open-source streaming/batch processing projects (e.g. to backport a patch or otherwise customize) may find it useful.

The folks at Wallaroo have written about their new Kafka client written in the Pony language. They describe how they came to the decision of writing their own client, how long it's taken to get a basic client going, performance characteristics of the new client, and the upcoming roadmap.

This post describes how to use the Apache Kafka-based Landoop stack to analyze message data arriving over MQTT using Kafka Streams and Landoop Lenses SQL, and ultimately to land the data in InfluxDB.

Apache Spot (incubating) is used to analyze network data to detect infosec threats. This post provides a good overview of the architecture, which is built on Apache Kafka (for ingestion), Apache Spark (for ingestion and ML analysis), Apache Hadoop (for ingestion and storage), and more. Spot also uses the Python Watchdog library, which detects changes on the file system and fires off events.

Apache Spark has a Kubernetes scheduler for deploying Spark executors. This post describes both the fixed-number of executors scheduler as well as the dynamic scheduler, which can shift the number of executors based on changes in needs as calculated by an ExecutorAllocationManager.
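
To give a feel for the dynamic case, here is a simplified sketch of how a target executor count might be derived from the task backlog. It is an approximation under stated assumptions, not Spark's actual implementation: the function and parameter names are hypothetical, and the real ExecutorAllocationManager also ramps up gradually and releases idle executors after a timeout.

```python
import math


def target_executors(
    pending_tasks: int,
    running_tasks: int,
    tasks_per_executor: int,
    min_executors: int,
    max_executors: int,
) -> int:
    """Executors needed to run all outstanding tasks at once, clamped to the configured bounds."""
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


# 16 outstanding tasks at 4 tasks per executor calls for 4 executors.
busy = target_executors(10, 6, 4, min_executors=1, max_executors=10)
# With no work, the count falls back to the configured minimum.
idle = target_executors(0, 0, 4, min_executors=1, max_executors=10)
```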

Netflix develops a lot of features based on experiments (e.g. A/B testing) and ML pipelines. To support this, they need reliable workflows for training and deploying models. This presentation covers their workflow system, Meson, for automating these pipelines. Meson has a Scala DSL, a Python DSL, and a REST API (which also powers a web UI) for defining and executing DAG workflows. Meson has some nice features, like immutable versioning and canary workflows. If you don't have time for the whole talk, there are some slides in PDF, too.

In another post from Wallaroo, they describe the improvements to their Python API for stream processing that make it more idiomatic. The post details the basics of application setup and both stateful and stateless streaming applications.

Salesforce's Platform Events are built on Apache Kafka, Apache Avro, and a custom metadata engine. With them, customers can subscribe to events through a mechanism inspired by Kafka. This post looks a bit under the hood, including how Kafka helps with multitenancy.

AWS has a post on how to hook up a few of their services to give your EMR clusters a friendly DNS name (using CloudWatch + Lambda + Route 53).


Snowflake Computing, a cloud data warehousing company, has raised a $263 million Series E.

The conference agenda (including talks and speakers) for Flink Forward, which takes place on April 9-10 in San Francisco, has been released.

Although they've been around a while, these "Principles of Chaos" came across my radar this week. There are several good tips for how to run and operate a distributed system and a community focussed on this topic.

The Data Engineering Podcast has been covering a lot of great topics recently. This week, they discuss Apache Pulsar.


Version 3.1 of Hortonworks DataFlow (HDF) was released. It's based on Apache NiFi 1.5, so it brings new features like the NiFi Registry, the C++ MiNiFi agent, and lots of new integrations. For stream processing, it adds support for Apache Kafka 1.0 along with integrating Kafka with Ambari and Ranger.

Confluent announced the January release of Confluent KSQL. It includes KSQL shell enhancements and new aggregate functions (TOPK and TOPKDISTINCT).

Version 0.1.7 of Jepsen, a tool for testing the correctness of distributed data systems in the face of failure, has been released. It includes some bug fixes, changes, and new features.

Banzai has announced the 0.2.0 release of their Pipeline software for deploying Kafka, Spark, Zeppelin, and microservices on cloud providers. The new release adds support for Microsoft Azure AKS and several other new features.

Spring Cloud Data Flow 1.3 was released. Changes include a new top-level project for Kafka Streams integration, updates to the Dashboard UI, a new Java DSL for defining streams, Kubernetes integration, and more.

Apache Sqoop 1.4.7 is out with a number of bug fixes, improvements, and new features from over 100 JIRA tickets.

Lighthouse is a new project aimed at building a data lake with Apache Spark. It's in early days, without too much documentation, but there's a good high-level overview in the slides below.


Curated by Datadog



Bay Area Apache Spark Meetup @ Workday (San Mateo) - Thursday, February 8

Data @ Uber (Santa Clara) - Saturday, February 10


Event Sourcing from Scratch with Apache Kafka and Spring (Olathe) - Wednesday, February 7


The Spark for Real-Time Decision Making (Saint Louis) - Wednesday, February 7


Introduction to Apache Kafka (Brentwood) - Tuesday, February 6


Real-Time Event Processing with Apache Flink (Nieuwegein) - Wednesday, February 7


Streaming Database Changes with Debezium (Hamburg) - Thursday, February 8


Monitor Page Splits in SQL Server + Learn about Apache Kafka (Ra'anana) - Monday, February 5

Data Integration & Processing Solutions on Google Cloud (Tel Aviv-Yafo) - Tuesday, February 6

Fiverr: Lessons Learned, Design and Architecture Advice (Tel Aviv-Yafo) - Wednesday, February 7


Data Engineering Meetup (Sydney) - Thursday, February 8