Data Eng Weekly

Hadoop Weekly Issue #237

22 October 2017

The theme of this week is certainly stream processing—Spotify's Scio, a couple of posts on streaming delivery semantics, KSQL, Kafka Streams, and more. The "Stream All the Things!" presentation nicely summarizes the theory and practice of the (Kafka-based) event sourcing pattern that has been a recurring topic over the past year or so. In releases, John Snow Labs has open-sourced a natural language processing library for Apache Spark.


If you're using PySpark, you've probably wanted to combine it with Pandas or other python libraries. This post describes why this is a bit of a challenge and provides some code to convert data between numpy types and PySpark-compatible types (and vice-versa) for implementing custom user defined functions. It's an in-depth article that also explains some of the internals of PySpark.

Spotify uses a tool they've open-sourced, called Scio, for many of their big data jobs. This post looks at the evolution of their data infrastructure from Python MapReduce jobs with an on-prem Hadoop cluster to usage of Google BigQuery for most ad-hoc queries. The post motivates this move by describing the advantages of the Beam Model that Scio uses, describes why they've written a Scala API atop of Apache Beam, and more.

This post from Streamlio looks at delivery semantics in stream processing. For exactly once (or effectively once) delivery, it describes two strategies—distributed snapshot/state checkpoint and at-least-once with deduplication.

On the topic of exactly once, this post describes how to implement deduplication with Apache Spark's Structured Streaming. In addition to deduplication based on watermark, you can also add custom logic for a stateful aggregation (using mapGroupsWithState).

Building a machine learning model that adapts in real time to new information has long been a end-goal of many ML pipelines. Kafka Streams makes this relatively easy by using the same code for offline and online training. This post walks through building out a real time evaluation and training pipeline with flight arrival data as an example.

The Rittman Mead blog has a good intro to KSQL for Apache Kafka. The post walks through an example of aggregating Twitter data with SQL and performing a join across streams. There are several diagrams to help communicate KSQL semantics.

The Confluent blog has a guest post from Zalando on how they use Apache Kafka to index and rank information from fashion websites. The system uses the Hyperlink Induced Topic Search (HITS) algorithm and is built on Kafka Streams. The post describes both HITS and how they use Kafka Streams on AWS.

This presentation covers a lot of ground on building a streaming-centric application and data platform. First, it motivates this event sourcing architecture by demonstrating the simplicity it brings to connecting services together. Next, it discusses the trade-offs of various streaming/messaging systems (Akka, Spark, Flink, and others) for realtime and analytics use cases. Towards the end, there are two Venn diagrams that nicely sum up the benefits of the new architecture—microservices and fast data have a lot more overlap in architecture and concerns vs. the previous world of services with big data.


Forbes has an interview with Confluent co-founder and CTO Neha Narkhede. In it, she talks about her experiences before and in industry, the growth of Kafka at LinkedIn, the founding of Confluent, and more.

MapR has announced a managed service for the MapR Converged Data Platform.

The team at Spotify has interesting idea for solving data quality and reliability issues. They have gamified test coverage and quality by introducing the Test Certified for Data-Focused teams program.

Slate has an article reflecting on how the term big data is no longer widely used. There are several interesting observations on how business and algorithmic assumptions related to data have changed over the past few years.

Amazon Redshift has announced support for a new family of nodes (dc2) that are the same price but provide up to twice the performance as the previous generation.


Version 1.20.0-incubating of Apache Pulsar has been released. Major new features include end-to-end encryption, support for event time, deduplication of messages, and more.

The John Snow Labs NLP library is a new open source framework for natural language processing on Apache Spark. The post describes it in more detail, including how it compliments Spark's ML libraries and provides better performance than other solutions.


Curated by Datadog ( )



Tensorflow on Apache Hadoop YARN (San Francisco) - Wednesday, October 18

Apache Ignite: Building Consistent, HA Distributed Systems & Memory-Centric SQL (San Ramon) - Thursday, October 19

The Evolving Landscape of Data Engineering & How Systems Fail (San Francisco) - Thursday, October 19


Spark MLlib (Salt Lake City) - Monday, October 16


KSQL and Data Management with Kafka (Minneapolis) - Wednesday, October 18


Leveraging Messaging Platforms Such as Kafka for Real-Time Streaming Transaction (Atlanta) - Thursday, October 19

North Carolina

Real-World Deployments with Apache Kafka (Raleigh) - Tuesday, October 17


Spring for Apache Kafka (Richmond) - Wednesday, October 18

New York

Exactly-Once Semantics in Apache Kafka (New York) - Monday, October 16

IRELAND Let's Talk Kafka Streams (Dublin) - Thursday, October 19


Stream SQL and Real-Time Applications with Apache Flink (London) - Wednesday, October 18


Big Data Analytics for Small- and Medium-Size Enterprises (Oslo) - Tuesday, October 17


Helsinki Apache Kafka Meetup October (Helsinki) - Wednesday, October 18


Stream SQL and Real-Time Applications with Apache Flink (Neuilly-Sur-Seine) - Thursday, October 19


Stream Processing and War Stories (Berlin) - Thursday, October 19


Apache Kafka Streams: Building Distributed, Fault-Tolerant Processing Apps (Tel Aviv-Yafo) - Wednesday, October 18