Data Eng Weekly

Hadoop Weekly Issue #239

05 November 2017

Apache Kafka hit the 1.0.0 milestone this week. Congrats to all those involved! Whether in celebration or as a consequence, there are a bunch of Kafka-related posts this week covering applications built with Kafka Streams, operational considerations for Kafka, the differences between the Kafka Streams DSL and Processor API, and debugging a production Kafka issue. There are also great posts on structured logging, Scio, and Python UDFs for Spark.


When it comes time to analyze data, it's important to have well-structured and enriched source data. Structured logging, in which metadata and a schema are applied to events, is a good way to ensure high-quality data. This post looks at how to build an API for structured logging in just a few lines of code.
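As a rough illustration of the idea (not the code from the linked post), a structured logging helper might validate each event against a schema and wrap it with metadata before writing it as JSON. The names `PAGE_VIEW_SCHEMA`, `make_event`, and `log_event` are hypothetical, and so is the choice of metadata fields.

```python
import json
import time

# Hypothetical schema: the fields every "page_view" event must carry, with types.
PAGE_VIEW_SCHEMA = {"user_id": int, "url": str}


def make_event(event_type, schema, payload, clock=time.time):
    """Check the payload against the schema, then wrap it with metadata."""
    missing = set(schema) - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field, expected in schema.items():
        if not isinstance(payload[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    # Metadata (event type and timestamp) lives next to the validated payload.
    return {"event_type": event_type, "timestamp": clock(), "payload": dict(payload)}


def log_event(event, sink):
    """Write one event per line as JSON so downstream jobs can parse it."""
    sink.write(json.dumps(event, sort_keys=True) + "\n")
```

A caller would build an event with `make_event("page_view", PAGE_VIEW_SCHEMA, {"user_id": 7, "url": "/home"})` and pass it to `log_event` along with any file-like object, such as `sys.stdout`.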

Part two of the blog post series on Scio covers the core features of the API, additional syntactic sugar (such as support for side-input joins), and support for type-safe BigQuery (via a macro that performs a dry-run query to generate a case class for output rows). The post also describes some of the ways it is used at Spotify—ad hoc queries (by over 200 people), a tool called BigDiffy (for pairwise comparisons), and another tool, Featran, for feature transformations.

This post demonstrates how to use Kafka for a non-trivial stream processing application. Specifically, data is pulled from an HTTP endpoint and inserted into Kafka using the Producer API. From there, a Kafka Streams application performs fraud detection (with a stubbed-out method in this example) and computes long-term and 90-day aggregated statistics. Finally, Kafka Connect writes data to a PostgreSQL database for serving via a REST API.

The MSDN blog has a four-part series focused on learning to use Azure for folks who are already familiar with Hadoop. It teaches the basics like access management, networking, storage, compute, and the Azure Data Services (including HDFS-compatible PaaS offerings).

Zalando has a guest post on the Confluent blog about several operational/SRE considerations for Kafka. These include the choice of EBS volume type on AWS (RocksDB makes lots of writes, and some volume types have a burst capacity budget that is important to monitor), tuning some defaults, monitoring consumer lag and memory usage, and optimizing instance memory (with the counter-intuitive realization that a smaller JVM heap is better, given that Kafka uses lots of off-heap memory).
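For readers new to consumer lag: it is usually defined per partition as the log end offset minus the consumer group's last committed offset. The sketch below shows only that arithmetic on plain dictionaries. The function names are hypothetical, the offsets would in practice come from a Kafka client or monitoring tool, and treating a missing commit as offset 0 is a simplifying assumption rather than Kafka's actual behavior.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two readings were taken at different times.
        lag[partition] = max(end - committed, 0)
    return lag


def partitions_over_threshold(lag, threshold):
    """List the partitions whose lag exceeds an alerting threshold."""
    return sorted(p for p, n in lag.items() if n > threshold)
```

With end offsets `{0: 100, 1: 50}` and committed offsets `{0: 90}`, partition 0 has a lag of 10 and partition 1 a lag of 50, so only partition 1 would exceed a threshold of 20.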

Via an integration using Apache Arrow, Spark 2.3 will include support for so-called vectorized user-defined functions (UDFs). Performance is greatly improved when a UDF is annotated with @pandas_udf and consumes/returns a pandas.Series. There are several example UDFs, including simple addition, cumulative probability, and OLS, and the post has a microbenchmark that shows 3x-100x speedups.
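A minimal sketch of the simple-addition case, assuming a PySpark installation (2.3 or later) with pandas and pyarrow available. The function receives a whole pandas.Series per batch rather than one value at a time. The names `plus_one` and `plus_one_udf` are illustrative, and the import is guarded so the plain function can still be used where PySpark is not installed.

```python
def plus_one(v):
    # When Spark calls this as a vectorized UDF, v is a pandas.Series holding
    # a batch of rows, so the addition is applied to the whole batch at once.
    return v + 1


try:
    from pyspark.sql.functions import pandas_udf

    # Wrap the plain function as a vectorized UDF that returns doubles.
    plus_one_udf = pandas_udf(plus_one, returnType="double")
except ImportError:
    # PySpark (or its pandas/pyarrow dependencies) is not installed.
    plus_one_udf = None
```

Given a hypothetical DataFrame `df` with a numeric column `v`, the UDF would be applied with `df.withColumn("v_plus_one", plus_one_udf(df.v))`.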

The team at Honeycomb has written about a recent outage they had with their Kafka cluster. It's interesting to see all the metrics and graphs that they have to debug and to understand some of the new instrumentation that they plan to add.

With an example of stream processing of page view and event data, this post describes how to implement a non-trivial application using both the Kafka Streams DSL and the underlying Processor API. In this example (based on a real-world use case), the Processor API version is significantly more efficient for only a bit more code.


The Magpie talk show podcast has an interview with Confluent CEO Jay Kreps.

The Call for Papers for Kafka Summit London, which takes place April 23-24, has been announced. Submissions are due December 1st.

A CVE for Apache Hive has been announced. Users are encouraged to upgrade to version 2.3.1, but the announcement also describes mitigations for lower versions (2.1.0 through 2.3.0 are affected).

The Insight blog has a good post on some of the data-related talks, advice, and themes from this year's Grace Hopper.


This week, Apache Kafka hit version 1.0.0. The new release includes performance improvements, faster TLS, Java 9 support, and more. The Apache Software Foundation has coverage of the milestone, and the Confluent blog describes several of the improvements in recent versions leading up to version 1.0.0.

Version 2.6.3 of HDP has been released. It's the first release to support the Hortonworks DataPlane Service, and there are several new versions of packages (Spark, Zeppelin, Livy, Druid, Atlas, Knox, Ambari, SmartSense, and Ranger).

Qubole has released a Qubole Jump Start to run the Qubole Data Service on the Oracle Cloud Infrastructure.

A new project providing JDBC support for Confluent KSQL has been announced. The initial 0.1.0 release was made earlier this week.


Curated by Datadog



Big Data Security, Apache Pulsar, and More! (Palo Alto) - Wednesday, November 8

Women in Big Data Apache Spark Meetup (San Francisco) - Wednesday, November 8

Hands-On Workshop: Apache Pulsar (Santa Clara) - Saturday, November 11


Big Data Solutions in Azure (Chicago) - Monday, November 6

Cliff Gilmore on KSQL (Chicago) - Wednesday, November 8


Apache Spark Saturday Hands-On Workshop and Lecture (Madison) - Saturday, November 11


Spark with Tim Hunter (Villeneuve d’Ascq) - Thursday, November 9


Kafka Meetup (Berlin) - Tuesday, November 7

Kafka Use Cases in Sberbank & Kafka Multi–Data Center Cluster Topology (Moscow) - Thursday, November 9


Scale Up! Meetup (Bangalore) - Monday, November 6

Productionalizing Spark ML Pipelines (Bangalore) - Saturday, November 11


Stateful Stream Processing Explained with Apache Flink (Taipei) - Monday, November 6