Data Eng Weekly

Hadoop Weekly Issue #231

04 September 2017

Quite a few releases this week, including several from Kafka Summit, which took place this past week in San Francisco. There are also great articles in this issue covering Kafka, Spark, Pulsar, Flink, and more.


This post describes how to visualize Kafka consumer offsets by exporting them over HTTP and using Prometheus to send them to Grafana.
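As a rough illustration of the export step, the sketch below computes per-partition consumer lag from log-end and committed offsets and renders it in the Prometheus text exposition format, which is what a scrape endpoint would serve over HTTP. The metric name `kafka_consumer_lag`, the label names, and the sample offsets are assumptions made for this example and are not taken from the post.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return lag per (topic, partition): log-end offset minus committed offset."""
    lag = {}
    for topic_partition, end in end_offsets.items():
        # Assumption: a partition with no commit yet is treated as offset 0.
        committed = committed_offsets.get(topic_partition, 0)
        lag[topic_partition] = max(end - committed, 0)
    return lag


def to_prometheus_text(group, lag):
    """Render lag values in the Prometheus text exposition format."""
    lines = [
        "# HELP kafka_consumer_lag Messages between log end and committed offset.",
        "# TYPE kafka_consumer_lag gauge",
    ]
    for (topic, partition), value in sorted(lag.items()):
        lines.append(
            f'kafka_consumer_lag{{group="{group}",topic="{topic}",'
            f'partition="{partition}"}} {value}'
        )
    return "\n".join(lines) + "\n"


# Hypothetical sample offsets for a consumer group named "billing".
end_offsets = {("orders", 0): 120, ("orders", 1): 95}
committed_offsets = {("orders", 0): 100, ("orders", 1): 95}

lag = consumer_lag(end_offsets, committed_offsets)
metrics_text = to_prometheus_text("billing", lag)
```

Prometheus would scrape text like this on an interval, and Grafana would then query Prometheus to chart the lag per partition.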

Evaluating an offline-trained model in a real-time setting can be tricky. One popular solution is PMML, which has some limitations but generally works well for a certain set of use cases. The Red Ventures Data Science & Engineering team has written about their experience with MLeap, a new alternative to PMML with built-in support for Apache Spark.

Confluent has announced a new project, called KSQL, for running SQL queries across a Kafka cluster. KSQL differentiates between Streams (sequences of facts) and Tables (materialized streams) in its query capabilities, and it offers rich support for windowing. KSQL is in preview, and the code is open sourced under the Apache License.
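To make the stream/table distinction concrete, here is a minimal sketch in plain Python (not KSQL syntax). It assumes that a table holds the latest value per key of a stream and that windowing means fixed, non-overlapping (tumbling) windows. The function names and sample data are hypothetical.

```python
from collections import Counter


def materialize(stream):
    """Fold a stream of (key, value) facts into a table of the latest value per key."""
    table = {}
    for key, value in stream:
        table[key] = value  # later facts overwrite earlier ones for the same key
    return table


def tumbling_window_counts(events, window_seconds):
    """Count (timestamp, key) events per (window_start, key) in fixed windows."""
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical stream of location updates: each tuple is one immutable fact.
location_updates = [("alice", "SF"), ("bob", "NYC"), ("alice", "LA")]
current_location = materialize(location_updates)

# Hypothetical page views as (timestamp in seconds, user).
page_views = [(0, "alice"), (30, "alice"), (61, "bob"), (75, "alice")]
views_per_minute = tumbling_window_counts(page_views, 60)
```

The stream keeps every fact, while the table keeps only the current state per key. The windowed count sits in between: it is a table keyed by both window and user.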

In part two of their series on Apache Pulsar (incubating), the Streamlio blog describes several important components of Pulsar including its I/O isolation, scalability, security model, multi-language (C++, Java, and Python) API, and operational maturity. Since Pulsar was in use at Yahoo, it has several enterprise features.

PySpark is getting better support for MLlib algorithms. To enable this, there is a new PySpark implementation of the persistence framework that was previously Scala-only. This post has more details on the solution and how it fits into a machine learning pipeline.

The Databricks blog also has a great overview of Apache Spark 2.2's Cost-Based Optimizer. The post describes the statistics that feed into the optimizer (including how to collect them), examples of types of optimizations it's able to perform, and some experimental results comparing query time with and without the CBO on the TPC-DS benchmark.
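The general idea of a cost-based optimizer can be sketched with a toy example: column statistics give a selectivity estimate for a filter, and the estimated row counts then decide which side of a hash join to build the hash table on. The formulas below are the common textbook estimates under a uniform-distribution assumption, not Spark's actual implementation. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Per-column statistics of the kind an optimizer collects ahead of time."""
    distinct_count: int
    min_value: float
    max_value: float


def equality_selectivity(stats):
    """Estimated fraction of rows matching `col = value`, assuming uniform values."""
    return 1.0 / stats.distinct_count


def less_than_selectivity(stats, value):
    """Estimated fraction of rows matching `col < value`, assuming uniform values."""
    if value <= stats.min_value:
        return 0.0
    if value >= stats.max_value:
        return 1.0
    return (value - stats.min_value) / (stats.max_value - stats.min_value)


def choose_build_side(left_rows, right_rows):
    """Build the hash table on the side estimated to have fewer rows."""
    return "left" if left_rows <= right_rows else "right"


# Hypothetical tables: orders (left) filtered on amount < 10, joined to customers (right).
orders_rows = 1_000_000
amount_stats = ColumnStats(distinct_count=5000, min_value=0.0, max_value=1000.0)
customers_rows = 50_000

filtered_orders = orders_rows * less_than_selectivity(amount_stats, 10.0)
build_side = choose_build_side(filtered_orders, customers_rows)
```

Without statistics, a planner sees only the raw size of orders and would build on customers. With the filter estimate, the filtered orders side is the smaller input.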

This post provides a brief introduction to running an Apache Flink streaming job on Mesos using DC/OS.

The AWS blog has a tutorial on using AWS CodePipeline, AWS Service Catalog, and AWS CodeBuild to run a CI/CD setup for an Apache Spark job.

This post shows how to use AWS IoT with Amazon Kinesis and Apache Spark to build a streaming IoT application. It includes the necessary AWS configs as well as sample Spark code.


The new O'Reilly book Streaming Systems, written by several Google software engineers, is in early release. Google Cloud Platform has a Q&A with one of the authors, Tyler Akidau.

Flink Forward is just a week away. It takes place in Berlin September 11-13. Among others, Netflix, Alibaba, and ING are presenting.


Apache Samza, the stream processing system, announced version 0.13.1. The release includes a few enhancements and bug fixes covering 29 JIRA tickets.

Microsoft Azure has announced availability of the Hortonworks Cloudbreak service for provisioning Hortonworks Data Platform clusters. A single Cloudbreak Controller VM can manage multiple clusters and automatically configure both Kerberos and Apache Knox to secure the cluster. Cloudbreak is available via the Azure Marketplace.

Cask has announced version 4.3 of CDAP. There is a lengthy overview of the new features, which cover data preparation, ETL, Apache Ranger integration, and Spark DataFrame support.

MapR has announced the new MapR Orbit Cloud Suite which provides cross-cloud functionality (combinations of public and private clouds), object-tiering (which can offload certain data to cloud object storage), and cloud-native management (provisioning of VMs in AWS and Microsoft Azure).

StreamSets has added new support for Microsoft Azure, in addition to fixes and other improvements.

Kafka Lenses is a new suite of enterprise tools for Kafka. It includes a web inspector for Kafka SQL, Kafka Connectors, and more. It also provides operational insights into Kafka clusters, such as exploring partition offsets and managing consumers.

Given that Kafka got its start at LinkedIn, it's no surprise that they have a lot of clusters and some great tooling for those clusters. This week, they open sourced Cruise Control, which is a system for monitoring and tuning Kafka clusters. For example, it will replace a cluster when it fails, decommission brokers, and keep clusters in balance with respect to disk, network, and CPU. The introductory blog post describes design goals and future work.

On the heels of its graduation from the Apache Incubator, Apache MADlib has released version 1.12. The new release of the machine learning on SQL library adds a number of graph algorithms, includes improvements to the decision tree and random forest implementations, and has better support for summary and sketch calculations.

Apache Atlas 0.8.1 was released.


Curated by Datadog



Kafka Streams: Are Your Streams Keeping Up? Monitoring for a Streaming World (San Francisco) - Tuesday, September 5

Bay Area Apache Spark Meetup (Santa Clara) - Thursday, September 7

Apache Spark and Apache Ignite: Where Fast Data Meets the IoT (Santa Clara) - Saturday, September 9


Back to School: Hadoop 101 (Saint Louis) - Wednesday, September 6


Code, Beer, Repeat v3.0: Kafka and Xamarin (Sao Paulo) - Tuesday, September 5


Jay Kreps and Kai Waehner Talk about Kafka (Munich) - Thursday, September 7

Apache Flink Meetup (Berlin) - Thursday, September 7


Hadoop User Group Meetup (Vienna) - Wednesday, September 6


New Generation Integration with NiFi, Kylo + Spark SQL Internals (Krakow) - Wednesday, September 6


Discussion and Open Space for Emerging Big Data and Analytics Technologies (Singapore) - Friday, September 8