Data Eng Weekly

Hadoop Weekly Issue #247

14 January 2018

It was a busy week with lots of releases, including new versions of Apache NiFi and Lenses and two new open-source projects related to Kafka. In technical posts, there's coverage of the performance implications of Meltdown patching, Samza, Kafka, Spark with Kubernetes, Apache Flink, and StreamSets.


Three posts cover the impact of patching for Meltdown. First, AppOptics has written about their experiences with the effects of the Meltdown patches in AWS. They run Kafka and Cassandra clusters, and they experienced some counterintuitive changes as a result of the changes in CPU efficiency. Second, The Last Pickle has a post on how Meltdown impacts Cassandra latency, and third, Databricks has written about the impact they've seen on Spark on AWS.

This post describes several of the new features of the recently released Samza 0.14.0, including the new SQL support that's built with Apache Calcite.

The MapR blog has a two-part post on using Apache Kafka and Apache Spark (streams and ML APIs) to build a real-time flight delay prediction application. The post includes code on GitHub and an Apache Zeppelin notebook.

In a follow-up to their post on running Spark on Kubernetes, the Banzai team has added instructions for deploying Apache Zeppelin inside of a k8s cluster. They have published an image to Docker Hub and sample configs to GitHub to make the process easy.

The Google Cloud Platform blog argues that you should use a strongly consistent database whenever possible, because it makes application and business logic easier to implement. They give a high-level overview of Google Cloud Spanner, which provides "external consistency" guarantees, and include a comparison to multi-master replication and a brief intro to Cloud Spanner's TrueTime.

This tutorial describes how to run TiDB, a MySQL-compatible and Google F1/Spanner-inspired distributed database, on Kubernetes.

If you've been looking to try out Apache Pulsar (incubating) to see what all the hype is about, there are some Terraform and Ansible scripts to easily spin up a cluster in AWS. After a few setup steps, the automation will build a six-node, fully-configured cluster.

The data Artisans blog has a post with tips for sizing an Apache Flink (or really, any distributed computing application) cluster by estimating disk and network throughput. It walks through a practical example and the related formulas to make these estimations for a five-node cluster.
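
As a rough illustration of this kind of back-of-the-envelope sizing, the sketch below estimates per-machine network throughput from a message rate and a message size. The message rate, the message size, the helper name `per_machine_throughput_mb_s`, and the even-distribution assumption are all hypothetical placeholders rather than figures or formulas taken from the Flink sizing post; only the five-node cluster size comes from the summary above.

```python
def per_machine_throughput_mb_s(messages_per_sec, message_size_bytes, num_machines):
    """Average incoming throughput per machine in MB/s.

    Assumes the load is spread evenly across machines (an illustrative
    simplification, not a guarantee for real workloads).
    """
    if num_machines <= 0:
        raise ValueError("num_machines must be positive")
    total_bytes_per_sec = messages_per_sec * message_size_bytes
    return total_bytes_per_sec / num_machines / 1_000_000


# Hypothetical workload figures, chosen only to make the arithmetic easy.
MESSAGES_PER_SEC = 1_000_000
MESSAGE_SIZE_BYTES = 2_000
NUM_MACHINES = 5  # five-node cluster, as in the summary above

incoming = per_machine_throughput_mb_s(
    MESSAGES_PER_SEC, MESSAGE_SIZE_BYTES, NUM_MACHINES
)

# Assumption: a keyed shuffle with uniformly distributed keys sends roughly
# (N - 1) / N of each machine's data to the other machines.
shuffle_out = incoming * (NUM_MACHINES - 1) / NUM_MACHINES

print(f"incoming per machine: {incoming:.0f} MB/s")
print(f"shuffle out per machine: {shuffle_out:.0f} MB/s")
```

Comparing numbers like these against the network and disk bandwidth of the intended machine type gives a first sanity check before any real benchmarking.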

This post describes how to use StreamSets to grab data from the Twitter API and copy it to a local file system for analysis.


Trafodion, which implements transactional SQL on Hadoop/HBase, has graduated from the Apache Incubator to become a top-level project.

InfoQ has published a new eMag on Streaming Architectures. It sits behind an email wall, and the content is from contributors employed by Google, Confluent, AWS, and others. It clocks in at over 30 pages, and there's quite a bit of good content about Beam, Kafka, DynamoDB, Flink, and more.

Hortonworks CEO Rob Bearden has recapped 2017, looking at product releases and major partnerships.

Kafka Summit 2018 takes place in London in April. The schedule has been announced—there are four keynotes (including one by Martin Fowler) and 30 sessions featuring speakers from many different types of companies. Early bird registration is available through January 26th.

DZone has an article, based on a survey of over 20 companies, that provides a good overview of types (and specific examples) of use cases across the big data ecosystem. The use cases range from analytics to real-time processing to machine learning.


Apache NiFi 1.5.0 was released, with improved support for Apache Kafka (processors for Kafka 1.0), integration with Apache Atlas for lineage, improvements to Kerberos handling, integration with the NiFi Registry to version/manage flow definitions, and more.

There's a new reactive Scala client for Apache Pulsar, pulsar4s.

Databricks has announced general availability of their Databricks Cache feature. It leverages SSDs and columnar compression to improve performance.

ShiftLeft has open-sourced a fork of Apache TinkerGraph, which improves memory usage and implements strict schema validation. The announcement describes the major memory optimizations—by adding a schema, object definitions can reduce overhead by replacing generic key-value pairs (which use lots of memory for HashMap$Node and other requirements).

Kafka Webview is a web-based consumer for Kafka clusters. It can do various consumer tasks (such as seeking to particular offsets and using custom deserializers). There's a Docker image on Docker Hub, making it really easy to try out.

Version 1.1 of the Lenses streaming platform for Apache Kafka was released. New features include a new topology visualizer, Kubernetes support (including for scaling up/down of Lenses SQL Processors), a ReduxJS web application library, improvements to Lenses SQL, and improved LDAP integration.

The first release of Strimzi, which is a set of images and configuration templates for deploying Apache Kafka on Kubernetes/OpenShift, was announced.

A trio of security vulnerabilities in Apache Geode were disclosed. If you're running a version prior to 1.3.0, it's time to upgrade.

AWS' ETL-as-a-service, AWS Glue, has announced support for Scala as a scripting language. This post has an example of using it for some non-trivial ETL.

MapR has announced a new data governance tool that provides lineage tracking (and visualization) across data sets.


Curated by Datadog


Airflow, Streaming, and More (San Francisco) - Wednesday, January 17

Replicating Data from MapR-DB with StreamSets Data Collector (Santa Clara) - Wednesday, January 17


Stream Processing with Flink at Alibaba and OfferUp (Bellevue) - Wednesday, January 17

Seattle Apache Kafka Meetup (Bellevue) - Thursday, January 18


Hadoop 101 (West Des Moines) - Thursday, January 18


Event-Driven Architecture Using Apache Kafka (Chicago) - Wednesday, January 17


Hortonworks Data Flow + A Tidy Text Analysis of the Simpsons in R (Green Bay) - Tuesday, January 16


Real-Time Stream Processing with Apache Storm (Roswell) - Tuesday, January 16


Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas, and Pancakes (Vancouver) - Tuesday, January 16


Streaming Analytics Made Easy: Hortonworks DataFlow and Druid (Stockholm) - Thursday, January 18


First Stream Processing Meetup (Barcelona) - Thursday, January 18


Apache Spark Hands-On Workshop (Nurnberg) - Monday, January 15

First Meeting in Karlsruhe (Karlsruhe) - Monday, January 15


Online Incremental Learning on Streams (Krakow) - Tuesday, January 16

Developing Kafka Streams Applications (Warsaw) - Tuesday, January 16


Spark v2.2 Workshop (Bucharest) - Friday, January 19