Data Eng Weekly

Hadoop Weekly Issue #243

10 December 2017

Lots of great technical content in this week's issue—three posts on data systems and Kubernetes, a good overview of Apache Hadoop & Kerberos, an overview of the architecture of Apache Pulsar (incubating), a look at the new Wallaroo stream processing framework, and more.


For security, Hadoop uses delegation tokens alongside Kerberos to allow distributed programs to execute on behalf of a user (and thus access data that the user has permission to read). This post looks at how this is implemented for YARN and MapReduce tasks, some common errors, the delegation token lifecycle, and more.
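The lifecycle the post covers can be sketched as follows. This is a conceptual simulation, not Hadoop's implementation: the class and method names are hypothetical, though the intervals mirror Hadoop's defaults (tokens are renewed within a renew interval and cannot live past a max lifetime, after which a new token must be obtained).

```python
# Conceptual sketch of a delegation-token lifecycle (illustrative, not Hadoop's code).
RENEW_INTERVAL = 24 * 3600      # seconds: token must be renewed this often
MAX_LIFETIME = 7 * 24 * 3600    # seconds: token can never be renewed past this

class DelegationToken:
    def __init__(self, issued_at):
        self.issued_at = issued_at
        self.expires_at = issued_at + RENEW_INTERVAL
        self.max_date = issued_at + MAX_LIFETIME

    def renew(self, now):
        """Extend the expiry by one renew interval, capped at the max lifetime."""
        if now > self.expires_at:
            raise ValueError("token already expired; a new one must be obtained")
        self.expires_at = min(now + RENEW_INTERVAL, self.max_date)

    def is_valid(self, now):
        return now <= self.expires_at

token = DelegationToken(issued_at=0)
token.renew(23 * 3600)            # renewed in time, so the token stays valid
assert token.is_valid(40 * 3600)
```

Long-running jobs rely on a renewer doing this periodically; once the max lifetime passes, renewal fails and the job must re-authenticate.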

Hue 4.2 added improved support for Impala. The SQL editor now supports computing table stats, viewing the query profile, and visualizing the output of the query planner.

As Kubernetes seems to be gaining lots of traction for container orchestration, it's pretty natural to try to run Spark jobs with it. This first post describes how to do so, and it covers some shortcomings of the current implementation. The second looks at how to then integrate with Apache Zeppelin, which has a few gotchas.

This deck briefly provides an overview of Kafka Streams, shows how it integrates with Spring Cloud Stream, and describes how to use Kubernetes StatefulSets to do stateful stream processing. More details about running Kafka via Kubernetes/OpenShift are provided in the README of the barnabas GitHub repo.

This post describes how to use Cloud Dataflow Pipeline templates to periodically ingest data from a tweet stream. The job is triggered via App Engine cron, and the data is available for query in BigQuery once it's processed.

While Apache Pulsar (incubating) shares some similarities with Apache Kafka, it has a different architecture. Namely, it has stateless brokers and a separate storage layer of bookies (storage servers from Apache BookKeeper). Data is stored as segments, which allows the storage layer to scale out without rebalancing existing data. The post describes this architecture and compares it with Kafka. Also of note: Pulsar provides a Kafka-compatible API, which aims to provide drop-in compatibility.
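The scale-out claim can be illustrated with a toy model. This is a simulation in the spirit of segment-oriented storage, not Pulsar's actual placement algorithm: each new segment is written to the next storage node in rotation, so adding a node changes the placement of future segments only, and no existing data has to move.

```python
from itertools import cycle

def place_segments(num_segments, bookies, placement=None):
    """Assign new segment IDs to bookies round-robin, preserving prior placements.

    Toy model of segment-oriented storage (not Pulsar's real placement logic).
    """
    placement = dict(placement or {})
    rotation = cycle(bookies)
    start = len(placement)
    for seg in range(start, start + num_segments):
        placement[seg] = next(rotation)
    return placement

placement = place_segments(4, ["bookie-1", "bookie-2"])
# Scale out: add bookie-3. Old segments keep their location; only new ones land there.
grown = place_segments(3, ["bookie-1", "bookie-2", "bookie-3"], placement)
assert all(grown[s] == placement[s] for s in placement)  # no rebalancing needed
```

Contrast this with partition-oriented storage, where growing the cluster typically means copying whole partitions to the new nodes.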

On the topic of Kubernetes (k8s), this post describes some of the challenges of adopting k8s for the Hadoop stack. Fundamentally, k8s targets stateless applications (although there is some new support for volumes and StatefulSets), which makes it a challenging fit for HDFS and other applications that store data. With that said, the BlueData team has built a prototype for indirectly deploying Hadoop with k8s via their EPIC software.

This post demonstrates how to use StreamSets to infer a schema for a CSV file, to convert certain field types to non-string values, and to extract the schema & data from the resulting Avro data file.

Wallaroo is an up-and-coming stream processing framework with first-class support for non-JVM languages. This post looks at the Python Pipeline API, describes how Wallaroo partitions data, and describes its mechanism for stateful processing. The post uses a stock market (pricing and order) example to illustrate these pieces.

The Slack engineering blog tells the story of scaling their job queuing system. It was originally built on Redis task queues, which had some architectural drawbacks (particularly around memory). Rather than a big-bang switchover, they put Kafka in front of Redis to absorb some of the write throughput. In the process they implemented two new services in Go (one fronting Kafka and one relaying to Redis), which are described in detail in the post.

This post uses the Hail framework for analyzing genomics data. The output of that analysis is stored in S3 and, with some help from AWS Glue, is then available for querying via Amazon Athena.

This post describes how to ensure that data is deleted from Kafka: write a record with a null value (a tombstone) for the key, and ensure that log compaction (which performs the cleanup behind the scenes) runs periodically.
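The compaction semantics behind this can be sketched in a few lines. This is a simulation of the behavior, not the Kafka API: a compacted log keeps only the latest value per key, and a key whose latest value is null is dropped entirely once compaction runs.

```python
def compact(log):
    """Simulate Kafka log compaction: keep the latest value per key,
    and drop keys whose latest record is a tombstone (null value)."""
    latest = {}
    for key, value in log:          # later records override earlier ones
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user-1", "alice@example.com"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example.com"),  # an update for user-1
    ("user-2", None),                     # tombstone: request deletion of user-2
]
state = compact(log)
assert state == {"user-1": "alice@new.example.com"}
```

In practice this means producing a record with the key and a null value to a topic configured with `cleanup.policy=compact`; the tombstone itself is retained for `delete.retention.ms` before being removed, so consumers have a chance to observe the deletion.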


Databricks is renaming Spark Summit to be Spark + AI Summit in 2018.

Two podcasts of note this week: Roaring Elephant features folks from Streamlio on Apache Pulsar, and Software Engineering Daily has Martin Kleppman on conflict-free replicated data types.


Version 0.3.0 of Apache MiNiFi C++ was released. It's still not considered ready for production, but the new version has new features, including support for writing directly to Kafka.

Druid 0.11.0 was released. Major new features include TLS support, a Redis cache extension, various improvements to Druid SQL, and GroupBy performance improvements.


Curated by Datadog



Using Open Source Data Platforms to Deliver New Insights (Denver) - Wednesday, December 13


Reactive Applications with the SMACK Stack (Kanata) - Thursday, December 14

Airflow, Big Data, and Data Science (Dublin) - Thursday, December 14


Processing Streaming Data at a Large Scale with Kafka (Lisbon) - Monday, December 11


Hadoop Live (Madrid) - Tuesday, December 12


Streaming Architectures (Paris) - Tuesday, December 12

Paris Fast Data (Paris) - Tuesday, December 12


Aggregating Online Experiments Data & Comparing Hadoop Cloud Offerings (Amsterdam) - Tuesday, December 12

CodeBreakfast: Apache Airflow Edition (Amsterdam) - Friday, December 15


2nd Sydney Data Engineering Meetup (Sydney) - Thursday, December 14


Azure Stream Analytics (Wellington) - Thursday, December 14


Scala Taiwan #21: Introduction to Apache Kafka (Taipei) - Wednesday, December 13