Data Eng Weekly

Hadoop Weekly Issue #228

13 August 2017

Great articles this week on Apache Metron, Apache Airflow, and building an Apache Kafka connector. There are also some good papers from VLDB on Kafka and Apache Flink, and new releases of Apache Hadoop and Apache Drill.


Stellar is a scripting language built into the Apache Metron infosec analytics project. It has a number of functions for common use cases, supports basic data structures, and provides a REPL. The post provides a brief intro to Metron, motivates the need for Stellar, has a brief introduction (with examples) to the core features, and describes how to build custom functions.

Qubole has a post that summarizes a talk from the Data Platforms conference on migrating from a physical data center to the cloud. Lessons learned include the importance of embracing technology and finding the right tool for the right job (especially in the cloud world where vendors offer a number of tools with different strengths). The post also includes a video of the presentation.

Ambari 2.5 has a bunch of new features, including improved log management/search, improved config management, and new monitoring features (including high availability of the metrics collector). These and other new features are detailed on the Hortonworks blog.

While Apache Kafka is typically used as the message bus for Spark Streaming, Spark also supports reading data from Amazon Kinesis. This post looks at how to get started with Kinesis: defining the schema, connecting to a Kinesis stream, and doing some basic processing.

Qubole has a post on configuring a Jupyter notebook to talk to a remote Spark cluster using the Livy Job Server (server side) and the sparkmagic package (installed locally). A few parts of the post are Qubole-specific, but it's generally a good tutorial for doing this type of setup.

This is a good tutorial for getting started with Apache Airflow. It describes how to install Airflow and get set up (including a few gotchas) and walks through defining a DAG and a few tasks.
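For readers new to the DAG idea, here is a minimal sketch of "tasks run in dependency order" in plain Python. It uses the standard library's graphlib (Python 3.9+) rather than Airflow's own API, and the task names and the run_dag helper are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run each task after all of its upstream tasks.

    tasks: dict mapping task name -> zero-argument callable (hypothetical tasks)
    dependencies: dict mapping task name -> set of upstream task names
    Returns the execution order and the list of task results.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = [tasks[name]() for name in order]
    return order, results


# Hypothetical three-step pipeline: extract, then transform, then load.
tasks = {
    "extract": lambda: "extracted",
    "transform": lambda: "transformed",
    "load": lambda: "loaded",
}
dependencies = {
    "transform": {"extract"},  # transform runs after extract
    "load": {"transform"},     # load runs after transform
}

order, results = run_dag(tasks, dependencies)
print(order)
```

A scheduler like Airflow adds scheduling, retries, and monitoring on top of this ordering; the sketch only shows the dependency part.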

The Proceedings of the VLDB Endowment (Very Large Data Bases) has had several great papers over the past year. The August papers were just released, and of particular interest to this group may be the one on a modified "Queryable Kafka" and another on how Apache Flink does state management.

Hadoop and several other tools are inspired by similar systems from Google. Given that papers are published several years into development, Google tends to be far ahead of open-source alternatives. While I don't necessarily agree with the premise of the question in this article, it does have some good points. In particular, we're seeing Google (and others) come full circle on certain design trade-offs (such as removing SQL and then bringing it back).

Great post on the considerations of building a Kafka Connect Source connector. It looks at several of the trade-offs and design challenges in implementing a Kafka connector, including polling vs. change data capture, initial data bootstrapping, extracting keys for topics, and schema evolution.
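To make the polling side of that trade-off concrete, here is a rough sketch in plain Python. It is not the Kafka Connect API: PollingSource, SourceRecord, and the in-memory "table" with an incrementing id column are all hypothetical stand-ins. It shows an initial bootstrap (the first poll reads everything), offset tracking, and key extraction.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SourceRecord:
    """Hypothetical record shape: topic, key, value, and source offset."""
    topic: str
    key: str
    value: Dict[str, Any]
    offset: int


class PollingSource:
    """Hypothetical polling source over a list of rows with an incrementing 'id'."""

    def __init__(self, table: List[Dict[str, Any]], topic: str, key_field: str):
        self.table = table
        self.topic = topic
        self.key_field = key_field
        self.last_offset = 0  # nothing read yet, so the first poll bootstraps

    def poll(self, max_records: int = 100) -> List[SourceRecord]:
        # Select rows with an id beyond the last one we emitted, oldest first.
        new_rows = sorted(
            (row for row in self.table if row["id"] > self.last_offset),
            key=lambda row: row["id"],
        )[:max_records]
        records = [
            SourceRecord(self.topic, str(row[self.key_field]), row, row["id"])
            for row in new_rows
        ]
        if records:
            # Remember how far we got so the next poll only sees newer rows.
            self.last_offset = records[-1].offset
        return records


table = [{"id": 1, "user": "a"}, {"id": 2, "user": "b"}]
source = PollingSource(table, topic="users", key_field="user")
first_batch = source.poll()    # bootstrap: both existing rows
second_batch = source.poll()   # nothing new yet
table.append({"id": 3, "user": "c"})
third_batch = source.poll()    # only the newly inserted row
```

Polling on an incrementing id like this only sees inserts; an update or delete of an existing row goes unnoticed, which is one reason to consider change data capture instead.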


A couple of weeks back, Apache Drill 1.11.0 was released. Major new features include crypto functions, support for PCAP files, and support for network encryption. The release also includes a number of bug fixes and improvements.

Apache Hadoop 2.7.4 was released. It includes over 250 bug fixes and improvements to HDFS, YARN, and MapReduce.


Curated by Datadog



#SD BigData Meetup #22 (San Diego) - Tuesday, August 15

Kafka Connect Implementation at GumGum and More (Santa Monica) - Tuesday, August 15

Ask the Expert: Data Science Workbench (Irvine) - Wednesday, August 16


eBay’s Billion-Item Inventory in the Public Cloud (Bellevue) - Wednesday, August 16

Spark ML at Scale + Building Continuous Applications with SnappyData (Bellevue) - Thursday, August 17


Big Data Utah Meeting @ Overstock (Midvale) - Wednesday, August 16


Apache Nifi Deep Dive (Columbia) - Wednesday, August 16


Sensors, Spark, and Kafka: Applied Machine Learning (West Allis) - Tuesday, August 15

New York

How to Process 50 Billion Monthly Messages with Full Availability & Performance (New York) - Tuesday, August 15


Log Ingestion & Analysis w/ Kafka, Apache NiFi, and Hadoop (Vaughan, ON) - Saturday, August 19


Shanghai Apache Spark (Shanghai) - Saturday, August 19
