Data Eng Weekly

Hadoop Weekly Issue #206

26 February 2017

Lots of releases this week, including the Apache Kafka release and maintenance releases of Apache NiFi and Apache Cassandra. Kafka is also the subject of two technical posts: one on using Kafka Connect to stream data from an FTP server, and the other on change data capture.


The Landoop blog has a post and sample code showing how to load data from files on an FTP server into Kafka using Kafka Connect. They have example implementations for XML (using the Irradiance Solar Data set), CSV files, and binary compressed files. The demo also showcases open-source Kafka tools built by Landoop, including web UIs for the Confluent schema registry and Kafka Connect.

WePay has an article about their change data capture solution for MySQL, which uses Debezium to stream data to Kafka. WePay is on the Google Cloud Platform, so the MySQL instances run in Google Cloud SQL, and from Kafka the data is loaded into BigQuery. The post goes into the finer operational details, including how to add a new database to Debezium/Kafka, how they make use of the global transaction IDs added in MySQL 5.6, and what the streaming data that comes out of Debezium looks like.
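To make the idea of a change data capture stream more concrete, here is a minimal sketch of how a consumer might replay change events into a current view of a table. It assumes events shaped like the envelope described in Debezium's documentation (before, after, op, ts_ms, with op codes c/u/d/r); the exact fields may differ by version, and the table, columns, and values below are made up for illustration rather than taken from the WePay post.

```python
import json

# Hypothetical change events in a Debezium-style envelope.
# The column names and values are invented for this sketch.
RAW_EVENTS = [
    '{"before": null, "after": {"id": 1, "email": "a@example.com"}, "op": "c", "ts_ms": 1488000000000}',
    '{"before": null, "after": {"id": 2, "email": "b@example.com"}, "op": "c", "ts_ms": 1488000001000}',
    '{"before": {"id": 1, "email": "a@example.com"}, "after": {"id": 1, "email": "a2@example.com"}, "op": "u", "ts_ms": 1488000002000}',
    '{"before": {"id": 2, "email": "b@example.com"}, "after": null, "op": "d", "ts_ms": 1488000003000}',
]


def apply_change(table, event, key="id"):
    """Apply one change event to an in-memory table keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row in "after".
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Deletes carry the old row in "before" and no "after".
        table.pop(event["before"][key], None)
    else:
        raise ValueError("unknown op code: %r" % op)
    return table


def replay(raw_events):
    """Rebuild the current table state by applying events in order."""
    table = {}
    for raw in raw_events:
        apply_change(table, json.loads(raw))
    return table


if __name__ == "__main__":
    print(replay(RAW_EVENTS))
```

In a real pipeline the events would be read from a Kafka topic rather than a list, and ordering per key would matter; this sketch only shows the replay logic.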

This presentation has a great overview of best practices and anti-patterns when it comes to Hadoop. There's a good graphical representation of small/medium/big data (and when Hadoop/Spark become appropriate), several alternatives to Hadoop for small/medium data, and a few slides on the cost/benefit of big data systems (e.g., you won't have data integrity in a Hadoop cluster, and you will likely need a data infra team of 4-5 people to run a cluster, but you'll see the advantages of data centralization, fault tolerance, and programmatic data access).

As the central source of truth for metadata about your data, it's quite important for the Hive Metastore to be up to date. Previously, this could be a challenge in a cloud environment in which there are multiple transient clusters that come and go unpredictably. Recently, Hadoop and CDH added support for a persistent Hive Metastore that lives independent of any one cluster. This post has some basics of configuring the metastore and the list of gotchas/assumptions to keep in mind.

This post describes APIs and tools for working with nested data in Spark and Spark SQL. It covers things like how to extract fields out of a nested struct and how to convert a JSON string to a struct on which normal operations can be performed.
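The two operations mentioned above can be illustrated without a cluster. The following is a plain-Python analogy, not the Spark API: it follows a dotted path through a nested record and parses a JSON string field into a nested structure so that the same path lookup works on it. In Spark the corresponding tools are dotted column paths and functions such as from_json; the helper names and the sample row here are hypothetical.

```python
import json


def get_path(record, path):
    """Follow a dotted path such as 'user.address.city' through nested dicts.

    Returns None if any step of the path is missing.
    """
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def parse_json_field(record, field):
    """Return a copy of the record with the JSON string in `field` parsed."""
    parsed = dict(record)
    parsed[field] = json.loads(record[field])
    return parsed


# Hypothetical row: one nested struct and one field holding a JSON string.
ROW = {
    "id": 7,
    "user": {"name": "ada", "address": {"city": "Berlin"}},
    "payload": '{"clicks": 3, "tags": ["a", "b"]}',
}

if __name__ == "__main__":
    print(get_path(ROW, "user.address.city"))
    structured = parse_json_field(ROW, "payload")
    print(get_path(structured, "payload.clicks"))
```

Once the string field has been parsed, it behaves like any other nested struct, which is the point the post makes about converting JSON strings before operating on them.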

The AWS Big Data blog has an overview of building a complex application (using 10 different AWS services from Amazon EMR to AWS CodePipeline) for data analysis, search and discovery, and more. The tutorial uses data from the Police Data Initiative.


trivup is a tool for programmatically building / tearing down a Kafka cluster. It supports Kafka's SSL authentication and encryption for client applications.

Version 0.2.0 of Apache Arrow, the in-memory columnar data format, has been released. This version has been validated to ensure compatibility across the Java and C++/Python implementations. The release also contains a new streaming binary format and better interoperability with Python pandas.

Apache Kafka was released. This is a significant release containing over 500 pull requests from over 100 authors. Major features of the release include a new Kafka Streams API for session windows, improved compatibility for Java clients, improved semantics for Kafka Streams joins, and single message transforms in Kafka Connect.

Apache Bahir 2.10 was released. Bahir is an add-on library for Apache Spark that brings support for Akka streaming, MQTT streaming, Twitter streaming, and ZeroMQ streaming.

Apache NiFi 0.7.2 and 1.1.2 were both released this week. Both releases include a fix for a NullPointerException.

Apache Cassandra announced patch releases for several major versions: 3.0.11, 2.2.9, and 2.1.17. There are quite a few improvements/fixes in each one (e.g. the 3.0.11 release resolves over 60 tickets).


Curated by Datadog



Graph Talk w/ Uber Graph Team and Netflix (Palo Alto) - Thursday, March 2

Streaming with MapR and StreamSets Data Collector (San Jose) - Thursday, March 2


Bio-Manufacturing Optimization Using Apache NiFi, Kafka, and Spark (Tempe) - Wednesday, March 1


Real-Time Ingestion & Event Processing with Apache NiFi (Saint Louis) - Wednesday, March 1


Building Streaming Data Applications Using Kafka (Chicago) - Thursday, March 2


MongoDB and Reactive Streams with Kafka (Miami) - Tuesday, February 28


Big Data Journey: An Introduction to SQL on Hadoop (Reston) - Monday, February 27


Introduction to MapR (Toronto) - Thursday, March 2


Big Data and Machine Learning (London) - Tuesday, February 28


Stream Analytics, BI, and DevOps with Hadoop (Issy-les-Moulineaux) - Tuesday, February 28


Apache Flink's Stateful Operators and Table SQL API (Amsterdam) - Thursday, March 2


Data Wrangling & Spark^2 (Berlin) - Tuesday, February 28

Apache Kafka & Event Sourcing (Bonn) - Thursday, March 2


Lambda Architecture with Spark at Farmeron (Zagreb) - Wednesday, March 1


Stream All the Things with Lightbend (Athens) - Tuesday, February 28


March Clustering: Hadoop and Big Data (Jakarta) - Saturday, March 4


Spark 101 (Brisbane) - Tuesday, February 28


Learning to Scale and Learning from Scale (Christchurch) - Tuesday, February 28