Data Eng Weekly

Hadoop Weekly Issue #225

22 July 2017

Apache Hive and Apache Hadoop both had releases this week, and there are a number of great articles on Presto, Airflow, Kafka, and more. There are a few more vendor posts than normal, so please keep sending me links to help keep content diverse!


Good read on one team's path to adopting Presto. They describe several ways that they evaluated Presto (and other options), including performance, security, and testing with users (their analysts). There are also some takeaways based on their production rollout.

The recently released Hue 4 has a new interface and some impressive new features like autocomplete for SQL (including proposing joins based on popularity) and a data wiki/governance tool.

Recent versions of Apache Kafka have added backwards and forwards compatibility for Kafka clients. This post describes how compatibility was added to the protocol, how to enable the feature, and what to expect when using a client that is a different version from the brokers.

Apache Airflow is gaining popularity as a workflow engine for big data, so it's no surprise that the Databricks team has built an integration. This is a good overview of what it looks like to get started with Airflow (including key concepts), using the Databricks integration as the example.

An interesting look at a real-world use case, this post describes how Rabobank has moved their customer alerts system from the mainframe to Apache Kafka (built with a multi-data-center deployment and Kafka Streams). The code, which is just under 75 lines of Java, is concise and easy to follow. The new system generates alerts in under 5 seconds, versus minutes or hours on the old one.

Amazon EMR has supported using S3 as the data store for HBase clusters for a while now, and with the latest release (more details below), HBase Read Replicas can be created using HBase data in S3. This tutorial gives a walkthrough of HBase Read Replicas using Amazon EMR.

This post has a good overview of an architecture that supports real-time processing of sensor data to detect anomalies. It uses StreamSets for ingesting, H2O (there's sample code on GitHub) for building a model, and Spark Streaming for evaluating the model at scale.

This post describes many of the major features of Apache Kafka including its ordering/durability guarantees, key compaction, and schema evolution. In addition, there are diagrams to explain each of the major concepts/features.

Amazon Redshift Spectrum is a service for querying data in S3 using the same underlying engine as classic Redshift. This post gives details on Spectrum's architecture and its advantages (including eliminating ETL and allowing compute and storage to scale independently).

Qubole has written about some of the challenges (and fixes) they've come across as they try to run Apache Airflow at scale.


DataEngConf NYC, which takes place on October 30 and 31, has announced the Call for Proposals. Talks are 30 minutes, and the CFP closes on September 1st.

Dremio, a force behind the Apache Arrow in-memory format, has announced its first product, a BI query engine that sits as a gateway between a BI tool and backend stores (including RDBMS, Hive, HBase, S3, and more).


Amazon EMR 5.7.0 was released. The major new feature in this release is the ability to create a read-replica HBase cluster when the cluster is storing data in S3. Other updates include support for Apache Flink 1.3.0 and new versions of Apache Zeppelin and Apache Phoenix.

Cask has announced support for Microsoft Azure in addition to Amazon Web Services for running their CDAP Cloud Sandbox.

Implyr is a new package that enables querying of Impala from R using dplyr. This post has a brief introduction to the integration.

Apache Hive 2.3.0 was announced. It's a large release, with many bug fixes and improvements and a few new features (including support for set operations and materialized views).

Apache Hadoop 2.8.1 was released with critical security fixes atop the 2.8.0 release. The 2.8.2 release, which will have more fixes, is currently targeted for August.


Curated by Datadog



Self-Service Data Integration, Kodiak Data, and Kubernetes! (Palo Alto) - Wednesday, July 19


Sensors, Spark, and Kafka: Applied Machine Learning (Minnetonka) - Wednesday, July 19


Data Streaming Panel Event (Chicago) - Thursday, July 20

North Carolina

SQL on Hadoop and Modern Analytic Databases with Ian Cook (Chapel Hill) - Tuesday, July 18


DATA Rave Party: Spark Edition (Sao Paulo) - Thursday, July 20


Applied Data Engineering #1 (London) - Wednesday, July 19


The State of Spark and Hive in the Cloud, by Nico Poggi (Barcelona) - Thursday, July 20


July Meetup: Scala, Spark, Docker, Elasticsearch (Bucharest) - Thursday, July 20


Big Data Processing on AWS (Tel Aviv-Yafo) - Tuesday, July 18

Kafka Operations (Herzeliyya) - Wednesday, July 19
