Data Eng Weekly

Hadoop Weekly Issue #211

02 April 2017

Lots of posts this week on stream processing—frameworks are maturing and more and more people are writing about applications. There are also a number of releases this week, including the 2.8.0 Apache Hadoop release. In news, there's a new Hadoop online training course, Cloudera has filed for their IPO, and more.


This article describes stream processing (including concepts like event time, state, and windowing from the Apache Beam model) and gives a brief overview of Apache Spark Streaming, Apache Flink, and Apache Kafka Streams.
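As a rough illustration of event time and keyed state (this sketch is not taken from the linked article), the snippet below counts events per key in fixed event-time windows. The function name `count_by_event_time`, the tuple layout, and the 60-second window size are assumptions made for this example. Real frameworks also deal with watermarks and late data, which this omits.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, chosen only for illustration


def count_by_event_time(events, window_seconds=WINDOW_SECONDS):
    """Count events per (key, window start) using each event's own timestamp.

    `events` is an iterable of (key, event_time_seconds) tuples in arrival
    order, which may differ from event-time order.
    """
    state = defaultdict(int)  # keyed state: (key, window_start) -> count
    for key, event_time in events:
        # Align the event's timestamp to the start of its window.
        window_start = event_time - (event_time % window_seconds)
        state[(key, window_start)] += 1
    return dict(state)


# The second event arrives out of order but is still counted in the
# window that starts at 0, because assignment uses event time.
arrivals = [("user-a", 65), ("user-a", 10), ("user-b", 59), ("user-a", 70)]
counts = count_by_event_time(arrivals)
```

Because the window is derived from the event's timestamp rather than from the arrival time, the result does not depend on the order in which events show up.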

The Cask Data Application Platform (CDAP) has support for running Python scripts both as part of a processing pipeline and via MapReduce or Spark. This post provides an overview of how the functionality is built into CDAP.

This tutorial on the Cloudera blog shows how to build a search and analytics system using open-source tools from the Hadoop ecosystem. Specifically, data is ingested via Flume's syslog source, sent to a Kafka cluster, stored in HBase for archival, and also written to Solr and OpenTSDB via a Spark Streaming job. HUE and Grafana are used for visualization.

The Confluent blog has a guest post that describes how to secure the Confluent Schema Registry (exposing it over HTTPS) and configure it to connect to secure ZooKeeper and Kafka clusters (including setting up the per-topic ACLs required for the Schema Registry's data storage).

The Apache Flink blog has a post on Flink's Table APIs. Built on Apache Calcite, the Table APIs support both batch and streaming use cases. For the latter, Flink has out-of-the-box support for sliding, tumbling, and session windows (each supports both event time and processing time). The post also looks at Flink 1.2's "User-defined Table Functions" feature.
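To make the difference between tumbling and sliding windows concrete, here is a minimal sketch in plain Python. It does not use Flink's Table API. The function names and the convention of half-open `[start, end)` windows over non-negative integer timestamps are assumptions for this example.

```python
def tumbling_window(timestamp, size):
    """Return the single [start, end) tumbling window containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, slide):
    """Return every [start, end) sliding window containing timestamp."""
    # Latest window start that is aligned to the slide and <= timestamp.
    start = timestamp - (timestamp % slide)
    windows = []
    # Walk backwards one slide at a time while the window still covers
    # the timestamp, i.e. while start + size > timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)
```

With a size of 10 and a slide of 5, each timestamp falls into exactly one tumbling window but into two overlapping sliding windows.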

In a good follow-up to the previous post, this article provides examples and visualizations of how Flink's session windows work. It also covers how to reprocess data (including when data is no longer in Kafka and needs to be read from a file system) and visualizes how a join works.
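The idea behind session windows can be sketched in a few lines of plain Python: events are grouped into one session until a period of inactivity longer than a chosen gap occurs. This is only an illustration and not Flink's implementation. The function name `session_windows`, the `(first, last, count)` result shape, and the choice to keep events exactly `gap` apart in the same session are assumptions.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions split by more than `gap` of inactivity.

    Returns a list of (first_timestamp, last_timestamp, event_count) tuples.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            # Close enough to the previous event: extend the current session.
            first, _, count = sessions[-1]
            sessions[-1] = (first, ts, count + 1)
        else:
            # First event, or the inactivity gap was exceeded: new session.
            sessions.append((ts, ts, 1))
    return sessions


sessions = session_windows([1, 2, 10, 3, 11, 30], gap=5)
```

Unlike tumbling or sliding windows, the boundaries here are driven by the data itself, so the number and length of sessions are not known in advance.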

The Databricks blog has a guest post from the team at the Dollar Shave Club about how they connect Apache Spark (via Databricks) to Amazon Redshift for machine learning applications.

Amazon has a post on how to use the AWS Key Management Service to encrypt data as it goes into Amazon Kinesis (and decrypt it as it comes out).


This post has an interesting look at the complexity in open source caused by dependencies between project communities. Hadoop hasn't been the only game in town for a while, and dependencies (and thus bug fixes or new features) can end up being several projects deep.

Early bird registration for Spark Summit, which takes place in June, ends this week. The conference is three days at the Moscone Center in San Francisco, and there are nine tracks.

Registration is now open for HBaseCon West, which takes place in Mountain View on June 12th. The call for papers is open through April 24th.

The Linux Foundation, which runs the ODPi project, has announced a new free online course, "Intro to Apache Hadoop," through edX.

Cloudera has filed their S-1 IPO paperwork, and their IPO is anticipated for sometime in the next few weeks. TechCrunch has looked into the filing and highlighted some details (including a bit about the partnership with Intel). The stock is set to trade under the ticker CLDR.


Apache Hadoop 2.8.0 was released. It includes a number of improvements and new features across Hadoop Common (including improvements to the S3A file system and new integration with Azure Data Lake), HDFS (several enhancements to WebHDFS), YARN, and MapReduce.

Apache HBase 1.2.5 includes 50 resolved issues, several of which are critical bug fixes.

Apache Storm 1.1.0 is out with support for streaming SQL, improvements to the Apache Kafka integration, new integrations with Druid, OpenTSDB, and AWS Kinesis, and more.

Version 0.8-incubating of Apache Atlas was released. Atlas is a governance and metadata framework for Hadoop, which provides features like centralized auditing and data lineage.

Apache Zeppelin 0.7.1 was released. The new version has a rewritten Python interpreter, improved stability, and a number of other fixes.


Curated by Datadog



Apex Big Data World (Mountain View) - Tuesday, April 4

Apache Flink: Building a Stream Processor for Event-Driven Applications (San Francisco) - Thursday, April 6


Microservices With Hadoop (St. Louis) - Wednesday, April 5


Big Data & Predictive Analytics 2017 (Arlington) - Tuesday, April 4


Data Engineering London Meetup (London) - Tuesday, April 4


Big Data Analytics for Small & Medium Sized Enterprises. Is It Real? (Oslo) - Monday, April 3


Simple Real-Time Analytics With Spark + Build the Simplest Data Pipeline (Stockholm) - Thursday, April 6


The Hadoop Ecosystem (Madrid) - Monday, April 3

Machine Learning & PySpark With Spark Committer Holden Karau (Madrid) - Wednesday, April 5


Spark Meetup with Cloudera and Talend (Paris) - Monday, April 3

Meetup #5 (Tourcoing) - Thursday, April 6


Accelerating Apache Big Data and Machine Learning Workloads on OpenPOWER (Munich) - Tuesday, April 4

HS Munich: Tensorflow and Hadoop, Accelerating HBase, Hive vs. Spark (Munich) - Tuesday, April 4

Data Science With Spark on Hadoop (Munich) - Tuesday, April 4

After–Hadoop Summit Meetup: IoT, Data Science, Spark, Flink, Deep Learning (Munich) - Wednesday, April 5


Asynchronous Queue Solution with Kafka + Advanced Data Analysis on Hadoop Clusters (Brno) - Wednesday, April 5

Seznam’s Euphoria API (Prague) - Thursday, April 6


GeekNight: Converged Data Platform (Hyderabad) - Wednesday, April 5


IoT with Talend: Integrating Real-Time Data Streams With Spark and Kafka (Singapore) - Thursday, April 6


Kafka & Kafka Streams (Sydney) - Wednesday, April 5