Data Eng Weekly

Hadoop Weekly Issue #164

03 April 2016

Although Strata+Hadoop World was this week in San Jose, there were only a few announcements and releases (or maybe I'm bad at tracking them). The organizers are collecting slides from the presentations (see link below), but if there were particularly good sessions that I should highlight for next week's issue, please let me know. In any case, this week we have lots of great articles covering Apex, Flink, Spark, HBase, and more.


A post on the DataTorrent blog describes how Apache Apex, the stream processing framework, calculates the processing latency of a streaming application. In short, Apex calculates the latency of each operator using a control tuple, and it aggregates these across the application DAG to compute the latency for the entire application.
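As a rough illustration of the aggregation step, here is a minimal Python sketch. It assumes (this is not taken from the post) that per-operator latencies have already been measured and that the application latency is the largest summed latency along any path through the DAG. The function name, operator names, and numbers are all hypothetical, and Apex's actual calculation may differ.

```python
from functools import lru_cache


def application_latency(latency_ms, downstream):
    """Return the largest summed operator latency along any path in the DAG.

    latency_ms: dict mapping operator name -> measured latency (hypothetical values)
    downstream: dict mapping operator name -> list of operators it feeds
    Assumption: critical-path aggregation; the real Apex rule may differ.
    """

    @lru_cache(maxsize=None)
    def longest_from(op):
        # Latency of this operator plus the slowest path that follows it.
        tails = [longest_from(nxt) for nxt in downstream.get(op, [])]
        return latency_ms[op] + (max(tails) if tails else 0)

    return max(longest_from(op) for op in latency_ms)


# Hypothetical five-operator application with one fork and one join.
latency_ms = {"input": 5, "parse": 10, "enrich": 30, "dedupe": 8, "output": 4}
downstream = {
    "input": ["parse"],
    "parse": ["enrich", "dedupe"],
    "enrich": ["output"],
    "dedupe": ["output"],
}
total = application_latency(latency_ms, downstream)
```

In this made-up DAG the slowest path runs input, parse, enrich, output, so the slower branch of the fork determines the application figure.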

This post aims to be a comprehensive comparison of Apache streaming technologies, such as Flume, Apex, Spark Streaming, and Flink. In addition to the comparison matrix, there's a list of articles and other resources about several stream processing systems.

A two-part series on the Hortonworks blog looks at Hadoop in healthcare. The series describes some of the data issues (both volume/velocity/variety and data silos) in the industry, the types of data folks are looking at, the opportunity for big data in healthcare, and why Hadoop is a good tool for solving these problems.

This post describes an effort to add a new implementation of k-nearest neighbors to Apache Flink. Using quadtrees (which are described in the post), the amount of communication overhead during the computation can be reduced, which leads to drastically decreased runtime.
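To give a feel for why a quadtree helps, here is a small single-machine Python sketch of k-nearest-neighbor search over a point quadtree. It is not the Flink implementation and says nothing about its distributed communication pattern; the class and function names are hypothetical. It only shows the pruning idea: a quadrant whose bounding box is farther away than the current k-th best point never needs to be visited. The sketch assumes all points lie inside the root square.

```python
import heapq
import math


class QuadTree:
    """Point quadtree over a square centered at (cx, cy) with half-width `half`."""

    def __init__(self, cx, cy, half, capacity=4):
        self.cx, self.cy, self.half = cx, cy, half
        self.capacity = capacity
        self.points = []
        self.children = None  # becomes a list of four sub-quadrants after a split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split when over capacity (the size guard stops endless splits on duplicates).
        if len(self.points) > self.capacity and self.half > 1e-9:
            self._split()

    def _split(self):
        h = self.half / 2
        self.children = [
            QuadTree(self.cx + dx * h, self.cy + dy * h, h, self.capacity)
            for dx in (-1, 1) for dy in (-1, 1)
        ]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        # Index matches the (dx, dy) order used in _split.
        return self.children[(2 if p[0] >= self.cx else 0) + (1 if p[1] >= self.cy else 0)]

    def min_dist(self, q):
        """Lower bound on the distance from q to any point stored in this square."""
        dx = max(abs(q[0] - self.cx) - self.half, 0.0)
        dy = max(abs(q[1] - self.cy) - self.half, 0.0)
        return math.hypot(dx, dy)


def k_nearest(tree, q, k):
    """Best-first search: visit quadrants in order of their lower-bound distance."""
    best = []  # max-heap of (-distance, point) holding the k closest so far
    frontier = [(tree.min_dist(q), 0, tree)]
    counter = 1  # tie-breaker so the heap never compares QuadTree objects
    while frontier:
        bound, _, node = heapq.heappop(frontier)
        if len(best) == k and bound > -best[0][0]:
            break  # every remaining quadrant is farther than the k-th best point
        if node.children is None:
            for p in node.points:
                d = math.dist(p, q)
                if len(best) < k:
                    heapq.heappush(best, (-d, p))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, p))
        else:
            for child in node.children:
                heapq.heappush(frontier, (child.min_dist(q), counter, child))
                counter += 1
    return [p for _, p in sorted(best, reverse=True)]  # nearest first


# Hypothetical data: a 10x10 integer grid inside the square [0, 10] x [0, 10].
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
tree = QuadTree(5.0, 5.0, 5.0)
for point in grid:
    tree.insert(point)
neighbors = k_nearest(tree, (2.2, 3.1), 3)
```

The early `break` is where the savings come from: once the closest unexplored quadrant is farther than the current k-th neighbor, the rest of the tree is skipped.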

The AWS big data blog has a tutorial describing how to set up SparkR on EMR with the RStudio IDE. There's an automated bootstrap action, a description of how to connect to the cluster, and example code to do some basic tasks from R.

A Medium post describes how Salesforce is using Apache HBase. It describes why they chose HBase over other NoSQL stores, when they use HBase (i.e. what types of use-cases they recommend for it), and some of the Salesforce applications/features powered by HBase.

"Rapid Data Analytics @ Netflix" is a presentation (slides and video) that focusses almost entirely on culture rather than technology. For example, the Netflix data team decided that the benefits of letting everyone have admin privileges on the data warehouse outweigh the costs. Given that setup, they have put in place a fast backup/restore strategy in case someone makes a mistake (such as dropping a table). There are several other revisited assumptions, such as their 'on-call' strategy.

ADAM is an in-memory MapReduce framework for genomic analysis. This post describes configuring ADAM on Spark on an Amazon EMR cluster and using the adam-shell.

The Strata+Hadoop website has an index of all of the presentations for which speakers have published slides. There are presentations on Spark, Flink, Spark Streaming, Kafka, Hadoop in the cloud, and more.


ODPi has announced the first version of the ODPi Runtime Specification, which is based on Apache Hadoop 2.7. It covers Hadoop Common, HDFS, YARN, and MapReduce.

The ODPi is now a Linux Foundation Collaborative Project. InfoWorld discusses why this helps make the project more vendor-independent, at least in perception.

Confluent has announced Confluent University, a new training program for Kafka development and operations. There are upcoming trainings in New York, San Francisco, Austin, and Redwood City.

data Artisans, the company founded by the creators of Apache Flink, has announced a 5.5 million euro round of Series A financing.

Apache Sentry, the fine-grained authorization framework for the Hadoop ecosystem, has graduated from the Apache Incubator. The Apache blog has more about the progress of the project while in the incubator and its future trajectory.

Databricks has published a new free (behind an email-wall) eBook, "Apache Spark Analytics Made Simple."

The call for speakers for Strata+Hadoop World New York, which takes place in September, ends in just over a week at 11:59pm EDT on April 11th.


Apache NiFi 0.6.0 was released this week. It adds support for Kerberos Authentication for its REST API, includes several updates and stability improvements, and adds new support for Amazon Kinesis, AWS Lambda, Splunk, and Apache Cassandra.

Based on the Apache NiFi 0.6.0 release, Hortonworks has released version 1.2 of Hortonworks DataFlow.

flink-htm is a new library for streaming anomaly detection/prediction, based on Hierarchical Temporal Memory (HTM) algorithms, with Apache Flink.

gogen-avro is a new experimental library for generating Go structs based on Avro type definitions to provide a nicer API and speed up encoding data.


Curated by Datadog



New Features in Flink 1.0.0 + Recent Performance Benchmarks (San Francisco) - Tuesday, April 5

Next-Generation Python Big Data Tooling, Powered by Apache Arrow (San Francisco) - Tuesday, April 5

Introduction to Apache Apex: The Next Generation Native Hadoop Platform (Fremont) - Tuesday, April 5

Apache Flink 1.0.0, MapR Streams and Recent Benchmarks (San Jose) - Wednesday, April 6

Apache Apex Double Feature: Fault Tolerance and Kafka Integration (San Francisco) - Wednesday, April 6


Efficient State Management with Spark 2.0 (Portland) - Thursday, April 7


Hadoop Options on Azure (Tempe) - Wednesday, April 6


Real World Big Data at Sonic: Learn More and Remove Duplicates with Spark (Oklahoma City) - Thursday, April 7


Spark Streaming: A Practical Example (Saint Louis) - Wednesday, April 6


What Is All the Hype about Apache Spark (Chicago) - Thursday, April 7


Data Science at Scale with Spark (Milwaukee) - Tuesday, April 5


Replicating Relational Database Binary Logs to Kafka (McLean) - Thursday, April 7

New Jersey

Deep Dive Avro and Parquet: Read Avro/Write Parquet Using Kafka and Spark (Hamilton Township) - Tuesday, April 5

New York

Using Apache Spark for Mastering Customer Data (New York) - Wednesday, April 6


Hadoop Deployments in Real World Scenarios (Kitchener) - Tuesday, April 5


From the Source: Learn about Apache Flink from a Project Committer (London) - Thursday, April 7


Big Data & Real Time Analytics at (Berlin) - Wednesday, April 6


Hadoop Ecosystem Essentials & Workshop (Istanbul) - Saturday, April 9


In Memory OLAP on Hadoop Using Spark (Herzelia) - Tuesday, April 5

Shuffling Spark with Kafka, Standalone Spark Approach (Tel Aviv-Yafo) - Tuesday, April 5


Shanghai Spark Meetup (Shanghai) - Saturday, April 9


Adelaide Apache Spark User Group 2016 Kickoff (Adelaide) - Wednesday, April 6