Data Eng Weekly

Hadoop Weekly Issue #176

26 June 2016

Hadoop Summit is this week in San Jose so expect to see lots of announcements and presentations (please send any relevant slides my way!) in next week's issue. For this week's newsletter, there are some great posts on Kafka Streams, streaming data to Google BigQuery from Amazon Kinesis, and Google's Dataset Search system.


Shine has written about how they use AWS Lambda and Amazon Kinesis, along with the Kinesis agent for the Apache web server, to move data from EC2 to Google BigQuery. The post has code snippets of Lambda functions (written in JavaScript), information on scale and cost, and a description of how to optimize transfer costs by gzipping data.
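To illustrate the gzip idea mentioned above, here is a minimal sketch in Python (not the post's JavaScript Lambda code). It assumes records are JSON objects batched as newline-delimited text before transfer; the function names pack_records and unpack_records are hypothetical.

```python
import gzip
import json


def pack_records(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return gzip.compress(payload)


def unpack_records(blob):
    """Reverse pack_records: gunzip, then parse one JSON object per line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

Web server log records tend to be repetitive, so batches like this usually compress well; the actual savings depend on the data.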

The Cloudera blog has a post that describes data analysis on fantasy sports data with Apache Spark, Apache Impala (incubating), and Hue. The post mainly focuses on the analysis, but it has a bit of Spark code and demonstrates some of the functionality available in Hue.

KDnuggets has an article that goes through 13 of the main APIs/projects/terms related to Apache Spark. These include RDD, DataFrame, Dataset, structured streaming, GraphX, and Tungsten. There are a few paragraphs on each, which is enough to get a good overview of Spark's main features.

This post from the Confluent blog looks at a simple but non-trivial application of Kafka Streams. Specifically, Kafka Streams is used to write a program that joins user clickstream data with user location data. The latter is stored in a KTable, which provides an abstraction similar to a database table with a primary key (the latest value for each primary key is exposed via the APIs). The resulting program is quite simple: only a few lines of code.
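The table semantics described above can be sketched without Kafka at all. The following is a plain Python simulation, not the Kafka Streams API: a dict stands in for the KTable, and the event format and function name are assumptions made for illustration. Clicks with no known location are emitted with None, which is a left-join-style choice made here for simplicity.

```python
def join_clicks_with_locations(events):
    """Join click events against the latest known location per user.

    events: iterable of (kind, user, value) tuples in arrival order,
    where kind is "location" or "click" (a hypothetical format).
    """
    latest_location = {}  # stands in for the table: one row per user key
    joined = []
    for kind, user, value in events:
        if kind == "location":
            # A table update overwrites the previous value for the key.
            latest_location[user] = value
        elif kind == "click":
            # A stream record is enriched with the current table value.
            joined.append((user, value, latest_location.get(user)))
    return joined
```

For example, if a user's location changes between two clicks, the first click is joined with the old location and the second with the new one.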

The Cloudera blog has a post about an anomaly detection system for HTTP requests that's built on Apache Flume, Apache Spark Streaming, and Apache Impala (incubating). The code to implement the framework is available on GitHub.

The AWS Big Data blog has a tutorial that shows how to process data from an Amazon Kinesis stream on an Amazon EMR cluster using Apache Spark and Apache Zeppelin. The post includes some example visualizations generated by executing SQL from the Zeppelin notebook.

Apache Kudu (incubating) is close to a 1.0 release that will fully support high availability. This post describes how the last piece of that puzzle, master replication, is implemented. The post also points folks to the JIRA issue that is tracking the work and gives a brief overview of the implementation and testing work that remains.

Google has over 26 billion datasets across all of their data platforms, and they're adding and removing 1.6 billion dataset paths every day. To track, search, and compare datasets, they've developed the Google Dataset Search (GOODS). GOODS tracks metadata, which is exposed via an API, and can be used for search, monitoring, and more.


SiliconAngle has an interview with Hortonworks CEO Rob Bearden. Topics discussed include industry trends, Hortonworks financials, non-Hadoop tech at Hortonworks, and the Internet of Things.


Apache Sentry 1.7.0 was released this week with bug fixes, new features, and improvements. Among them, this release upgrades to v2 of the Hive authorization framework.

DataStax Enterprise 5.0, which is based on Apache Cassandra 3.0, adds support for graph data, tiered storage, and multi-instance for Cassandra. The release also includes additional security features like encryption and role-based access control.

Driven, the big data application performance monitoring system, has released version 2.2. The highlight of this release is general availability of support for Apache Spark in Driven.

BlueData has announced the release of their EPIC Enterprise Big Data as a Service product for Amazon Web Services. The software can be used for automatically provisioning Docker-based Hadoop clusters with a few clicks.

Apache Accumulo 1.7.2 was released. It includes bug fixes to write-ahead log handling, optimizations for RFiles, and minor performance improvements.

Versions 2.11.0 and 3.2.0 of Apache Curator, the high-level SDK for Apache ZooKeeper, have been released.

Apache Hive 2.1.0 was released. It includes a large number of bug fixes and improvements, among them changes to Hive's LLAP (Live Long and Process) as well as JDBC support.


Curated by Datadog



Apache Metron Overview and Demo @ Hadoop Summit - Monday, June 27

Apache Accumulo Meetup at Hadoop Summit (San Jose) - Monday, June 27

Building Big Data Applications with Apache Beam and Apache Apex (San Jose) - Monday, June 27

Robust Stream Processing with Apache Flink and Flink-Htm (San Jose) - Monday, June 27

Apache Ambari Meetup at Hadoop Summit (San Jose) - Monday, June 27

War Stories of Making Software Work with Hadoop (San Jose) - Monday, June 27

NiFi Meetup at Hadoop Summit (San Jose) - Monday, June 27

North Carolina

Introduction to Spark In-Memory Computing (Durham) - Tuesday, June 28

New York

How to Use Apache Ignite, In-Memory Data Fabric (New York) - Tuesday, June 28


Toronto Apache Spark #10 (Toronto) - Wednesday, June 29


Crowdmix: An Event-Based Social Music Platform & Kafka 0.10 New Features (London) - Tuesday, June 28


Introduction to Spark 2.0 (Bangalore) - Saturday, July 2


Shanghai BigData Streaming 3rd Meetup (Shanghai) - Saturday, July 2