Data Eng Weekly

Hadoop Weekly Issue #170

15 May 2016

Stream processing has been dominating this newsletter recently, but this is a more balanced issue. There are several articles about Apache HBase, a post on Apache Atlas, and a preview of Spark 2.0. On the stream processing front, King has written about their Rule-Based Event Aggregator, AWS has published on Apache Ignite with DynamoDB, and Twitter has open sourced their DistributedLog platform.


King, makers of Candy Crush and other game franchises, have written a post about their real-time Rule-Based Event Aggregator (RBEA). Built on Apache Flink and Apache Kafka, RBEA exposes a web interface for defining, starting, stopping, and investigating rules written as Groovy scripts. These scripts are hot-deployed by serializing them to a particular Kafka topic. RBEA uses Flink's advanced windowing capabilities, and there are code snippets of this and other examples in the post.

The Cloudera blog has a post demonstrating the trade-offs of storing data in wide columns (a single Apache Avro object serialized to bytes in an HBase cell) vs. narrow columns (each field of the Avro object placed in its own cell).
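
As a rough illustration of the two layouts (not the code from the Cloudera post), the sketch below models an HBase row as a plain Python dict of column to value. It makes several assumptions: JSON stands in for Avro so the example needs only the standard library, and the record, the row key, the "d" column family, and the helper names are all hypothetical.

```python
import json

# Hypothetical record and row key; JSON stands in for Avro serialization here.
record = {"user_id": "u42", "name": "Ada", "country": "UK"}
row_key = b"u42"


def to_wide(key, rec):
    """Wide layout: the whole record serialized into a single cell."""
    payload = json.dumps(rec, sort_keys=True).encode("utf-8")
    return {key: {b"d:payload": payload}}


def to_narrow(key, rec):
    """Narrow layout: one cell per field, using the field name as qualifier."""
    cells = {b"d:" + name.encode("utf-8"): str(value).encode("utf-8")
             for name, value in rec.items()}
    return {key: cells}


def read_field_wide(table, key, field):
    # The full blob must be fetched and deserialized to read one field.
    blob = table[key][b"d:payload"]
    return json.loads(blob.decode("utf-8"))[field]


def read_field_narrow(table, key, field):
    # Only the requested cell is touched.
    return table[key][b"d:" + field.encode("utf-8")].decode("utf-8")


wide = to_wide(row_key, record)
narrow = to_narrow(row_key, record)
print(len(wide[row_key]), "cell in the wide layout")
print(len(narrow[row_key]), "cells in the narrow layout")
```

The wide layout keeps the cell count low but pays a deserialization cost on every read, while the narrow layout allows per-field reads at the price of more cells per row.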

Data lineage, or tracking the path/history of data across systems, is incredibly useful but difficult to implement. Apache Atlas aims to change this, at least if you're using a number of components of the Hadoop ecosystem such as Sqoop, Hive, Kafka, Storm, and Falcon. The Hortonworks blog has more info about this upcoming support in Falcon.

This tutorial describes how to configure Apache NiFi with Apache Solr and Lucidworks Banana (a port of Kibana for Solr).

The Cloudera blog has a post about how the Santander UK team uses HBase for real-time inserts and queries. The post describes their schema design for transactions and trends, and how they use HBase coprocessors to pre-compute trends. The post has a number of details about the implementation, performance, and lessons learned.

This post describes the security model of the Apache HBase REST API gateway, and it provides a tutorial for configuring and using it when secured with Kerberos.

The Silicon Valley Data Science blog has a post describing how to build a streaming regression model for predicting Meetup RSVP volume using the streaming API. The demo is built with Apache Kafka, Spark, Kudu, and Impala. Predictions are built using Spark's MLlib and the MADlib library with Impala.
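
For readers new to the idea of a streaming regression model, here is a minimal plain-Python sketch of the general technique: a linear model whose weights are updated with stochastic gradient descent as each mini-batch arrives. It does not use Spark or MLlib and is not the code from the post. The class name, the learning rate, and the synthetic data (y = 2x + 1) are assumptions chosen for illustration.

```python
class OnlineLinearRegression:
    """Linear model updated incrementally, one mini-batch at a time."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, batch):
        """Apply one SGD step per (features, target) pair in the batch."""
        for x, y in batch:
            error = self.predict(x) - y
            self.weights = [w - self.learning_rate * error * xi
                            for w, xi in zip(self.weights, x)]
            self.bias -= self.learning_rate * error


# Synthetic stream: every mini-batch follows y = 2x + 1 (hypothetical data).
model = OnlineLinearRegression(n_features=1, learning_rate=0.1)
batch = [([x], 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
for _ in range(1000):
    model.update(batch)

print(model.predict([0.5]))
```

In a real pipeline the mini-batches would come from a stream such as a Kafka topic rather than a fixed list, and the model would be used for prediction between updates.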

The AWS Big Data Blog has a post on combining Amazon DynamoDB, Kinesis, and Redshift with Apache Ignite and Apache Zeppelin for in-memory, real-time analytics. The post includes CloudFormation scripts for building an example Ignite cluster and some sample code for performing the analysis.

Databricks has made a preview of Spark 2.0 available on their platform. The introductory post has an overview of new features in the upcoming release, which is useful even if you're not a customer. Features include streamlined APIs, faster code via the next gen Tungsten engine, and structured streaming.


Given its reliance on Apache HDFS, Apache HBase has the reputation of being complex to deploy. This post explores the notion that HBase's close ties to Hadoop are the cause of both its initial adoption and its recent slowing adoption (compared to systems like MongoDB).

Apache Big Data North America took place this week in Vancouver. This article recaps one of the keynotes and includes an interview with the speaker.

Apache MRUnit, the unit testing framework for Hadoop jobs, has been retired to the Apache Attic.

The Open Data Platform Initiative (ODPi) became a gold sponsor of the Apache Software Foundation (ASF) this week. This post, which includes an interview with the ODPi Director of Program Management, explores the relationship between the ODPi, the ASF, and several of the Hadoop vendors.

PredictionIO, a machine learning server built on Spark, HBase, and Spray, has been submitted to the Apache Incubator. While the software has been open source for a while, the company behind it was recently acquired by Salesforce.

Videos and slides for presentations and keynotes at Kafka Summit have been posted online. Videos are behind an email-wall.


Apache Storm released two new versions in the past week or so. Version 0.10.1 includes a number of bug fixes. Version 1.0.1, being the first dot-release since 1.0, includes bug fixes, performance improvements, and more.

Apache Ambari 2.2.2 was released last week. One of the new features of the release is support for Grafana to query and visualize the contents of the Ambari Metrics system. The Hortonworks blog has more on this new integration.

Apache Flink 1.0.3 was released with a number of bug fixes and documentation improvements.

Hue 3.10 was released this week. Among the new features are a fully-revamped SQL Editor and a new UI for the SQL Browser.

The Apache Apex 3.4.0 release resolves a security issue with a third-party library. Due to the nature of the library upgrade, this is a backwards-incompatible upgrade. See the release announcement for more details.

Spring Cloud Stream has hit general availability with the 1.0.0 release. It provides a stream processing framework for a number of underlying messaging systems, including RabbitMQ, Apache Kafka, and Redis.

Twitter has open-sourced DistributedLog, their "high-performance, replicated log service." Twitter has previously written about DistributedLog's model and features (which are very similar to Apache Kafka).

MapR announced support for Apache Spark 1.6.1 on the MapR Converged Data Platform.


Curated by Datadog



May Kafka Meetup (San Francisco) - Tuesday, May 17

Apache Spark: Starting Your Big Data Journey (San Jose) - Tuesday, May 17

#OCBigData Meetup #17 (Irvine) - Wednesday, May 18


5 Ways to Use Spark to Enrich Your Cassandra Environment (Dallas) - Wednesday, May 18

Laying Down the SMACK on Your Data Pipelines (Austin) - Thursday, May 19


What's Coming in Spark 2.0 (Chicago) - Wednesday, May 18


Apache Drill and Apache Spark (Green Bay) - Tuesday, May 17


The Big Data Puzzle: Where Does the Eclipse Piece Fit? (Huntsville) - Tuesday, May 17

New Jersey

Spark Workshop: Broadcast, Accumulators, Future of RDD (Princeton) - Thursday, May 19

New York

Hands-On Intro to Big Data Analytics Using Apache Spark and Apache Zeppelin (New York) - Thursday, May 19


Apache Kudu (Montreal) - Tuesday, May 17


Python + Redis/Kafka/Flink (Madrid) - Tuesday, May 17

Apache Flink Workshop (Madrid) - Friday, May 20


First Meetup: Presentation on Apache Kafka (Paris) - Tuesday, May 17


Growing Into a Proactive Data Platform (Ra'anana) - Monday, May 16

Spam Detection with Kafka and Samza + “Your Data Isn't That Big” (Tel Aviv-Yafo) - Monday, May 16

Hadoop: Still the Core of Big Data (Tel Aviv-Yafo) - Wednesday, May 18