Data Eng Weekly

Hadoop Weekly Issue #200

15 January 2017

This week's issue is #200 and also marks nearly four years of Hadoop Weekly. I certainly couldn't have predicted where the Hadoop ecosystem would be today, but it's clear that Hadoop has staying power. Who knows where things will be in another 200 issues... but chances are we'll still be talking about YARN, Kafka, and Spark (among others), all of which are covered in this week's content.


This article provides an in-depth tutorial walking through the steps of configuring an Apache Hadoop cluster using Apache Ambari. Using LXC and passwordless SSH makes for an easy way to do a proof of concept.

The morning paper covered the 2013 publication on YARN from the ACM Symposium on Cloud Computing. If you're unfamiliar with the history/motivation of YARN and some of the core concepts, this is a great recap of those pieces.

The MapR blog has a two-part series on integrating complex event processing into a streaming architecture using the open-source Drools engine. As an example use-case, the tutorial has a script for generating synthetic sensor data related to road traffic. Data is ingested using StreamSets and can be visualized using Kibana.
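For readers who want a feel for what generating synthetic sensor data involves, here is a minimal Python sketch. It is not the script from the MapR tutorial: the function name `generate_readings` and the fields `sensor_id`, `speed_kmh`, and `vehicle_count` are all hypothetical, chosen only to illustrate the idea of emitting JSON records that an ingestion tool could pick up.

```python
import json
import random
from datetime import datetime, timedelta, timezone


def generate_readings(n, sensor_ids, start, seed=0):
    """Yield n synthetic road-traffic readings, spaced one second apart.

    All field names here are illustrative assumptions, not the schema
    used by any particular tutorial or product.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    for i in range(n):
        yield {
            "sensor_id": rng.choice(sensor_ids),
            "timestamp": (start + timedelta(seconds=i)).isoformat(),
            "speed_kmh": round(rng.uniform(0.0, 130.0), 1),
            "vehicle_count": rng.randint(0, 40),
        }


if __name__ == "__main__":
    start = datetime(2017, 1, 15, tzinfo=timezone.utc)
    # Print one JSON document per line, a common format for ingestion tools.
    for reading in generate_readings(5, ["s1", "s2", "s3"], start):
        print(json.dumps(reading))
```

Writing one JSON object per line keeps the output easy to tail from a file or pipe into a pipeline, and the fixed seed makes test runs repeatable.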

Shasta is the name of Google's system for OLAP and OLTP atop their RDBMS (called F1) and other data systems. A key part of Shasta is the Relational View Language (RVL), which is much more flexible than SQL (supporting variables, aliases, and more). The morning paper has a summary of the Shasta paper, including the goals of Shasta and RVL as well as how the framework compares to other systems.

Stream processing system Apache Apex recently added support for SQL by integrating with Apache Calcite. This post describes the integration, explains how to use it, and provides an example code snippet that runs a basic SQL statement.

In a meta getting started guide for Apache Kafka, this post points to several different pieces of documentation for someone who wants to learn Kafka. There are collections of resources for software engineers, sys admins, data engineers, and more. The post includes some useful experiments to test how well you've understood the getting started content.


"Practical Data Science with Hadoop and Spark" was published towards the end of last year and is now available on Amazon, Barnes & Noble, and more.

Hortonworks has posted their 2016 year in review, which covers their vendor partners, product updates, cloud initiatives, and business numbers. Of note, they now have over 1,000 employees and over 1,800 companies that have joined Partnerworks.

Spark Summit East is in Boston in just under a month. The Databricks blog highlights five talks from the schedule.

Apache Eagle, which is a Hadoop ecosystem project that's flown a bit under my radar, was promoted to be a top-level Apache project this week. Using metrics and logs, Eagle provides security and performance analysis of Hadoop, Spark, YARN, and more.

Apache Beam also graduated from the Apache incubator this week. Beam was under incubation for just under a year, and has undergone a few releases in that time.

ZDNet has an article that provides some more background on Apache Beam, including its origins at Google. It highlights that while Beam supports Spark Streaming as an execution engine, it's also a potential competitor to it.

This post provides a brief synopsis of 11 books on Apache Spark.


Uber has open-sourced Cherami, their message queue system. It fills a similar role to Amazon SQS (having replaced Celery at Uber) for running tasks. The introductory post describes the architecture and design trade-offs in detail. Notably, it uses RocksDB for storage and Apache Cassandra for storing metadata.

In addition to graduating from the incubator, Apache Beam announced version 0.4.0 this week. The new release adds support for Apache Apex, which is currently supported in embedded (single JVM) mode, with YARN support planned as a fast follow-on. The release also contains quite a few bug fixes and improvements.


Curated by Datadog



Apache Kafka Meetup with Walmart Labs and Confluent (Sunnyvale) - Wednesday, January 18

Apache Edgent: Streaming and Analytics for IoT Devices (San Francisco) - Wednesday, January 18

gRPC, Kubernetes, Mesos, Spark ML, Tensorflow, HDFS, Kafka (San Francisco) - Thursday, January 19


Apache Spark Lightning Talks (Seattle) - Tuesday, January 17


Fast Data: Selecting the Right Streaming Technologies for Never-Ending Data Sets (Chicago) - Tuesday, January 17

Spark Discussion with Dr. Alex Liu, IBM's Chief Data Scientist (Chicago) - Tuesday, January 17


Apache Impala/IoT Data Analytics: Hype or Truly Transformative (Green Bay) - Tuesday, January 17

Apache Spark Machine Learning Blueprints (Milwaukee) - Wednesday, January 18


Building Reactive Fast Data & The Data Lake with Akka, Kafka, Spark (Atlanta) - Tuesday, January 17

Evolving Beyond the Data Lake (Atlanta) - Wednesday, January 18


A State of the Union Panel Discussion on Apache Spark (Tysons) - Thursday, January 19

New York

Fast Data/Event-Driven Architecture (New York) - Thursday, January 19


Introduction to MapR (Toronto) - Thursday, January 19


Instrumenting Apache Kafka (London) - Wednesday, January 18

Staging Reactive Data Pipelines Using Kafka as the Backbone (Manchester) - Wednesday, January 18


Meetup #2 (Villeneuve d’Ascq) - Thursday, January 19


Running Kafka in Production and Streamsets (Brussels) - Wednesday, January 18


Sensor Data Ingestion and Processing with NiFi and Spark (Amsterdam) - Tuesday, January 17


How to Apply Machine Learning to Real-Time Processing (Nuremberg) - Tuesday, January 17


Reintroduction to Hadoop Zoo (Prague) - Thursday, January 19


Introduction to the Hadoop Ecosystem (Warsaw) - Thursday, January 19


Airflow & Luigi: The Flow of Your Data (Tel Aviv-Yafo) - Wednesday, January 18