Data Eng Weekly

Hadoop Weekly Issue #208

12 March 2017

Strata+Hadoop World is this week in San Jose. Please keep an eye out for talks and announcements to send my way (some have already snuck in the week before). But in the meantime, there are two great papers this week covered by The Morning Paper, several articles on YARN, a post about the history of the HDFS audit log, and a discussion of HBase optimizations at Alibaba.


StreamSets Data Collector can be used to do change data capture from MongoDB and write to a configurable destination system. This post shows how you can mirror your data to MapR-DB as JSON.

The Morning Paper has a summary of the HopFS paper that was recently presented at the USENIX Conference on File and Storage Technologies (FAST). Based on Apache Hadoop DFS version 2.0.4, HopFS replaces the NameNode/JournalNode/ZooKeeper cluster with stateless namenodes that store metadata in a MySQL cluster for high availability. The summary has some interesting notes on the more subtle design considerations for locking/transactions. The authors analyzed performance based on real-world recordings of metadata operations on a large Hadoop cluster at Spotify.

Hortonworks has two posts on the future of YARN. The first introduces the notion of "multi-colored" YARN, which is the promise that YARN can take on even more types of applications and workloads, especially driven by work to add Docker support. The second describes one of these workloads, a GPU-heavy TensorFlow application, and provides a tutorial and sample code for getting started with TensorFlow on YARN.

Using the Attach API extension to the JVM, the application described in this post monitors the heap values in a JMX MBean to detect when an out-of-memory situation is about to happen.
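
For a rough sense of the idea, here is a minimal sketch (not the post's code). It skips the Attach API entirely and just reads the heap figures from the in-process MemoryMXBean, then compares used heap to the maximum. The HeapWatch class, the isNearLimit helper, and the 90% threshold are all hypothetical choices made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not taken from the linked post.
final class HeapWatch {

    // Returns true when used/max meets or exceeds the given threshold.
    // MemoryUsage.getMax() returns -1 when no maximum is defined,
    // in which case this sketch never raises an alert.
    static boolean isNearLimit(long usedBytes, long maxBytes, double threshold) {
        if (maxBytes <= 0) {
            return false;
        }
        return (double) usedBytes / maxBytes >= threshold;
    }

    public static void main(String[] args) {
        // Read heap usage of the current JVM (assumption: in-process
        // monitoring, rather than attaching to another JVM).
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        // 0.90 is an arbitrary illustrative threshold.
        boolean alert = isNearLimit(heap.getUsed(), heap.getMax(), 0.90);
        System.out.println("heap used=" + heap.getUsed()
                + " max=" + heap.getMax()
                + " nearLimit=" + alert);
    }
}
```

A real monitor would presumably poll on a schedule and look at usage after garbage collection rather than a single instantaneous reading, since the heap routinely fills up between collections.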

Automated testing with a distributed system is usually pretty tricky. This post describes a setup involving Docker, GitLab CI, and Druid for testing via Python.

Another paper from FAST looks at what happens to several popular distributed systems (including Kafka, ZooKeeper, and Cassandra) in the face of data corruption or local FS errors. Of the systems tested, every one experienced some kind of failure due to assumptions about the underlying storage.

The HDFS audit log is a great tool for tracking and debugging what is going on in your HDFS cluster. This post describes how it came out of a need at Yahoo! (and includes some interesting history about the early days of Hadoop).

The Apache Software Foundation blog has a post from Alibaba about their usage of HBase, which powers a number of different applications in their e-commerce search infrastructure. Recently, they undertook an initiative to backport a patch to improve throughput and reduce garbage collection by avoiding the heap when reading data from the HBase Level 2 Bucket Cache. The post has more details on the internals of HBase caching, and it shares the type of improvement they saw from the implementation.

The AWS Big Data blog has a post that puts together Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight to analyze VPC Flow Logs. If you've ever tried to analyze these logs by hand like I have, then you're probably excited by the potential of this setup.

This post has an in-depth look at YARN node labels. It describes how they work at a high level, how they relate to the capacity scheduler's job queues, the settings that can be used to adjust labels and queues, and more.


Rumors are swirling that Cloudera is about to announce their IPO.

Confluent announced that they've raised a $50 million series C. The post by Confluent CEO Jay Kreps describes what Confluent has been up to over the past year.


Apache Ignite, the in-memory data grid, released version 1.9 with support for Kubernetes, better API support for .NET and C++, improved compatibility with Spark, and more.

Unravel has announced a new version of their platform which includes an Application Performance Monitoring component. They're touting it as the first such system for big data.

Pepperdata also announced a new application profiler tool for Hadoop based on the open-source Dr. Elephant.

The new version of BlueData EPIC supports hybrid clouds where some data/clusters are in AWS and some are on-premises.


Curated by Datadog



SF Spark+SF Hadoop Joint Meetup at Strata: Apache Spot and Kudu (San Francisco) - Monday, March 13

Airflow Meetup (San Jose) - Tuesday, March 14

Apache Kafka Meetup at Strata San Jose (San Jose) - Tuesday, March 14

Big Data Science @ Strata + Hadoop World (San Jose) - Tuesday, March 14

Hadoop on Bare Metal + Data Prep with Spark (Irvine) - Wednesday, March 15

Dr. Elephant Meetup: Strata + Hadoop World Edition (Sunnyvale) - Thursday, March 16

100% Streaming!!! Google Cloud + Apache Beam + Flink (San Francisco) - Thursday, March 16


Spark Meetup 2017 #1 (Mason) - Wednesday, March 15

New Jersey

NiFi: Flow Programming and Financial Trade Routing (Hamilton) - Thursday, March 16


Spark on AWS (Montreal) - Wednesday, March 15


Hadoop Deep Dive Part I: Motivation and Fundamentals (Medellin) - Tuesday, March 14


Hadoop Meetup with Cloudera and Synaltic (Paris) - Wednesday, March 15

Apache Flink: Use Cases (Neuilly-Sur-Seine) - Wednesday, March 15


10th BigData/DataScience Meetup (Cluj-Napoca) - Wednesday, March 15


Spark Runbook: Learnings From 5 Years of Spark for Analytics (Singapore) - Monday, March 13