Data Eng Weekly

Hadoop Weekly Issue #185

05 September 2016

This is a short and sweet issue covering Facebook's usage of Spark, a comparison between Apache Flink and Apache Kafka Streams, new releases of Apache Ambari and Hortonworks Data Flow, and more. With the three-day weekend in the US, I expect it to be another slow news week, so send anything interesting you find my way!


The Cloudera blog has a post describing how they use LD_PRELOAD to load a library for dynamic fault injection into file operations. Using the tool, they add delay, introduce corruption, and simulate disk access failure.
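
To make the idea concrete, here is a minimal sketch of the three fault types mentioned above (delay, corruption, and access failure). It is not Cloudera's tool: the post works at the libc level via LD_PRELOAD, while this sketch swaps in a plain Python wrapper around a file-like object. The class name `FaultyFile` and its parameters are hypothetical, chosen only for illustration.

```python
import io
import random
import time


class FaultyFile:
    """Hypothetical wrapper that injects faults into read() on a binary file-like object."""

    def __init__(self, inner, delay_s=0.0, corrupt_prob=0.0, fail_prob=0.0, seed=None):
        self._inner = inner
        self._delay_s = delay_s            # seconds to sleep before each read
        self._corrupt_prob = corrupt_prob  # chance of flipping one bit per read
        self._fail_prob = fail_prob        # chance of raising OSError per read
        self._rng = random.Random(seed)    # seeded for repeatable test runs

    def read(self, size=-1):
        # Simulated disk access failure: raise before touching the data.
        if self._rng.random() < self._fail_prob:
            raise OSError("injected I/O failure")
        # Added latency, as if the device were slow.
        if self._delay_s > 0:
            time.sleep(self._delay_s)
        data = self._inner.read(size)
        # Corruption: flip the lowest bit of one randomly chosen byte.
        if data and self._rng.random() < self._corrupt_prob:
            buf = bytearray(data)
            pos = self._rng.randrange(len(buf))
            buf[pos] ^= 0x01
            data = bytes(buf)
        return data


# With all probabilities left at zero, the wrapper is a pass-through.
clean = FaultyFile(io.BytesIO(b"hello"))
assert clean.read() == b"hello"
```

A preloaded library applies the same kind of logic to every process without code changes, which is what makes the LD_PRELOAD approach attractive for testing existing systems.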

The Databricks blog has a bi-weekly recap of Spark news. There are links to a number of presentations, interviews, and blog posts covering Spark packages, Riak integration, and more.

Facebook has a fascinating post about how they've converted a massive job, which prepares features for entity ranking, from Hive to Spark. The job consumes 60TB of compressed data and took multiple days to run on Hive. Along the way to achieving a 5x speedup in Spark, the Facebook team found and fixed a number of performance bottlenecks.

In this post, two very smart folks from the distributed systems/stream processing ecosystem discuss Apache Kafka Streams and Apache Flink. They give a brief introduction to the two frameworks, discuss the intended use cases, and describe other characteristics such as the deployment model. Given the number of processing frameworks available, this guide provides good context for what to use and criteria for evaluating a particular framework.

NDBench is a new open-source project from Netflix. It's used to benchmark data systems like Cassandra, Redis, and Elasticsearch for throughput and latency. Netflix uses it to validate changes ranging from large operating system migrations to new AMIs built with new versions of data applications.

TLA+ "is a formal specification language developed to design, model, document, and verify concurrent systems." The Dr. TLA+ Series is covering a number of protocols with slides, videos, and notes being posted online. So far, Paxos, Raft, and Fast Paxos have been covered.


Paige Roberts has an interview with Yolanda Davis of Hortonworks about her work on Hortonworks Data Flow and Apache NiFi. The interview covers one of the differentiators of NiFi (tracking and lineage) and future plans to integrate with Apache Atlas. Paige and Yolanda also discuss experiences from the Women in Big Data lunch and volunteering with organizations like Black Girls Code.

Cassandra Summit 2016 is this week in San Jose. Speakers include folks from Apple, Netflix, Knewton, Uber, and Spotify.


Version 2.3.1 of Luigi, the workflow engine, was released with a few bug fixes.

Apache Drill, the SQL engine, released version 1.8.0 this week. The new version adds a number of new features, improves the startup scripts, and resolves 20+ bugs.

Databricks has announced a new workflow API for their notebooks, which allows chaining operations defined in notebooks and tracking entire workflows. The introductory post includes more details of the features.

Hortonworks Data Platform 2.5 is now generally available. The version is based on Apache Hadoop 2.7.3 and includes new support for Zeppelin as well as updates to the majority of components. The Hortonworks blog has more details about the major features of the release.

Apache Ambari 2.4 was also recently released, and it includes a number of new features. The highlights are log search, role-based access control, improved operational insights, and customizable alerts.


Curated by Datadog



Apache Metron Cybersecurity Overview and Codelab: Meet the Experts! (Santa Clara) - Wednesday, September 7

NoSQL vs Hadoop Ecosystem (San Carlos) - Wednesday, September 7

Women in Big Data: Intermediate Data Science Class (Santa Clara) - Friday, September 9


New Data & More Efficient Analysis with Data Wrangling (Tempe) - Wednesday, September 7


Akka Streams & Bloom Filters + ML Pipeline for Text Classification (Broomfield) - Wednesday, September 7

Building a Streaming Systems of Record with MapR Streams, with Will Ochandarena (Boulder) - Thursday, September 8


SystemML and Apache Spark (Chicago) - Thursday, September 8


Back to School: Hadoop Ecosystem (Saint Louis) - Wednesday, September 7


Apache Kudu 0.10 and Spark SQL (Boston) - Tuesday, September 6


Tensorflow + Kubernetes + Hadoop + Spark (London) - Monday, September 5

Learn about Database DevOps and Hadoop (London) - Tuesday, September 6

Google Cloud Dataproc: Managed Hadoop & Spark on GCP (Reading) - Wednesday, September 7


Spark, Mesos & Automation (Frankfurt) - Thursday, September 8


From Legacy to Spark, What's New in Spark 2.0 (Tel Aviv-Yafo) - Tuesday, September 6


Stream Processing Using Spark Streaming & Kafka (Noida) - Friday, September 9


Apache Spark Meetup (Shanghai) - Saturday, September 10


Apache Spark Meetup (Auckland) - Tuesday, September 6