Data Eng Weekly

Hadoop Weekly Issue #133

09 August 2015

This week's theme is streaming—there are quite a few dense and important articles on the topic. The O'Reilly Radar piece "Streaming 101" and the overview of consistency in stream processing from the data Artisans blog are both must-reads. If you're into podcasts, be sure to check out the episodes of Software Engineering Daily covering several ecosystem projects. And rounding out the newsletter, Hortonworks announced Q2 earnings that beat Wall Street estimates.


This article is a follow-up to a recent post which demonstrated that faster networks (e.g. 10G) can improve performance (counter to a recent paper that looked at Spark and GraphX). This time, the authors compare the performance characteristics of PageRank in their timely dataflow implementation with GraphX's in order to explain some of the findings. They also show, with some confidence, that even with additional compute trade-offs their timely dataflow implementation on a 1G network can't match its 10G performance.
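The PageRank computation at the heart of these benchmarks is easy to state in a few lines. Below is a minimal single-machine sketch in plain Python over a made-up graph, just to show what is being measured; the posts' implementations run this distributed over much larger graphs, and this toy version ignores dangling nodes for brevity.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=20):
    """edges: list of (src, dst) pairs; returns one rank per node.
    Nodes with no out-edges (dangling nodes) are ignored for brevity."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1
    ranks = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        contrib = [0.0] * num_nodes
        for src, dst in edges:
            # Each node splits its rank evenly among its out-edges.
            contrib[dst] += ranks[src] / out_degree[src]
        ranks = [(1 - damping) / num_nodes + damping * c for c in contrib]
    return ranks

# Tiny 3-node cycle: by symmetry, every node ends up with equal rank.
ranks = pagerank([(0, 1), (1, 2), (2, 0)], 3)
```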

This tutorial shows how to use Spark for collaborative filtering. In it, you learn how to use Spark MLlib's Alternating Least Squares implementation and the resulting MatrixFactorizationModel to build a model for movie ratings from the MovieLens dataset.
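The idea behind ALS is to factor the (sparse) user-item rating matrix into user and item factors by alternately solving least-squares problems with one side held fixed. Below is a rank-1, pure-Python toy of that alternation (no regularization, made-up ratings); it is a sketch of the algorithm, not of MLlib's distributed API.

```python
def als_rank1(ratings, num_users, num_items, iterations=20):
    """ratings: dict {(user, item): value}. Returns (u, v) with R ≈ u[i] * v[j]."""
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors v; each user's factor has a closed-form solution.
        for i in range(num_users):
            num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
            den = sum(v[j] ** 2 for (ui, j), _ in ratings.items() if ui == i)
            if den:
                u[i] = num / den
        # Fix the user factors u; solve for each item's factor.
        for j in range(num_items):
            num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
            den = sum(u[i] ** 2 for (i, ij), _ in ratings.items() if ij == j)
            if den:
                v[j] = num / den
    return u, v

# Ratings drawn from a true rank-1 matrix, so the factorization can be exact.
ratings = {(0, 0): 3.0, (0, 1): 4.0, (1, 0): 6.0, (1, 1): 8.0}
u, v = als_rank1(ratings, 2, 2)
```

The predictions u[i] * v[j] recover the observed ratings; MLlib's version handles higher ranks, regularization, and sparse distributed data.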

In addition to the various stream processing systems in the Hadoop ecosystem, there are several outside of it. This post describes using Akka actors to replace a MapReduce batch processing system powering a Data Management Platform. It describes both the batch system (which uses HBase and Aerospike) and the new Akka-based one.

The team from Santander UK has written about their near real-time data system which powers their "Spendlytics" app (which provides trend and aggregation data for an account). The system is built on Flume, Kafka, and HBase. The major functionality is implemented as a Flume interceptor, which uses a local RocksDB store to annotate the data stored in Kafka. To round things out, a Spark job generates the RocksDB stores from historical data in HDFS.

The data Artisans blog has a post describing how Apache Flink achieves high-throughput, low-latency, and exactly-once delivery for stream processing. For background, the post describes how Apache Storm, Trident, Spark Streaming, and Google Cloud Dataflow implement fault tolerance and the trade-offs of each. The post goes on to describe Flink's distributed snapshots which implement a variation of the Chandy-Lamport algorithm, shows the results of several performance (throughput and latency) experiments, and describes some fault-tolerance experiments that used a YARN chaos monkey to introduce failures.
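To make the barrier mechanism behind those snapshots concrete, here is a toy single-process sketch (not Flink's API, and much simplified): a two-input operator snapshots its state only after it has seen the checkpoint barrier on every input, buffering records that arrive early on channels whose barrier has already passed, so the snapshot is a consistent cut.

```python
BARRIER = "BARRIER"

def process(events, num_channels=2):
    """events: (channel, record) pairs in arrival order; state sums records.
    Returns (snapshot, final_state)."""
    state = 0
    barriers_seen = set()
    buffered = []      # records held back on channels whose barrier already arrived
    snapshot = None
    for channel, record in events:
        if record == BARRIER:
            barriers_seen.add(channel)
            if len(barriers_seen) == num_channels and snapshot is None:
                snapshot = state           # consistent cut across both inputs
                for r in buffered:         # now release the held-back records
                    state += r
                buffered = []
        elif channel in barriers_seen and snapshot is None:
            buffered.append(record)        # arrived after this channel's barrier
        else:
            state += record
    return snapshot, state

# Channel 0's barrier arrives first; the 5 that follows it is excluded from
# the snapshot but included in the final state.
events = [(0, 1), (1, 1), (0, BARRIER), (0, 5),
          (1, 1), (1, BARRIER), (1, 2)]
snapshot, final = process(events)
```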

If the details of Flink's streaming performance and other features seem interesting, you might want to check out Flink's Storm Compatibility project (currently in beta). With it, you can execute a Storm Topology in Flink using Storm Spouts and Bolts.

The MapR blog has a post describing how to use Spark for document classification. In the article, a model is trained on the first chapters of "To Kill a Mockingbird" and "Go Set a Watchman" using Spark's built-in analyzers and classifiers. The post walks through building features from the text and training/evaluating naïve Bayes, Random Forests, and Gradient Boosted Trees models.
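The core of the naïve Bayes approach fits in a few lines. Below is a from-scratch multinomial naïve Bayes on bag-of-words features with add-one smoothing, the same idea the post applies via MLlib; the two toy "documents" are made up for illustration, not drawn from the novels.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, text). Returns a classify(text) -> label function."""
    word_counts = defaultdict(Counter)   # per-label word frequencies
    label_counts = Counter()
    vocab = set()
    for label, text in docs:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    total_docs = sum(label_counts.values())

    def classify(text):
        best, best_score = None, float("-inf")
        for label in label_counts:
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(label_counts[label] / total_docs)
            denom = sum(word_counts[label].values()) + len(vocab)
            for w in text.lower().split():
                score += math.log((word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

    return classify

classify = train([
    ("spark", "rdd dataframe executor shuffle"),
    ("hbase", "region regionserver hfile memstore"),
])
```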

The O'Reilly Radar blog has the first post in a two-part series on stream processing, in which background and terminology are discussed. The post gives a concise definition for stream processing, discusses correctness, introduces the notions of event and processing time, and discusses several data processing patterns. This is a great read for anyone who's working with a stream processing system.
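The event-time vs. processing-time distinction the post introduces is easy to see with a toy example: the same records, bucketed into fixed 10-second windows, yield different counts depending on which timestamp you window by. The records below are made up, and one of them arrives "late".

```python
def window_counts(records, key, size=10):
    """records: dicts with 'event_ts' and 'processing_ts' (seconds).
    Buckets records by the chosen timestamp into fixed windows of `size` seconds."""
    counts = {}
    for r in records:
        w = (r[key] // size) * size   # window start
        counts[w] = counts.get(w, 0) + 1
    return counts

records = [
    {"event_ts": 12, "processing_ts": 14},
    {"event_ts": 18, "processing_ts": 31},  # late arrival: lands in a later processing window
    {"event_ts": 25, "processing_ts": 33},
]
by_event = window_counts(records, "event_ts")
by_processing = window_counts(records, "processing_ts")
```

A system windowing by processing time puts the late record in the wrong bucket relative to when its event actually happened, which is exactly the correctness issue the post digs into.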

Roughly speaking, the Unix philosophy is "do one thing well and expect the output of every program to be input to another program." Using tools that implement this philosophy, it's easy to build complex scripts by piping data across processes. The same idea can be applied to distributed systems, particularly when Kafka and Samza provide the pipes and programs. This post explores the Unix philosophy for data systems in depth.
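As a miniature illustration of that composition idea (in plain Python rather than Kafka and Samza), each stage below does one narrow thing and hands its output to the next, just as processes do over a pipe; with Kafka/Samza the "pipe" would be a topic and each stage a job.

```python
def read(lines):                 # source stage: like `cat`
    for line in lines:
        yield line

def grep(pattern, stream):       # filter stage: like `grep`
    for line in stream:
        if pattern in line:
            yield line

def count(stream):               # terminal stage: like `wc -l`
    return sum(1 for _ in stream)

log = ["GET /index", "POST /login", "GET /about"]
# Equivalent of: cat log | grep GET | wc -l
hits = count(grep("GET", read(log)))
```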

The MapR blog has two posts on HBase. The first details HBase's architecture, covering all of the major topics like Regions, RegionServers, the HMaster, the META table, the write pipeline, HFiles, and much more. The second article describes how to design an HBase schema that takes advantage of this architecture to perform well across various types of access patterns.
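Two row-key tricks of the kind the schema post covers can be sketched in a few lines (the key layout below is illustrative, not taken from the article): reversing the timestamp makes scans return newest events first, and a salt prefix spreads monotonically increasing keys across regions to avoid hot-spotting.

```python
MAX_TS = 10**13  # fixed upper bound on epoch milliseconds, for reversal

def row_key(user_id, ts_millis, num_salt_buckets=8):
    # Salt spreads writes across regions. Note: Python's hash() varies per
    # process; a real schema would use a stable hash (e.g. md5) instead.
    salt = hash(user_id) % num_salt_buckets
    reversed_ts = MAX_TS - ts_millis     # newer events sort first lexicographically
    return f"{salt:02d}|{user_id}|{reversed_ts:013d}"

k_new = row_key("user42", 1_700_000_000_000)
k_old = row_key("user42", 1_600_000_000_000)
```

Because both keys share a salt and user prefix, a prefix scan for one user returns its rows newest-first.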

Voidbox is a system for running Docker on YARN used internally at Hulu for automated testing, running applications, and deploying complex workflows. Voidbox has some features not found in YARN's experimental docker support, such as a DAG programming model, multiple run modes, and configurable container-level fault tolerance. The post describes these features (and more) in-depth.

Zulily has recently shifted their data platform from a standalone Hadoop cluster to a Hadoop-based deployment in the cloud. By storing data in Google Cloud Storage, they can take advantage of massive storage space, transient clusters, and Google's Dataflow and BigQuery. The post describes the advantages of and various pieces of this architecture.

The Software Engineering Daily podcast has done a number of interviews this week about several companies and components in the Hadoop ecosystem. Companies and topics covered include Cloudera, Apache Spark, Hadoop operations, Apache Kafka, Apache ZooKeeper, the Hortonworks Data Platform, and Presto.


Hortonworks and EMC announced a broadened relationship in which EMC will resell the Hortonworks Data Platform. The two companies previously worked together to ensure robust support for the Isilon file system for Hadoop.

InfoWorld has published an interview with Cloudera's Charles Zedlewski about the definition of the Hadoop platform. The theme is that the platform will evolve over time to contain the best components for data management (the migration from MapReduce to Tez and Spark is a good example). The post also compares Hadoop to the Linux kernel noting that the kernel has added new components (such as the XFS file system) over time, too.

Hortonworks announced Q2 earnings this week. They beat Wall Street estimates of $23.3 million in revenue by reporting $30.7 million (a jump of 154%). They also revised estimates up for the rest of the year. CMSWire has a post about Hortonworks' growth in which the author observes that Hortonworks could be the fastest public company to reach $100m in revenue.


A new version of the Databricks hosted Spark service, Databricks 2.0, adds several useful and highly sought features. These include access control for shared clusters, notebook version control (including GitHub integration), and R support.

Apache Mahout released two new versions this week, version 0.10.2 and 0.11.0. The 0.10.2 release introduces new features and bug fixes (over 25 of them, including several new APIs and operators). Version 0.11.0 adds Spark 1.3 support, all the features of the 0.10.2 release, and a few additional fixes/features.

Cloudera announced a new release of Cloudera Enterprise and their distribution of Apache Kafka. Cloudera Enterprise 5.3.6 includes bug fixes for HDFS, HBase, and Cloudera Manager, and the Kafka distribution includes six bug fixes.


Curated by Datadog



Real-time Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL (Mountain View) - Monday, August 10

Building Ticketmaster's Lambda Architecture: Real-time Machine Learning at Scale (Los Angeles) - Thursday, August 13


Michael Drob: Solr & Cloudera (Houston) - Tuesday, August 11


Using Tessera With Hadoop and Spark for Big Data Analysis! (Grand Rapids) - Wednesday, August 12

New Jersey

Kafka: Ingestion and Processing Pipeline (Hamilton Township) - Tuesday, August 11

Spark Streaming (Princeton) - Thursday, August 13


Hadoop Overview (Vienna) - Thursday, August 13