Data Eng Weekly

Hadoop Weekly Issue #156

07 February 2016

Going along with the theme of this week's super bowl, there are two matchup articles this week: comparing the programming models of Dataflow/Beam & Spark Streaming and a performance comparison of Flink and Storm. The rest of the content this week covers a diverse set of topics and technologies—Spark, Kafka, HDFS erasure codecs, geospatial analytics on Hadoop, and more. It's one of the largest issues in a while, so there should be something for everyone.


The ATS blog has a post describing how they added low-latency querying capabilities to large swaths of historic data by introducing Google BigQuery. They provided a GUI for users to supply query parameters in order to service requests without any SQL.

The GameChanger tech blog has had a number of great articles on Kafka recently. In "Experimenting With Kafka," the author describes experiments for stopping a Kafka server while writing to it, tweaking min.insync.replicas and request.required.acks, and experiments with write throughput. In "Scaling with Kafka," the author details some tips for running Kafka in a dynamic cloud environment (assigning broker ids and using EBS volumes to replace a node without requiring a full data sync). The blog has two other post about Kafka and Docker/AWS if you're itching for me.

This post gives an overview of various tools for doing geospatial analysis on Hadoop. While many tools have nice out of the box support, they aren't necessarily the most performant when comparing two datasets. To speed up these types of queries, the post describes spatial binning. The SpatialSpark library includes support for similar performance improving strategies, but its API and data format support aren't the most thorough. The post concludes with some general recommendations for which libraries to use in a few different situations.

The mapWithState function is a new addition in Spark 1.6 to improve performance of stateful stream processing, which is used for things like session analysis. As a replacement for updateStateByKey, the new mechanism drastically improves latency, supports 10x more keys, and provides a clean API for updating state.

Yahoo recently announced a new benchmark for stream processing frameworks. In a follow up to Yahoo's introductory blog post and preliminary benchmarks, the DataArtisans blog has a look at Storm and Flink on faster hardware/networking and the benefits of some of Flinks new streaming features. With this setup and some application changes, Flink achieves 15 million events/sec on a 10 node cluster. The post also describes changes to remove the external key-value store dependency and instead serve lookups directly from Flink.

The Cloudera blog has an in-depth update on HDFS erasure codings, in which they present a number of performance analysis results. In summary, when using the Intel ISA-L Library's hardware acceleration, CPU is no longer the bottleneck (even with 10Gb ethernet) and overall throughput of HDFS reads and writes (vs. 3x replication) is often increased as well.

KDNuggets has an introduction to Apache Spark's RDD, DataFrame, and Dataset APIs. There are code snippets for each, and the article aims to help decide which API to use in various situations.

While not directly related to Hadoop, this tutorial is tangentially related since there was a TensorFlow and Hadoop post last week. With that in mind, this post describes setting up TensorFlow to use the GPU on an Amazon EC2 instance.

The MapR blog has a guest post on Apache Flink. The post describes Flink's stream processing features by way of an example program that processes NYC taxi ride data (a public dataset from the NYC Taxi and Limousine Commission). In addition to processing with Flink, the post shows how to write data to Elasticsearch and visualize it in Kibana. The source code for the project is available on Github.

JavaWorld has a post on Apache Phoenix, the SQL engine for Apache HBase. The article gives a short introduction to HBase (and its data model), details getting started with Phoenix and its command-line interface, and demonstrates using Phoenix from Java via java.sql.

The Confluent blog has a new edition of their Log Compaction newsletter, which covers updates from the Kafka community. This issue highlights five new Kafka Improvement Proposals (KIP) and has links to several recent articles about Kafka and stream processing.

The Google Cloud Platform blog has a post comparing the Dataflow/Beam programming model to that of Spark Streaming. Dataflow's primitives allow for a concise programs even for fairly complicated tasks, whereas Spark Streaming code is complicated by microbatching. Code and descriptions for several tasks, including computing hourly team scores, computing a leaderboard (per-hour and per-team), and user session analysis are included in the post.

To continue on the theme of stream processing, Hortonworks has a post on windowing computations with Apache Storm. The post (be sure to follow the link at the end) covers topics like watermarks, delivery guarantees, state management, checkpointing, and more. There are several code examples and links to full-blown example topologies.

The AWS big data blog has a post on customizing spark-submit flags to control resources and best optimize cluster utilization. There's a lot of background on how the Spark driver works, memory management in Spark, and dynamic resource allocation.


Apache Beam (previously known as the Dataflow Java SDK) as been accepted into the Apache incubator.

This post introduces the notion of data drift and formalizes the three types of drift common in data systems. They are: structural drift (i.e. schema evolution), semantic drift (where the meaning of a value changes), and infrastructure drift (the evolution of producer/consumer/etc systems).

On the question of Apache Spark replacing Hadoop, it's clear that Spark is gaining wide adoption (vendors are including it in their distributions). It's less clear if Hadoop is going away—in the data center, HDFS is needed. But many of the Spark/Hadoop-as-a-Service vendors are betting on blob stores like Amazon S3. In addition to exploring this friction, this article looks at the reasons that Spark is gaining adoption (usability and features) as well as the future of the project.

This post has a list of ten trends in big data expected for 2016. There are a few expected ones (e.g. Spark, cloud computing) and also a few that aren't as well-known (e.g. AMPLab's reboot, hardware topics)

Hortonworks marked the recent ten year anniversary of Hadoop by releasing a Docker image of "hadoop version 0." There are some tips for trying it out and several notes about the status of the project at that point in time (e.g. you have to restart the namenode every couple of days to compact the edit log).

In another post marking Hadoop's ten year anniversary, the Hortonworks blog has an overview of recent usability improvements in Hadoop by way of six Hadoop labs. Many of the labs used to require a command-line but now have nicer interfaces, and there are a number of new ways to visualize and interpret data.


Amazon EMR 4.3.0 includes updated versions of Apache Hadoop, Apache Spark, Ganglia, and Presto. The announcement has more details and instructions for how to use the new version.

Version 0.1 of ruby-kafka, a new Kafka client written in Ruby, is available. The release focuses on the producer API, and the developer is soliciting feedback around the design of the library to help harden the implementation.

Not really a release announcement, but Netflix has announced that they're retiring Astyanax, their Java client for Apache Cassandra. In the post, they acknowledge the usability and performance advantage of the CQL protocol.

Version 0.6.0 of the Yahoo! Cloud Serving Benchmark was recently released. Cloudera Labs has bundled this version, and a post describes the new features and updates across several recent versions. The highlights include support for Aerospike, Kudu, Cassandra version 2, & Google Cloud Data Store, improved percentile latency reporting, and other usability improvements.

Apache Avro 1.8.0 was released this week. The release contains a large number of bug fixes, new implementations for .NET 3.5 & JavaScript, and new Date/Time data types. There full release notes contain details on over 100 resolved issues.

Version 0.11.1 of Apache Tajo, the big data warehouse system, was released. This release contains a number of bug fixes, several performance improvements, and a few new minor features.


Curated by Datadog ( )



An Evening with Amr Awadallah, Co-Founder & CTO of Cloudera (Los Angeles) - Monday, February 8

Pivoting Spring XD to Spring Cloud Data Flow (San Francisco) - Tuesday, February 9

Learn More about Dataswarm and Ibis (San Francisco) - Wednesday, February 10

Introduction to Apache Tajo: A Big Data Warehouse on Hadoop (San Francisco) - Wednesday, February 10


Exploratory Analysis of Large Data with R and Spark (Seattle) - Wednesday, February 10


Spark and Tachyon: Big Data Utah Meeting @ IHC (Salt Lake City) - Wednesday, February 10


A Deeper Look into SparkSQL, DataFrames, and Data Sources with IBM and Galvanize (Boulder) - Monday, February 8

An Evening with Apache Spark (Denver) - Wednesday, February 10


Protecting Hadoop Data at Rest with HDFS Encryption (Austin) - Monday, February 8


Interactive Visualization + Leveraging Spark in a Hybrid OLTP/OLAP (Reston) - Wednesday, February 10

District of Columbia

Data Pipelines for Data-Driven Apps (Washington) - Monday, February 8

New York

Hadoop & Spark Panel Discussion Series (New York) - Tuesday, February 9


Open Analytics Talks (Boston) - Thursday, February 11


YARN by Default (Barcelona) - Tuesday, February 9


Data NoBlaBla: Data Munging with Spark, Part II (Toulouse) - Thursday, February 11


Parallel Scikit-Learn on YARN and Real Secure Hadoop (Amsterdam) - Thursday, February 11


Spark Streaming: Adventures by AppsFlyer (Herzelia) - Wednesday, February 10


Second IMC Pune Meetup (Pune) - Wednesday, February 10

Introduction to Dataset API (Bangalore) - Saturday, February 13


Spark ML Pipeline and Demo of Databricks Platform + More (Sydney) - Wednesday, February 10

SQL and Machine Learning on Hadoop Using HAWQ (Melbourne) - Thursday, February 11