Data Eng Weekly

Hadoop Weekly Issue #179

17 July 2016

Folks seem to be taking advantage of the summer lull to publish lots of technical articles. This week is full of great posts covering stream processing, Apache Spark performance, Google BigQuery, HBase debugging, and Storm benchmarking. There is also a recap of Hadoop Summit and news on several upcoming conferences (DataEngConf, Strata+Hadoop World, and Apache Big Data Europe).


Datanami has an article about Concord, a relatively new stream processing system built in C++ that runs on Mesos. The article mentions a recent benchmark (see second link; and always evaluate a system for your own use case), which shows impressive performance numbers.

On the topic of performance, this post has five tips for improving the performance of Apache Spark jobs. Topics covered include caching and checkpointing RDDs, skewed keys, DataFrames, and Datasets.

This post describes how to get started with Apache MiNiFi (although the post says the release is forthcoming, it has since been released), which is a new lightweight version of Apache NiFi that's meant for running close to the source of data.

While there are plenty of open-source options for stream processing, many companies have built their own proprietary systems. Facebook has built several real-time systems—each with its own set of trade-offs. They've written about their systems, and more interestingly, have described design decisions for stream processing frameworks. The design decision section provides an interesting rubric for evaluating all stream processing systems (not just Facebook's).

The latest Jepsen post evaluates VoltDB 6.3. VoltDB implements strict serializability (there's a good overview of what this means in the post), although the Jepsen testing finds bugs in version 6.3. The VoltDB team has already put out a new version with several fixes. As always, if you're interested in distributed systems, this Jepsen post is a great read.

The MapR blog has a tutorial that demonstrates using Apache Spark's implementation of Random Forests. It demonstrates classifying the risk of bank loan applications using the spark.ml APIs (the machine learning library for DataFrames).
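In Spark, spark.ml's RandomForestClassifier handles training and prediction; conceptually, though, a random forest classifies by majority vote over many decision trees. A toy pure-Python sketch of the voting step (not Spark code — the tiny "trees" and feature names here are made up for illustration):

```python
from collections import Counter

# Toy stand-ins for trained decision trees: each maps a feature dict to a label.
# In Spark, spark.ml's RandomForestClassifier would train these trees from data.
def tree_a(x): return "risky" if x["balance"] < 100 else "safe"
def tree_b(x): return "risky" if x["duration"] > 24 else "safe"
def tree_c(x): return "risky" if x["balance"] < 500 and x["duration"] > 12 else "safe"

forest = [tree_a, tree_b, tree_c]

def predict(forest, x):
    """Classify x by majority vote over the trees in the forest."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

applicant = {"balance": 50, "duration": 36}
print(predict(forest, applicant))  # all three trees vote "risky"
```

The ensemble's strength comes from training each real tree on a random subset of rows and features, so their individual errors tend to cancel out in the vote.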

Scaling an Apache Kafka Streams application is as easy as adding or removing servers. Although simple from an ops perspective, this post describes all that the framework is doing under the covers to preserve/replicate local state and rebalance across the new set of nodes.
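A few Kafka Streams settings govern that behavior; a minimal sketch (the property names are real Kafka Streams configs, the values are illustrative):

```properties
# Logical name shared by all instances of the application; instances with the
# same application.id form one group and divide the input partitions among themselves.
application.id=my-streams-app
bootstrap.servers=broker1:9092

# Keep a warm replica of each local state store on another instance, so a
# rebalance after adding or removing a server can fail over quickly.
num.standby.replicas=1

# Directory for the local state stores that are backed up via changelog topics.
state.dir=/var/lib/kafka-streams
```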

Ravelin has a story about their use of Google BigQuery. Use cases include distributed tracing, API logging (including headers and status codes), performance analysis, and ad-hoc investigations. The post also describes the types of use cases that fit well with BigQuery ("map-reduce-esque" querying of immutable data with a schema) and those that don't (data with lots of updates/deletes, large ORDER BY statements).

Feedly has written about some recent debugging efforts to resolve a latency issue with their Apache HBase cluster. Feedly handles an impressive amount of traffic, and the post includes details about the infrastructure needed to power it (a 31-node HBase cluster with 240TB of storage). The post describes how they recognized, isolated, and resolved a networking-related performance issue.

The Qubole blog describes three techniques for speeding up big data queries—sorted tables, narrow tables, and denormalized tables. The post then describes how the recently open-sourced Quark framework provides the mechanisms for managing some of these optimized structures (notably materialized views).

As a relatively young project, Apache Spark is evolving at a fast pace. It seems like only a few months ago that the DataFrame API was the shiny new thing, and the upcoming Spark 2.0 will be adding another API—Datasets. This post describes these two APIs as well as Spark's original RDDs, the trade-offs of each, and when each is most appropriate.

The Yahoo Hadoop tumblr has a post on Dynamic Overcommit—a new feature of Apache Hadoop YARN that improves utilization by assigning additional tasks to worker nodes even when a node's resources have been fully allocated. The article describes the core concepts of the implementation, the efficiency gains seen in testing at Yahoo, and next steps for the feature.

At the inaugural Kafka Hackathon, a team from SVDS built a system that streamed EEG data through Kafka to OpenTSDB (using Kafka Streams and Kafka Connect) for visualization by Grafana. The code is open-source, and this post describes how to compile and run the project.

The Hortonworks blog has a benchmark comparing the performance of an earlier Apache Storm release to that of 1.0.1. There have been a number of changes between these releases (notably a switch from ZeroMQ to Netty for messaging), and the changes seem to have paid off both in terms of throughput (speedups of up to 30x) and latency (reduced by up to 242x).


In addition to videos, the Hadoop Summit website now has slides from all of the presentations.

This article has an overview of themes from the recent Hadoop Summit, including Hadoop and Beyond, scalability of HDFS and YARN, Hadoop in the cloud, SQL, and streaming. There are summaries of and links to talks from Twitter on multi-DC HDFS, Hortonworks on multi-tenancy, Expedia on fighting small files, Microsoft on YARN federation, and more.

Qubole and WANdisco have created the Cloudera Migration Program to enable active data replication from on-premises clusters to Amazon S3. This enables both migrations and hybrid deployments.

Forbes has an article about a company's migration to Apache Kafka to enable real-time processing (and replace their batch system). The article also includes an interview with Confluent CTO Neha Narkhede about how Confluent is trying to solve the challenges of making data useful in large companies.

DataEngConf NYC 2016 has been announced. The call for papers, which ends on August 15th, is now open.

The CFP for Apache Big Data Europe is also now open. The conference takes place November 14-16 in Seville, Spain, and the CFP closes on September 9th.

The schedule for the upcoming Strata + Hadoop World, which takes place September 26-29 in New York, has been announced.

This post has a question and answer with committers to and users of Apache Beam (20 folks in total). Questions include "What's your favorite feature of Beam?" and "How would you recommend a new person get started with Beam?"

The Hortonworks blog also has a summary of Hadoop Summit. Their post includes summaries of several keynotes.


kafkastreams-cep is a new library that implements complex event processing via the Kafka Streams APIs. It's still a work in progress but has some working examples.

Microsoft has announced the general availability of Azure SQL Data Warehouse, which is their "elastic, parallel, columnar data warehouse as a service." Built on SQL Server 2016, Azure SQL Data Warehouse offers advanced security and access control, elasticity across both data storage and compute, support for structured and unstructured data, and more.

Version 0.7.0-incubating of Apache Atlas, the governance framework for the Hadoop ecosystem, has been released. The release resolves over 200 tickets.

Apache NiFi 0.7.0 was released. This release includes a number of new extensions (Slack, SNMP, MQTT, Lumberjack, DynamoDB, and more), stability fixes, and a new "NiFi in Depth" document.


Curated by Datadog



The Flink–Apache Bigtop Integration (Palo Alto) - Tuesday, July 19

Stream Computing: The Engineer’s Perspective (San Francisco) - Tuesday, July 19

#OCBigData Meetup #18 (Irvine) - Wednesday, July 20

July Kafka Meetup (Foster City) - Thursday, July 21

Apache Spark Streaming with Apache NiFi and Apache Kafka (San Francisco) - Thursday, July 21

Apache Kudu (Incubating): New Apache Hadoop Storage for Fast Analytics on Fast Data (Palo Alto) - Thursday, July 21


Databricks Community Edition w/ Ilya Reznik (Lehi) - Thursday, July 21


CDC to Hadoop & Migrating Analytics Stack from SAS (Dallas) - Thursday, July 21


Lunch with MapR: Apache Spark, with or without Hadoop! (Kansas City) - Wednesday, July 20


The Truth about Running SQL on Hadoop (Chicago) - Thursday, July 21


Feature Selection w/ Spark: Bridging the Gap Between Data Engineering & Science (Atlanta) - Wednesday, July 20

New York

Scalable Machine Learning with H2O (New York) - Monday, July 18

Scala & Data! (New York) - Thursday, July 21


Graph Processing with Spark GraphX (Cambridge) - Thursday, July 21


Running Spark on Hive or Hive on Spark?! (London) - Wednesday, July 20

Spark, Zeppelin and Cassandra (London) - Wednesday, July 20

July HUGUK Meetup (London) - Thursday, July 21


Apache Airflow Introduction and Best Practices (Amsterdam) - Tuesday, July 19


Apache Apex: Stream Processing Architecture and Applications (Munich) - Tuesday, July 19

SQL on Hadoop and Spark (Nuremberg) - Thursday, July 21


Analytics & Groovy Analytics with Spark (Bucharest) - Thursday, July 21


Big Data Analytics and Beyond with Apache Spark (Bangalore) - Friday, July 22


"Stream Processing with Apache Flink" w/ Flink PMC Robert Metzger (Taipei) - Tuesday, July 19