Data Eng Weekly

Hadoop Weekly Issue #172

30 May 2016

This week's issue is dominated by streaming—Twitter and Cloudera both introduced new stream processing frameworks, there's a post on streaming SQL in Apache Flink, DataTorrent posted about Apache Apex fault tolerance, there's coverage of a new streaming framework from the folks at Concord, and Apache Kafka 0.10 was released. In other news, there was quite a bit of activity at the Apache Incubator—Apache TinkerPop and Apache Zeppelin both graduated to top-level projects, and Tephra entered incubation. Rounding out the issue, there are posts on Apache Spark, Apache HBase, Apache Drill, Apache Ambari, and more.


The DataTorrent blog has a post about fault tolerance for Apache Apex when consuming data from or writing data to files. Apex operates on unbounded streams, so there are some subtle but important details to consider. When using HDFS for output, there are additional complications due to the HDFS lease mechanisms.
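
The core difficulty is making file output idempotent across restarts. As a generic illustration (not Apex's actual mechanism, which also has to contend with HDFS lease recovery), the temp-file-plus-atomic-rename pattern is a common building block for exactly-once file sinks:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path so readers never observe a partial file.

    Sketch of the temp-file-plus-rename pattern that file-based
    exactly-once sinks commonly build on; illustrative only, not
    Apex's implementation.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make the temp file durable first
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except Exception:
        os.unlink(tmp_path)
        raise
```

A crashed writer leaves at worst an orphaned temp file, never a truncated output file, so replaying the write after recovery is safe.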

The Databricks blog has an overview of the upcoming performance improvements in Spark 2.0 as part of the new Tungsten code generation engine. The post offers illustrative examples of why specialized generated code can be much faster than generic code: it avoids virtual function call overhead, keeps intermediate data in CPU registers, and enables loop unrolling. In addition to the Databricks post, The Morning Paper has an overview of the VLDB paper on which the implementation is based.
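
To make the distinction concrete, here's a toy Python sketch (the names are invented for illustration) contrasting a generic per-row, per-operator evaluator with a "generated" function that fuses the whole expression into one tight loop, which is the idea Tungsten applies at the JVM level:

```python
# Generic model: each operator is evaluated through dynamic dispatch,
# one virtual call per operator per row (the Volcano-style overhead
# that whole-stage code generation removes).
class Col:
    def __init__(self, index):
        self.index = index
    def eval(self, row):
        return row[self.index]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        return self.left.eval(row) + self.right.eval(row)

def interpret(rows, expr):
    return [expr.eval(row) for row in rows]

# "Generated" model: the whole expression collapses into one simple
# loop body that the runtime/compiler can optimize aggressively.
def compile_add_cols(i, j):
    def fused(rows):
        return [row[i] + row[j] for row in rows]
    return fused
```

Both paths compute the same result; the fused version simply does far less work per row, which is where the speedups in the post come from.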

StreamScope, which is Microsoft's stream processing system, is another one of the papers covered by The Morning Paper this week. The post pulls out some of the highlights—the throughput/cluster sizes, a look at the programming model (SQL), the time model, the delivery semantics/guarantees, and its use in production at Microsoft.

The Apache blog has a post from the team at HubSpot on tuning G1GC for Apache HBase. The article walks through the progression of changes that HubSpot tried and how each affected stability, 99th percentile performance, and overall time spent in stop-the-world garbage collection pauses. The team used lots of tricks and does a good job of explaining the intricacies of the GC algorithm. At the end of the post, there's a step-by-step guide to tuning G1GC for HBase.
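
As a rough illustration of the kind of settings involved (the values below are placeholders for illustration, not HubSpot's recommendations; tune against your own GC logs as the post describes), an HBase region server's G1GC flags might look like:

```shell
# Illustrative G1GC starting point for an HBase region server
# (hbase-env.sh). All sizes and percentages here are placeholders.
export HBASE_REGIONSERVER_OPTS="\
  -Xms32g -Xmx32g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=50 \
  -XX:G1HeapRegionSize=32m \
  -XX:InitiatingHeapOccupancyPercent=65 \
  -XX:+ParallelRefProcEnabled \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```

Fixing -Xms equal to -Xmx avoids heap resizing pauses, and the GC logging flags provide the data needed to iterate on the rest.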

LinkedIn has a post about some difficult-to-debug issues with Kafka offset management. The article details the symptoms of two so-called "offset rewind" events, how to detect these types of events with monitoring, and the underlying causes (and solutions) of both incidents.
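
On the monitoring side, detecting a rewind amounts to flagging any committed offset that moves backwards for a given group/topic/partition. A generic sketch (not LinkedIn's actual tooling) over polled offset samples:

```python
def detect_rewinds(offset_samples):
    """Flag 'offset rewind' events in a stream of committed offsets.

    offset_samples: iterable of (group, topic, partition, offset)
    tuples, as a monitor might poll them. A committed offset moving
    backwards is the symptom to alert on; this is an illustrative
    sketch, not LinkedIn's monitoring code.
    """
    last_seen = {}
    rewinds = []
    for group, topic, partition, offset in offset_samples:
        key = (group, topic, partition)
        prev = last_seen.get(key)
        if prev is not None and offset < prev:
            rewinds.append((key, prev, offset))
        last_seen[key] = offset
    return rewinds
```

In practice an alert would also threshold on the size of the jump, since small decreases can be an artifact of sampling rather than a real rewind.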

The Databricks blog has the third and final part of a series on using Apache Spark for Genome Variant Analysis. The post describes the needed preparation (converting files to Parquet and reading data into Spark RDDs), how to load genotype data, and running k-means clustering on the resulting genotypes to predict geographic population from the genotype features.
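
For intuition about the clustering step, here's a minimal pure-Python k-means, a toy stand-in for the Spark MLlib implementation the post uses (there, the feature vectors are genotypes and the clusters map to geographic populations):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over tuples of numeric features.

    Toy illustration only; MLlib's version is distributed and uses
    smarter initialization.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive random initialization
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each center to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers
```

With well-separated groups of points, the returned centers land near the group means after a handful of iterations.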

Much of the batch big data ecosystem has progressed from custom APIs back to SQL, so it'll be interesting to see whether the same happens to stream processing frameworks. In this post, the Apache Flink team has written about their plans for supporting streaming SQL. Flink already has a Table API, and they're using Apache Calcite to add SQL support. For features like windowing, they plan to use Calcite's streaming SQL extensions. The initial SQL work is targeted for the 1.1.0 release, with much richer support planned for 1.2.0.
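
To ground the windowing discussion, here's a small Python sketch of what a tumbling-window GROUP BY computes over a stream. The SQL in the comment uses Calcite's TUMBLE syntax; Flink's final streaming SQL syntax was still being settled at the time:

```python
from collections import defaultdict

# What a streaming GROUP BY over a tumbling window computes, e.g. in
# Calcite's streaming SQL dialect:
#   SELECT user, COUNT(*) FROM events
#   GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE), user
# Each event falls into exactly one fixed-size, non-overlapping window.
def tumbling_counts(events, window_ms):
    """events: iterable of (timestamp_ms, user) pairs.
    Returns {(window_start_ms, user): count}."""
    counts = defaultdict(int)
    for ts, user in events:
        window_start = ts - ts % window_ms
        counts[(window_start, user)] += 1
    return dict(counts)
```

The real engine additionally has to decide when a window can be emitted (watermarks, late data), which is where most of the streaming-SQL design work lies.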

This post gives an introduction to the XML plugin for Apache Drill. The support doesn't yet ship with Drill, but it's relatively easy to compile the jar and configure XML support.

The Hortonworks blog has a brief introduction to the architecture of the Ambari metrics system, which recently added support for Grafana as a front-end for dashboards. The system uses Apache Phoenix and Apache HBase as the system of record, so it scales horizontally.

This tutorial describes how to use Spark SQL on Amazon EMR with Hue and Apache Zeppelin to run SQL queries across tab-delimited data stored in S3. The post finishes off by showing how to save data from Spark to DynamoDB.

The Heroku team has written about their experience with the latest version of Apache Kafka—the introduction of the timestamp field (an additional 8 bytes per message) led to some counterintuitive performance changes.
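
Back-of-the-envelope, the relative overhead depends heavily on message size, which is one reason the effects can be surprising. This sketch ignores keys, protocol framing, and compression, all of which complicate the real picture:

```python
def timestamp_overhead(payload_bytes, extra_bytes=8):
    """Fractional size increase from a fixed per-message field,
    such as Kafka 0.10's 8-byte timestamp. Rough illustration only:
    real messages also carry keys and framing, and compression
    changes the arithmetic further.
    """
    return extra_bytes / payload_bytes

# Small messages pay proportionally more: 100-byte payloads grow ~8%,
# while 10 KB payloads grow well under 0.1%.
```

For small-message workloads that extra 8% of bytes per message can shift batch sizes and throughput noticeably, while large-message workloads barely notice it.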


The O'Reilly Data Show Podcast has an interview with Michael Armbrust from Databricks about structured streaming in Spark 2.0. The website has an article with select quotes from some of the topics—Spark SQL, the goals of structured streaming, guarantees for end-to-end pipelines, and applying Spark's machine learning algorithms for online processing.

Two big data projects graduated from the Apache incubator this week—Apache TinkerPop and Apache Zeppelin. TinkerPop is a graph computing framework, and Zeppelin is a web-based notebook for data analytics.

Tephra, the transaction engine for HBase, has entered the Apache Incubator. Tephra was originally built by the team at Cask, and it's in use by Apache Phoenix in addition to other systems.

TechRepublic has an article about Concord, a stream processing framework written in C++ that aims to fill a high-performance niche in the streaming market.


Apache Avro 1.8.1 was released this week. It contains over 20 bug fixes and improvements.

Confluent has released a Kafka Python client built on librdkafka.

Apache Kafka 0.10 was released with the new Kafka Streams, support for rack awareness and timestamps in messages, improvements for SASL and Kafka Connect, and much more.

Confluent has announced Confluent Platform 3.0, which is based on Apache Kafka 0.10. In addition to the core Kafka features, Confluent Platform has a commercial component that adds a Kafka Connect configuration tool and end-to-end stream monitoring.

Apache Kylin, the big data OLAP engine, released version 1.5.2. For a patch-level release, 1.5.2 has quite a few new features, improvements, and bug fixes, including support for CDH 5.7 and MapR.

Twitter has open-sourced their stream processing system, Heron. Heron is Twitter's replacement for Apache Storm, and has a focus on performance, debugging, and developer productivity.

Envelope is a new project from Cloudera Labs that provides a configuration-file-based streaming ETL process. Built on Spark Streaming, Envelope currently has connectors for Kafka and Kudu.


Curated by Datadog



Monitoring Big Data Streaming Applications & Apache Apex Operations (San Jose) - Wednesday, June 1

Apache Spark 201: Staying Sane in Production Environments (San Francisco) - Wednesday, June 1


Taking Spark to the Clouds and Incremental Updates in ML Models (Portland) - Thursday, June 2


What Is All the Hype about w/ Apache Spark? (Colorado Springs) - Wednesday, June 1


Bootiful Data Integration with Josh Long (Olathe) - Wednesday, June 1


Overview of Open Source Fast Data Platforms and Future Plans (Reston) - Thursday, June 2


Apache NiFi Hackathon and Office Hours (Columbia) - Thursday, June 2

New York

Building Effective Near-Real-Time Analytics with Spark Streaming & Kudu (New York) - Tuesday, May 31

Curb Your Insecurity: Tips for a Secure Cluster with Spark Too - Thursday, June 2


28th Big Data London Meetup (London) - Tuesday, May 31

Databricks Team in London (London) - Wednesday, June 1

HUG UK Meetup @ Strata + Hadoop World 2016 (London) - Wednesday, June 1

Strata + Hadoop World London Meetup (London) - Wednesday, June 1

Flink London Meetup: Strata + Hadoop Special (London) - Thursday, June 2


Big Data, No Fluff: Let’s Get Started with Hadoop #7 (Oslo) - Thursday, June 2


Spark Workshop (Bucharest) - Friday, June 3


Apache Apex: Big Data In Motion (Pune) - Saturday, June 4


Spark Meetup 4 (Hangzhou) - Sunday, June 5