Data Eng Weekly

Hadoop Weekly Issue #137

06 September 2015

Likely due to the long weekend in the US, this week's issue is a bit lighter than usual. With that said, there are some great technical articles on performance optimization in big data systems, Apache Drill, Apache Kafka, Apache Flink, and Apache Hadoop YARN. Also, Flink and Impyla both released new versions this week, and there's some open-source news related to HAWQ and Spark for mainframes.


Often, it's best to pick the simplest tool that will get the job done. For the author of this post, Jenkins was the best solution for scheduling and running MapReduce (and other) jobs. The post also describes some of the limitations of a Jenkins-based system.

This post has an in-depth look at how several big data frameworks optimize performance on the JVM by minimizing serialization and garbage collection overhead. There's a great introduction to these problems, the history and evolution of serialization in the JVM ecosystem, and the individual approaches and details of the Flink, Spark, and HBase implementations.
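The common thread in those implementations is keeping data in a compact, schema-driven binary layout rather than relying on generic object serialization. As a rough, language-agnostic illustration of the idea (a Python toy, not the JVM code the article discusses), compare generic object serialization with fixed-width binary packing:

```python
import pickle
import struct

# A record with an int id and a float score.
record = (12345, 0.75)

# Generic object serialization (analogous to java.io.Serializable):
# type metadata is stored alongside the values.
generic = pickle.dumps(record)

# Fixed-width binary layout (in the spirit of Flink's managed memory
# segments or Spark's Tungsten row format): the schema is known up
# front, so only the raw bytes are stored.
compact = struct.pack("<qd", record[0], record[1])  # 8-byte int + 8-byte float

assert len(compact) == 16
assert len(generic) > len(compact)  # the compact form is smaller
assert struct.unpack("<qd", compact) == (12345, 0.75)
```

Beyond size, the fixed layout lets a runtime compare and sort records byte-by-byte without deserializing them, which is a large part of the garbage-collection savings the post describes.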

The MapR blog has an in-depth look at the architecture of Apache Drill. The guide describes the Drill daemon (drillbit), query execution/planning/optimization, Drill's pluggable architecture (which supports custom input sources like MongoDB), and several of Drill's optimizations.

The data Artisans blog has a post about the Kafka and Flink integration. It briefly describes the Kafka architecture, shows how to start a single-node test Kafka cluster, gives some examples of consuming/producing data from/to Kafka from Flink, and addresses several frequently asked questions (e.g. how does exactly-once work, and how does Flink handle backpressure?).
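On the backpressure question: Flink's approach relies on bounded buffers between operators, so a fast producer naturally blocks once downstream stops keeping up. The single-process Python sketch below is a toy analogy of that mechanism (not Flink's actual implementation), using a bounded queue between a "producer" and a "consumer" thread:

```python
import queue
import threading

buf = queue.Queue(maxsize=2)  # bounded buffer between two "operators"
produced, consumed = [], []

def producer():
    for i in range(6):
        buf.put(i)  # put() blocks while the buffer is full -> backpressure
        produced.append(i)
    buf.put(None)   # sentinel: end of stream

def consumer():
    while (item := buf.get()) is not None:
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

assert produced == list(range(6))
assert consumed == list(range(6))  # all records arrive, in order
```

The key point is that no explicit rate-limiting logic is needed: the bounded buffer propagates the slowdown upstream automatically.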

LinkedIn has published an in-depth piece about their usage of Kafka. The post covers the scale of their Kafka deployment (over 1 trillion messages per day) and several of the key focus areas for Kafka at LinkedIn (these include quotas, the new consumer, MirrorMaker improvements, security, cluster operations, and much much more). LinkedIn uses Kafka to power systems at such high scale that they experience a number of challenges that few other companies face.
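One of the focus areas mentioned above, quotas, throttles clients that exceed a configured byte-rate budget. The core idea can be sketched as a token bucket (a hypothetical illustration with made-up names, not Kafka's broker-side implementation):

```python
import time

class TokenBucket:
    """Toy byte-rate limiter: refills `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, nbytes: float) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over quota; a real broker delays the response instead

bucket = TokenBucket(rate=1024, capacity=4096)
assert bucket.try_consume(4096)      # an initial burst up to capacity succeeds
assert not bucket.try_consume(1024)  # immediately after, the budget is exhausted
```

In practice Kafka enforces quotas by delaying responses to over-quota clients rather than rejecting requests, but the accounting is the same shape.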

GetInData has a new "Big Data Weekly Quiz." The first quiz tests knowledge of information from last week's Hadoop Weekly.

The Cloudera blog has the first post in a multipart series on YARN. The post acts as an introduction, and it describes the cluster basics (i.e. ResourceManager, NodeManager, ApplicationMaster), YARN configuration, YARN resource definitions (vcores and memory), and an example YARN application lifecycle.
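YARN's resource model boils down to two dimensions per container request: memory and vcores. A minimal sketch of the fit check a scheduler performs when placing a container (illustrative only, not YARN's scheduler code):

```python
from dataclasses import dataclass

@dataclass
class Resource:
    memory_mb: int
    vcores: int

def fits(node_available: Resource, request: Resource) -> bool:
    # A container request fits only if BOTH dimensions are satisfied.
    return (request.memory_mb <= node_available.memory_mb
            and request.vcores <= node_available.vcores)

# Node capacities come from configuration such as
# yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores.
node = Resource(memory_mb=8192, vcores=8)
small = Resource(memory_mb=1024, vcores=1)
big = Resource(memory_mb=4096, vcores=16)  # more vcores than the node has

assert fits(node, small)
assert not fits(node, big)  # memory would fit, but vcores do not
```

The two-dimensional check is why a cluster can run out of one resource (say, vcores) while plenty of the other sits idle.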

The MapR blog has a tutorial for hooking up Spark streaming to Apache HBase. The implementation uses the HBase/MapReduce TableOutputFormat coupled with Spark's saveAsHadoopDataset. The example code is written in Scala, and there are instructions for running on the MapR Sandbox.


The Cloudera Impala team has started a development blog highlighting new features and other important news related to the project.

The full agenda for the upcoming Spark Summit Europe, which takes place in Amsterdam from October 27 through 29, has been posted. The Databricks blog has highlighted a number of talks/trainings and also includes a discount code for registration.

Hortonworks and NEC announced a partnership in which NEC will resell support services for the Hortonworks Data Platform in Asia.

Fortune has an analysis of the recent post by LinkedIn about their Kafka deployment. The article looks at what the industry should take away from LinkedIn's experience and scale. It notes that many internet companies are using Kafka, and that many more will use it as they scale up collection and analysis of sensor data and other high-frequency sources.

HAWQ, the SQL-on-Hadoop engine that evolved from Pivotal Greenplum, has been accepted into the Apache incubator. HAWQ supports ANSI SQL, and it supports data in the Parquet and Avro serialization formats on HDFS.


Cloudera Labs has added support for the Yahoo! Cloud Serving Benchmark. With this, it's pretty easy to run performance tests against an HBase cluster (the process is described in the blog post).

Syncsort has open-sourced a Spark to IBM mainframe connector. An article on Fortune notes that a number of organizations still make heavy use of mainframes, and this new connector will let them analyze data in new ways. The GitHub project has examples for getting started with the new connector.

Apache Flink 0.9.1 was released this week. The new release fixes 38 issues from the last major release, 0.9.0.

Syscol is a new Mesos framework for collecting machine metrics from the instances in a Mesos cluster and publishing them to a Kafka topic.

Impyla, the Python client for Impala and Hive, released version 0.11.0. The new release adds Hive compatibility via HiveServer2, including asynchronous query execution. The release also includes a number of bug fixes and improvements.

Qubole, the Hadoop-as-a-Service vendor, has added support for AWS IAM roles.


Curated by Datadog



Building a System for Machine and Event-Oriented Data, and Analytics with Rocana (Redwood City) - Wednesday, September 9


Large-Scale Data Processing with Spark: The Scala Killer App? (Baltimore) - Thursday, September 10

New York

How to Utilize a Super-Stack: Storing & Searching with Riak, Redis, Solr & Spark (New York) - Thursday, September 10


An Introduction to Mesosphere Infinity (London) - Monday, September 7


Using Apache Spark (Prague) - Wednesday, September 9


Ariel Moshkovitz: Kafka Bad-Time Stories (Netanya) - Thursday, September 10


Mumbai Spark Meetup 3Q2015 (Mumbai) - Saturday, September 12

Quarterly Large Scale Production Engineering Meetup (Bangalore) - Saturday, September 12

South Korea

Spark, Hive, Isilon (Seoul) - Wednesday, September 9