Data Eng Weekly

Hadoop Weekly Issue #121

17 May 2015

It seems like every week there is at least one exciting new release. This week, Apache HBase 1.1.0 is atop the list, but there were also new releases of Apache Sqoop, Apache Curator, and Apache Knox. For technical deep-dives, there are posts on these projects as well as coverage of Apache Flink, Apache HDFS, Apache YARN, and Apache Spark. In news, Hortonworks reported quarterly earnings this week, and there's a recap of the recent Hadoop Bug Bash.


The Cloudera blog has a post describing a new feature in Apache Hadoop 2.6.0 and CDH 5.4.0: hot swapping of DataNode drives. To perform the swap without restarting the DataNode daemon, the system makes use of another new feature, live reconfiguration via the Reconfigurable framework. The post describes how to make these changes via the command line and with Cloudera Manager.

The Apache Flink blog has a post on how Flink manages memory to minimize overhead and GC pressure. Specifically, Flink stores objects in a collection of 32KB MemorySegments, uses custom serializers (which have special support for primitives, arrays, Tuples, case classes, and POJOs), makes use of fixed-length sort keys for efficient sorting, and operates directly on binary data whenever possible. In addition, the post shows how this strategy performs for sorting data in comparison to an on-heap array and Kryo serialization.
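To make the fixed-length sort key idea concrete, here is a minimal Python sketch. It is not Flink's implementation: the record layout, the 4-byte `KEY_PREFIX_LEN`, and all function names are assumptions made for illustration. Records are serialized into one contiguous buffer, and sorting compares fixed-length key prefixes as raw bytes, reading the full key only when two prefixes tie.

```python
import functools
import struct

KEY_PREFIX_LEN = 4  # hypothetical fixed sort-key length, in bytes


def serialize(records):
    """Pack (name, count) records into one buffer; return it with each record's offset."""
    buf = bytearray()
    offsets = []
    for name, count in records:
        encoded = name.encode("utf-8")
        offsets.append(len(buf))
        # Assumed layout: 2-byte key length, key bytes, 4-byte signed integer payload.
        buf += struct.pack(">H", len(encoded)) + encoded + struct.pack(">i", count)
    return bytes(buf), offsets


def key_bytes(buf, offset):
    """Return the raw key bytes of the record starting at offset."""
    (length,) = struct.unpack_from(">H", buf, offset)
    return buf[offset + 2 : offset + 2 + length]


def normalized_key(buf, offset):
    """Fixed-length key prefix, zero-padded so every sort key has the same size."""
    return key_bytes(buf, offset)[:KEY_PREFIX_LEN].ljust(KEY_PREFIX_LEN, b"\x00")


def deserialize(buf, offset):
    """Turn the bytes at offset back into a (name, count) tuple."""
    key = key_bytes(buf, offset)
    (count,) = struct.unpack_from(">i", buf, offset + 2 + len(key))
    return key.decode("utf-8"), count


def sort_offsets(buf, offsets):
    """Sort record offsets by key without deserializing the records."""
    def compare(a, b):
        ka, kb = normalized_key(buf, a), normalized_key(buf, b)
        if ka != kb:
            # Common case: the fixed-length prefixes already decide the order.
            return -1 if ka < kb else 1
        # Prefix tie: fall back to comparing the full key bytes.
        fa, fb = key_bytes(buf, a), key_bytes(buf, b)
        return (fa > fb) - (fa < fb)

    return sorted(offsets, key=functools.cmp_to_key(compare))
```

Only the offsets move during the sort; the serialized records stay where they are, and `deserialize` is called only when a record is actually needed as an object.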

This tutorial on the MapR blog shows how to use PySpark and MLlib to target and classify customers in a fictitious streaming audio platform. The post shows how to wrangle data into the proper format and then use MLlib's logistic regression implementations to train and evaluate models.

This post describes some of the confusion created by the terms "consistency" and "availability" when it comes to distributed systems. In particular, these terms have very strong meanings in the CAP theorem, and those semantics often don't match what you want in a production system. The post has a clear overview of the terms, and it includes ZooKeeper as a case study. It's a good read for anyone working with distributed systems.

The LA Big Data Users Group recently hosted a talk on Apache Ignite (incubating), which is a distributed framework for in-memory data management. Among Ignite's many features is a drop-in Hadoop accelerator which will run existing MapReduce jobs in-memory. Both the slides and the video are up on SlideShare.

A post on the Scalding blog has an update on running Scalding/Cascading atop Apache Tez. The post describes a benchmark job that is 20 Cascading Flows (420 steps in Hadoop, 20 DAGs in Tez) across 10k lines of Scala. In the two test datasets, speedups are 2.25x and ~18x. Since this test about a month ago, the developers have found and fixed a number of bugs, and the post stops just short of saying the integration is production-ready.

Sqoop2 has recently gained support for using PostgreSQL as a repository (in addition to an embedded Derby DB). A post on the blog has more details on the Sqoop2 Repository API, the automated testing to validate the new implementation, and some of the trickier implementation details.

This post describes several features of the upcoming Apache Slider 0.80-incubating release: Docker-based deployment, zero-package cluster definition, packaging improvements for dependencies/plugins, and improvements to placement strategies. For the latter, there is a description of the improvements as they pertain to long-lived services like Kafka and HBase, YARN labels, placement escalation, and more. There's also a discussion of some features planned for the future.

The Cloudera blog has a guest post on lessons learned working with Spark. The lessons cover three areas: memory management, data movement, and speed. There are a number of good tips, such as using broadcast variables to do efficient joins between large and small RDDs.
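The broadcast-join tip can be illustrated with a plain-Python sketch of the underlying map-side hash join. This is not Spark code: partitions are modeled as plain lists, the function name is hypothetical, and the small dataset is assumed to have unique keys. In PySpark, the lookup table would be shipped to executors with `SparkContext.broadcast` and read through the broadcast variable's `.value`.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Inner-join each partition of a large dataset against an in-memory small table.

    large_partitions: iterable of lists of (key, value) pairs (stand-in for RDD partitions).
    small_table: iterable of (key, value) pairs with unique keys (assumption).
    """
    # Built once; in Spark this dict is the thing that would be broadcast.
    lookup = dict(small_table)
    for partition in large_partitions:
        # Each partition is joined locally, so the large side is never shuffled.
        yield [
            (key, (value, lookup[key]))
            for key, value in partition
            if key in lookup
        ]


if __name__ == "__main__":
    plays = [[(1, "track-a"), (2, "track-b")], [(3, "track-c"), (1, "track-d")]]
    users = [(1, "alice"), (3, "carol")]
    for joined in broadcast_hash_join(plays, users):
        print(joined)
```

The design point is that only the small side is copied to every worker, while the large side stays in place.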

This post on the Hortonworks blog describes YARN's support for scheduling based on virtual core (vcore) resources (in addition to memory). The scheduler calculation becomes trickier with multiple resources, which is why the CapacityScheduler added the DominantResourceCalculator. In addition to detailing how the new calculator works, the post describes what the expected impacts of using the DominantResourceCalculator are and how to configure YARN to use it.
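As a rough illustration of comparing allocations across two resource types, here is a minimal Python sketch of the dominant-share idea. It is a simplification under stated assumptions, not the Hadoop code: the `Resource` type and function names are hypothetical, and only memory and vcores are modeled. Each allocation is reduced to the largest fraction of the cluster it consumes in any one resource, and allocations are compared on that fraction.

```python
from typing import NamedTuple


class Resource(NamedTuple):
    memory_mb: int
    vcores: int


def shares(used: Resource, cluster: Resource) -> dict:
    """Fraction of the cluster consumed, per resource type."""
    return {
        "memory": used.memory_mb / cluster.memory_mb,
        "vcores": used.vcores / cluster.vcores,
    }


def dominant_resource(used: Resource, cluster: Resource) -> str:
    """Name of the resource in which this allocation takes the largest share."""
    per_type = shares(used, cluster)
    return max(per_type, key=per_type.get)


def dominant_share(used: Resource, cluster: Resource) -> float:
    """The largest per-resource share; this is the value allocations are ranked by."""
    return max(shares(used, cluster).values())


def compare(a: Resource, b: Resource, cluster: Resource) -> int:
    """Return -1, 0, or 1 depending on whether a's dominant share is below, equal to, or above b's."""
    sa, sb = dominant_share(a, cluster), dominant_share(b, cluster)
    return (sa > sb) - (sa < sb)
```

With a memory-only calculation, a CPU-heavy allocation can look small even when it uses most of the cluster's cores; ranking by dominant share avoids that.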

The latest release of Sqoop2 supports both simple authorization and authorization via Apache Sentry (incubating). The blog has a post describing how to configure Sqoop2 with role-based access controls using the default and Sentry-backed authorization handlers.

The Apache blog has two posts describing improvements in the latest release of Apache HBase (more details below). The first post describes two improvements to the Scan API: RPC chunking (which improves handling of larger rows) and scanner heartbeat messages (for when a scanner only infrequently returns rows). The second post describes request throttling, which is a new QoS setting in the 1.1.0 release. After enabling the setting, throttles can be set on the user, table, or namespace level.


There are several upcoming conferences in the next few months. Hadoop Summit is June 9-11 in San Jose, Spark Summit is June 15-17 in San Francisco (see link below for a promo code), MesosCon is August 20-21 in Seattle, and Flink Forward is October 12-13 in Berlin (call for abstracts is open now).

Apache Geode is a new incubator project derived from the Pivotal GemFire core codebase. Geode is a distributed, in-memory database.

SCALE has an interview with Kafka architect and Confluent CEO Jay Kreps. The article covers a lot of topics, including the creation of Kafka at LinkedIn, scaling the data platform at LinkedIn, the role of open-source, and Confluent.

Hortonworks reported quarterly earnings this week. Revenue is up 167% year-over-year with a net loss of $0.77/share, both of which beat analyst estimates.

The Altiscale blog has a recap of the Apache Hadoop Global Bug Bash. The event saw contributions from folks in several time zones and resulted in over 100 issues resolved. There are some preliminary plans for another bug bash this fall.


Apache Knox Gateway 0.6.0 was recently released. Among the new features are REST APIs for Storm, caching for LDAP authentication, SSL mutual authentication, and improved support for load balancers. The Hortonworks blog has more on these features.

Pentaho Labs has announced support for Apache Spark. The integration supports unifying existing Spark jobs with the Pentaho platform and using the Spark SQL engine to power the Pentaho front-end. Pentaho is approaching Spark with a discerning eye, particularly when it comes to multi-tenancy. Datanami has an interview with Pentaho's CTO in which he describes some of their concerns.

Apache Curator, which is a Java library for Apache ZooKeeper, released version 2.8.0. Curator makes working with ZooKeeper much easier by implementing a number of best practices and common patterns. The release has a number of bug fixes and improvements.

Apache Sqoop released version 1.4.6 and version 1.99.6 (from the Sqoop2 branch). Both versions include a number of bug fixes and new features (e.g. Parquet support in 1.4.6 and Apache Sentry integration for 1.99.6).

Cloudera Enterprise 5.4.1 was released. The point release contains fixes for HDFS, YARN, MapReduce, HBase, Hive, and more. There are also improvements to Cloudera Manager and Cloudera Navigator.

Apache HBase 1.1.0 was released. This new version has a number of bug fixes and improvements. In addition to the features described in the posts above, the new version has an async RPC client, improved compaction controls, per-column family flush, support for writing the WAL to SSD, and support for using memcached for the HBase block cache.

Hermes is a new project providing a message broker API atop Apache Kafka. It provides an HTTP API for clients, a UI to simplify common operations, and Docker images for quickstart.


Curated by Datadog



Revisiting the MapReduce Paradigm: An R-Specific View (Berkeley) - Tuesday, May 19

Spark Streaming and GraphX at Netflix (Los Gatos) - Tuesday, May 19

Spark Monitoring (Sunnyvale) - Wednesday, May 20


Spark 2: Random Forests at Scale (Portland) - Wednesday, May 20


Intro to Apache Ignite & Semi-Supervised Learning (Denver) - Tuesday, May 19

Options & Capabilities When Deploying R Analytics on Hadoop (Denver) - Wednesday, May 20


Learn about Cloud Elephants: HaaS (Dallas) - Wednesday, May 20


ETL Pipelines with Spark (Chicago) - Wednesday, May 20


Cloudera Product Roadmap and a Special Talk on Spark! (Southfield) - Wednesday, May 20


Doug Cutting at the CHUG (Mayfield Village) - Monday, May 18


Hadoop Ecosystem and Spark (Alpharetta) - Tuesday, May 19


DC Spark Mini-Summit and 1-Year Meetup Celebration (Arlington) - Tuesday, May 19

New York

NiFi and Kafka (New York) - Tuesday, May 19

Storm: A Big Data Tool for Your Small Data Problems (New York) - Wednesday, May 20


First Meeting: Introduction to Apache Spark (Montevideo) - Tuesday, May 19


Special Event with MapR & Ted Dunning (London) - Wednesday, May 20


A Night of Cassandra and Spark at ING (Amsterdam) - Wednesday, May 20


Marcel Kornacker, Impala Tech Lead (Milano) - Tuesday, May 19


First HBase IL Meeting (Tel Aviv-Yafo) - Tuesday, May 19
