Data Eng Weekly

Hadoop Weekly Issue #120

10 May 2015

HBaseCon and Strata + Hadoop World London were both this week, and presentations and announcements make up much of the content in this week's newsletter. HBaseCon content includes presentations by speakers from Pinterest, Flipboard, Adobe, and Salesforce. In terms of announcements, there were many related to HBase: Google announced Cloud Bigtable (with HBase API compatibility), Cloudera Labs added Apache Phoenix, Qubole announced their HBase-as-a-Service offering, and there's a new version of OpenTSDB. In addition to HBase coverage, there's a new version of Apache Drill, an interesting post about debugging ZooKeeper failures, two articles on Apache Flink, and much more.


Apache Ambari contains "User Views" which provide tools and visualizations for a Hadoop cluster. This post on the Hortonworks blog discusses the Tez view, which provides an annotated data-flow DAG for understanding what is happening during a Tez job, and the Capacity Scheduler view for editing the queues and deploying changes (rather than mucking with XML). The post has more about using these tools (especially interesting is using the Tez view to find and fix performance issues). These views and a couple of others are available in Tech Preview.

The Hortonworks blog has the latest in a series of posts about rolling upgrades for Hadoop. This post describes rolling upgrades for MapReduce, which are possible because the MapReduce libraries have moved to the Distributed Cache rather than being pre-installed on nodes in the cluster.

PagerDuty has a very interesting post describing some low-level debugging of recent issues with Apache ZooKeeper. While the symptom is a (silent) ZooKeeper failure, the root cause is data corruption at the TCP layer, introduced by IPSec and the aesni-intel kernel module.

The Confluent blog has a guest post that talks about real-time stream processing with Apache Flink. The post goes into detail on the components and architecture of Apache Flink streaming (such as pipelining, replay, and state backup & restore). It also describes the Flink APIs, how to achieve high availability, how to scale up and down, and how Flink's support for batch and streaming fits with the Lambda and Kappa architectures.

The first of several posts from HBaseCon, this one describes how Flipboard is using HBase. Several Flipboard features, like user-generated magazines, likes, and comments, are powered by HBase. The slides describe their data models for magazines and the social graph; they store JSON-serialized values inside HBase cells and combine HBase with ElasticSearch for additional indexes. The post also has information on their deployment, which runs in AWS across 15 clusters and 250 region servers.
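
As a rough illustration of the JSON-in-cells idea (not Flipboard's actual schema), the sketch below serializes a hypothetical magazine record to UTF-8 JSON bytes, the form in which it could be stored as an HBase cell value, and decodes it back. The row key layout, the column name, and the record fields are all made up for the example, and a plain dict stands in for an HBase table so the snippet runs without a cluster.

```python
import json

# Hypothetical in-memory stand-in for an HBase table:
# row key (bytes) -> {column (bytes) -> cell value (bytes)}
table = {}


def encode_cell(record: dict) -> bytes:
    """Serialize a record to compact, key-sorted UTF-8 JSON bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def decode_cell(value: bytes) -> dict:
    """Decode a JSON cell value back into a dict."""
    return json.loads(value.decode("utf-8"))


def put(row_key: bytes, column: bytes, record: dict) -> None:
    """Store the JSON-encoded record under (row key, column)."""
    table.setdefault(row_key, {})[column] = encode_cell(record)


def get(row_key: bytes, column: bytes) -> dict:
    """Fetch and decode the record stored under (row key, column)."""
    return decode_cell(table[row_key][column])


# Assumed record shape and key layout, for illustration only.
magazine = {"title": "Data Engineering", "owner": "user42", "items": 3}
put(b"magazine:1001", b"d:meta", magazine)
print(get(b"magazine:1001", b"d:meta"))
```

Storing the whole record as one JSON value keeps the schema flexible, at the cost of rewriting the full cell on every update; secondary lookups would still need an external index such as the ElasticSearch setup the slides mention.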

This presentation describes HBase at Pinterest, which is deployed at pretty massive scale on AWS. There are several deployment suggestions, both specific to AWS (e.g. instance sizes, noisy neighbors) and independent of the environment (e.g. tuning JVM args such as GC and Linux file systems). The slides also describe best practices for monitoring, alerting, capacity planning, optimizing for availability (with some specifics to AWS), disaster recovery in AWS, and more.

HBase on YARN has been available for some time (via Apache Slider), but this is among the first instances I've heard of someone running HBase on Apache Mesos. The presentation describes how Adobe is running HBase via Docker containers on Mesos using the Marathon scheduler. There are details on how the Docker images are built and run as well as plans for future improvements (such as scheduling for data locality with HDFS).

This presentation on HBase performance and correctness tuning is the richest collection of best practices and important configuration items that I've seen for any Hadoop ecosystem project. It covers tuning HDFS, tuning HBase RegionServers/column families, sizing RegionServer instances, tuning HBase client settings, and tuning Linux.


An article on VentureBeat details estimates of Cloudera's revenue, profitability, and valuation recently completed by Manhattan Venture Partners, a private tech research firm. Among the highlights, the analysts predict $199 million in revenue in 2015 and profitability by 2018 at the earliest.

Apache ORC, which is a columnar storage format for Hadoop, has moved from within the Apache Hive project to be a top-level project of its own. The Hortonworks blog has more information on the project including a link to a post from a couple of months back by Facebook about some performance testing they've done with ORC.

Cloudera and Confluent have announced an initiative to build a suite of tests to "certify API and protocol compatibility between versions and distribution" of Kafka.

Standardizing APIs has been a big topic recently, especially as many Hadoop ecosystem projects hit version 1.0. This post highlights the work that was done to standardize the HBase API, and how Google's contribution now makes a lot of sense (given the Google Cloud Bigtable announcement; more below).

Apache Flink has been mentioned as an alternative to Apache Spark in several articles over the past few months. InfoWorld has a great summary of the differences between these two projects in areas like stream processing, memory management, and Python APIs (note that there is also some follow-up information in the comments).

Slides from many presentations as well as videos of keynotes from Strata + Hadoop World are available on the conference website.


Over 200 issues were resolved as part of Apache Drill 0.9, which was released this week. Key new features include authentication for Java/C++/JDBC/ODBC clients, impersonation, ownership chaining, extended JSON support, Avro support, and enhancements to Parquet (columnar storage format) and Calcite (SQL parser).

Pivotal has announced a new version of their distribution, Pivotal HD, which is aligned to the Open Data Platform. In addition, the new release adopts Apache Spark, Apache Ranger, and more.

Cask has announced version 3.0 of the Cask Data Application Platform (CDAP). The release contains a major new feature called Application Templates for implementing common Hadoop workloads like ETLs from various sources to HDFS and HBase. Other features are a new role-based web UI, enhanced metrics, and support for querying CDAP Table datasets via Hive.

Google has announced a beta version of Google Cloud Bigtable, which is a NoSQL-as-a-Service solution that's API-compatible with Apache HBase. Google's Bigtable is the precursor to HBase (HBase was built on the architecture described in the Bigtable paper), and Google is touting its maturity, speed, security, and cost.

Cloudera has announced that Apache Phoenix, the SQL engine for Apache HBase, is now part of Cloudera Labs.

Qubole announced a new HBase-as-a-Service offering that's powered by HBase 1.0.0 and Hadoop 2.6.0. The introductory blog post has more information about the offering, such as optimization (e.g. using ephemeral nodes for compactions), cluster management, and incremental backup/restore to S3.

Apache Lens (incubating), which provides a unified engine for querying data across multiple data stores including Hadoop, has released version 2.1.0-beta-incubating.

Hivemall, which is a machine learning library for Hive, has released a new stable version, 0.3.1. This release includes two big changes: the license has changed from LGPLv2 to Apache v2, and Hivemall can be used with Pig 0.15 or later.

OpenTSDB, the timeseries database built on HBase, released version 2.1.0 this week. The release includes a number of bug fixes.

Ivory is a data store for facts and features to be used by a machine learning pipeline. The data store is optimized for scans, as it's backed by files in HDFS or S3.


Curated by Datadog



Joint Seattle Spark and Graph Meetup Extravaganza (Seattle) - Wednesday, May 13


Querying Multiple Distributed Storage Systems with Apache Hive Robustly (Houston) - Tuesday, May 12

Using R and Hadoop for Advanced Analytics at Scale (Houston) - Tuesday, May 12


Special Joint Meetup (St. Louis) - Thursday, May 14


Spark ETL Techniques (Chicago) - Tuesday, May 12


Cask: Accelerating Hadoop Projects with the Cask Data Application Platform (Boston) - Thursday, May 14


Apache Spark (Montreal) - Thursday, May 14


Introduction to Apache Flink (Stockholm) - Monday, May 11


Hands-on Apache Spark! (Paris) - Wednesday, May 13


Hadoop Meetup on the Topic of Security (Prague) - Thursday, May 14


Cloudera Meet and Greet (Budapest) - Tuesday, May 12


Hands-on Labs: Working with Amazon Redshift, EMR & S3 (Tel Aviv-Yafo) - Tuesday, May 12

Apache Spark: How It's Being Used in Production (Tel Aviv-Yafo) - Tuesday, May 12


Big Data Meetup (Bangalore) - Friday, May 15

HDP Operations: Install and Manage with Apache Ambari (Bangalore) - Saturday, May 16

Deep Dive on Spark-SQL (Gurgaon) - Sunday, May 17


Spark Meetup (Shanghai) - Saturday, May 16