Data Eng Weekly

Hadoop Weekly Issue #140

04 October 2015

There were two major conferences this week—Strata + Hadoop World and Apache Big Data Europe. As a result, there were a number of announcements and new projects/products released this week. Also, conference organizers have published slides from many of the talks if you want to catch up on what you missed. Major highlights include Kudu (a new storage engine from Cloudera), an update from ODPi, and MapR-DB's new support for JSON. In addition to all the announcements, there are some interesting technical articles about managing multiple Hadoop clusters, deep learning on Hadoop, and the potential damage of an unhealthy node in a distributed system.


Etsy recently added a second Hadoop cluster to their internal infrastructure, and they needed a mechanism to distribute the load of ad hoc jobs while retaining the ability to pause job submission to one or both clusters. This post describes their solution, which was built with a custom "State Service" and Apache Oozie.
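
As a rough illustration of the idea (not Etsy's actual implementation), the routing decision can be sketched as a lookup of per-cluster state followed by a choice among the clusters that are currently accepting jobs. The cluster names, the `ClusterState` values, and the `pick_cluster` helper are all hypothetical.

```python
import random
from enum import Enum


class ClusterState(Enum):
    ENABLED = "enabled"  # accepting new ad hoc jobs
    PAUSED = "paused"    # job submission temporarily disabled


def pick_cluster(states, rng=random):
    """Return the name of one enabled cluster, or None if all are paused.

    `states` maps a (hypothetical) cluster name to its ClusterState,
    standing in for whatever a state service would report.
    """
    enabled = sorted(name for name, state in states.items()
                     if state is ClusterState.ENABLED)
    if not enabled:
        return None  # caller should wait and retry rather than submit
    return rng.choice(enabled)  # spread load uniformly across enabled clusters
```

A caller would check the result before submitting, for example `pick_cluster({"cluster-a": ClusterState.ENABLED, "cluster-b": ClusterState.PAUSED})`, and hold the job whenever the result is `None`.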

The Hadoop team at Yahoo has written about how they use Hadoop for deep learning. Jobs are scheduled via YARN on specialized GPU/Infiniband machines (via YARN's node labels), and they run the Caffe deep learning framework on top of Spark.

Twitter has written about how they have customized ViewFs to provide a single logical view of HDFS across clusters and data centers. The post also describes several clever tricks (such as a logical "local" namespace that simplifies CLI interaction) and how they achieve high availability for multi-datacenter clusters with Nfly (which presents a FileSystem that synchronously writes data across data centers).

The Databricks blog has a post on frequent pattern mining in Spark. Spark 1.5 adds support for generating association rules and for parallel sequential pattern mining.
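
To make "association rules" concrete, here is a minimal pure-Python sketch of how rules can be derived from frequent itemset counts. It does not use Spark's API; the `association_rules` helper and the sample counts are hypothetical, and the sketch only produces rules with a single-item consequent. The confidence of a rule X => Y is count(X ∪ Y) / count(X).

```python
def association_rules(freq_itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) tuples.

    `freq_itemsets` maps a frozenset of items to the number of
    transactions containing it. Only single-item consequents are
    generated in this sketch.
    """
    rules = []
    for itemset, count in freq_itemsets.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for item in itemset:
            antecedent = itemset - {item}
            antecedent_count = freq_itemsets.get(antecedent)
            if not antecedent_count:
                continue  # antecedent was not counted as frequent
            confidence = count / antecedent_count
            if confidence >= min_confidence:
                rules.append((antecedent, frozenset([item]), confidence))
    return rules
```

With counts of 4 for {a}, 2 for {b}, and 2 for {a, b}, a threshold of 0.8 keeps only the rule {b} => {a} (confidence 1.0), because {a} => {b} has confidence 0.5.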

Amazon EMR supports Presto for querying data in S3. This tutorial describes how to configure an EMR cluster with Presto and Airpal (a web-based query tool for Presto). To configure Airpal, the tutorial provides a CloudFormation template.

As many folks working with distributed systems know, a sick (but still running) node can cause a lot of trouble. This post summarizes some research papers that aim to quantify the effects of "limping" hardware in distributed systems. Hadoop, HBase, and ZooKeeper are among the systems covered, and the post discusses some approaches for remaining performant when hardware starts limping.


GetInData continues to publish a weekly quiz covering the content of each Hadoop Weekly newsletter. It's a good way to make sure you're extracting the important bits from articles.

Apache Big Data Europe took place this week in Budapest, and slides from many of the presentations have been posted. Several components of the Apache Hadoop ecosystem are covered, including HBase, Hadoop, Ignite, Phoenix, Lens, Cascading, Tez, HAWQ, Kafka, and Spark.

Videos from last week's Strange Loop conference have been posted online. Topics covered include Kafka, MapReduce, distributed system design, transactions, and stream processing.

Strata + Hadoop World was this week in NYC, and the presenter slides have been posted. There are a ton of slides covering many topics and industries.

The ODPi has announced an initial core specification and reference implementation. Additionally, they've announced an open governance model and that corporate membership in the initiative has doubled since the initial announcement in February.

Pivotal has recently open-sourced two projects by way of the Apache Incubator. HAWQ is their columnar SQL engine, and MADlib is a machine learning SQL library for HAWQ, Postgres, and Pivotal Greenplum.

Altiscale has announced that they've achieved PCI and HIPAA compliance for their Big Data-as-a-Service platform. The announcement contains more information about the certifications and security features of Altiscale's platform.

SiliconANGLE has an interview with Merv Adrian of Gartner. They discuss trends from the Strata + Hadoop World conference, barriers to adopting Hadoop, numbers from a recent Gartner survey, and more.


Gobblin, which is LinkedIn's open-source project for data ingestion to Hadoop, released version 0.5.0 this week. This is the first release with Apache Kafka integration, and LinkedIn has already started transitioning from Camus to Gobblin internally.

RecordService is a new open-source project from Cloudera. The system provides a level of abstraction that sits between the compute and data layers, which simplifies integration (e.g., a single API rather than separate input/output formats for HDFS, HBase, etc.) and provides fine-grained security enforcement (column/row-level). Cloudera plans to transition the project to the Apache Incubator.

Altiscale has released Altiscale Data Cloud 4.0, which adds major upgrades to core components in their Hadoop-as-a-Service product. Additionally, the new version adds simultaneous support of multiple versions of Spark.

MapR has added native support for JSON to MapR-DB. A post on the MapR blog provides examples of how the new Open JSON Application Interface works.

Hortonworks DataFlow, which is powered by Apache NiFi, is now generally available.

Cloudera has officially unveiled Kudu, a new Hadoop storage engine that's been in development for three years. The Cloudera Vision blog describes the goal of Kudu, namely to deliver high performance for both random and sequential reads/writes. The Developer blog describes implementation details of Kudu: it stores tables of structured data, which are chunked into Tablets, and it uses the Raft consensus algorithm to replicate those Tablets. Many more details are available in a white paper, and the code for Kudu is on GitHub. Kudu is in public beta.

Microsoft has announced general availability of Azure HDInsight for managed clusters running Linux. The announcement also describes the upcoming Azure Data Lake Store and the new Azure Data Lake Analytics.

Apache Spark 1.5.1 was released this week. It fixes a number of bugs in the 1.5.0 release.

A new version of Apache Kafka was released this week to fix two critical compression-related bugs in the previous release.


Curated by Datadog



October 2015 HadoopSF Meetup (San Francisco) - Tuesday, October 6

Deep Dive: Spark SQL + DataFrames + Data Sources API + Parquet + Cassandra Connector (San Francisco) - Tuesday, October 6

Flying Faster with Twitter Heron (Mountain View) - Tuesday, October 6

How Spark Beat Hadoop at 100TB Sort (Mountain View) - Wednesday, October 7

Hadoop & Big Data (Los Angeles) - Thursday, October 8

HBase Meetup at Salesforce (San Francisco) - Thursday, October 8


Special Event: Cloudera Sessions (Phoenix) - Thursday, October 8


Architecting Applications with Apache Hadoop (Ft. Collins) - Thursday, October 8


Real-Time Streaming with Storm and Kafka (Addison) - Monday, October 5


Riak, Redis, Apache Solr, and Spark: Deploying with Basho (Vienna) - Wednesday, October 7

New York

Continuous Streaming Analytics (New York) - Tuesday, October 6


Presto, an Open Source SQL Engine for Big Data (Boston) - Tuesday, October 6


Standardization of Hadoop and the Impact of the Introduction of Spark (Sao Paulo) - Tuesday, October 6

IRELAND

Big Data Case Studies: AIB, Recommenders at Scale, Data Catalogs (Dublin) - Tuesday, October 6


October Meetup in Mannheim (Mannheim) - Monday, October 5


First Spark Meetup (Pune) - Thursday, October 8

Real-time Analytics with Apache Spark (Hyderabad) - Saturday, October 10

Apache Spark Introduction and RDD Basics and Deep Dive (Bangalore) - Saturday, October 10