Data Eng Weekly

Hadoop Weekly Issue #41

27 October 2013

On the heels of last week’s numerous releases, Hortonworks released the Hadoop 2.x-based version of their distribution, HDP 2.0, this week. There were also a number of announcements from other companies in the lead-up to this week’s Hadoop World/StrataConf in New York. I’m looking forward to lots of great presentations, discussions, and announcements at the conference, and to tons of content for next week’s issue.


In YARN, an Application Master can enumerate resources to be distributed to the containers that run user code. This provides a mechanism to distribute runtime dependencies (e.g. jars, shared libraries), which is missing from other resource scheduling frameworks like Apache Mesos. A post on the Hortonworks blog covers so-called “LocalResources,” including how they’re defined, their lifecycle, and the various flavors (file/archive/pattern and public/private/application). As folks ramp up on YARN, this is going to be an important component to understand (similar to MapReduce’s DistributedCache).

Prashant Kommireddi has a post describing getting started with Parquet, the columnar storage format for Hadoop. The article covers the basics of using the format, such as reading and writing Parquet data with Apache Pig, benchmarking on a small dataset, data sizes vs. plain text, and projection pushdown. It also covers some more advanced details, such as schema management, summary files, and Hadoop compatibility. This post is a great resource if you’re interested in getting started with Parquet.
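For readers new to columnar formats, here is a minimal Python sketch of the idea behind projection pushdown. It is not Parquet's implementation or API: the `ColumnarTable` class, its `scan` method, and the sample field names are all hypothetical, and the sketch assumes every row has the same fields. It only shows that when values are stored per column, a query that names a subset of columns never has to touch the others.

```python
class ColumnarTable:
    """Toy in-memory table that stores each column as its own list.

    Hypothetical illustration only; assumes every row has the same fields.
    """

    def __init__(self, rows):
        self.columns = {}
        for row in rows:
            for name, value in row.items():
                self.columns.setdefault(name, []).append(value)
        # Records which columns the most recent scan actually read.
        self.columns_read = []

    def scan(self, projection):
        """Return rows containing only the projected columns."""
        self.columns_read = list(projection)
        selected = [self.columns[name] for name in projection]
        return [dict(zip(projection, values)) for values in zip(*selected)]


table = ColumnarTable([
    {"user": "a", "url": "/x", "bytes": 10},
    {"user": "b", "url": "/y", "bytes": 20},
])
rows = table.scan(["user", "bytes"])  # the "url" column is never read
```

In a row-oriented text file, the same query would have to read and parse every field of every line before discarding the unwanted ones.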

Apache HBase 0.96.0 was released last week, and there have been a number of blog posts with details on the many changes and improvements in the release. The best summary I’ve found comes directly from the Apache HBase Blog. The post covers the breadth of major changes, from scaling to stability to operational improvements. It also does a deep dive into some new features such as improved mean time to recover (MTTR), cross-version compatibility and upgradeability, support for namespaces, the new region balancer, and the new Cell API. There’s also coverage of the incompatible changes and the upgrade process.

Adam Laiacano of Tumblr gave a talk on digital signal processing in Hadoop at the NY Machine Learning Meetup. Adam works as a Data Scientist and Engineer at Tumblr, and he worked on signal detection systems before that. His talk marries those two concentrations, and he gives a good overview of digital signal processing for those not familiar with it. The Hadoop portion of the talk focuses on Scalding and applying its matrix library for digital signal processing. g33ktalk has a video and the slides.

The Hortonworks blog has a summary of the new features and improvements in Apache Pig 0.12. Those include the ASSERT operator, streaming UDFs, the rewritten AvroStorage, the IN operator, the CASE expression, and more. Very useful for Pig users considering an upgrade.

The Cloudera blog has a post describing how HBase uses ZooKeeper to store state and for coordination. It covers the znodes used for main functionality such as RegionServer registration, master registration, and shutdown. It also covers how HBase uses ZooKeeper for security, replication, and online snapshots.

Mark Miller of Cloudera posted slides on the architecture of Cloudera Search, which was built by integrating Solr and Hadoop. The slides cover implementing the Lucene Index and Transaction Log interfaces atop HDFS, Solr replication on HDFS, MapReduce index building, Flume integration, HBase integration, Morphlines, and more.

MapReduce v1 had a feature to reuse JVMs across tasks from the same job, but it was rarely used in practice due to bugs related to shared state. Apache Tez is hoping to overcome these issues and enable JVM reuse to speed up execution by doing things like reusing an object registry (useful for joins) and amortizing JVM startup cost. The blog post has a lot more details about the implementation including scheduling and compatibility.

Elasticsearch is getting richer support for the Hadoop ecosystem, including the ability to index data from Apache Pig. This post covers using Apache Pig, the Natural Language Toolkit (a Python package), and the Elasticsearch PigStorage function to process data into bigrams, which are then loaded into Elasticsearch.
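As a rough illustration of the bigram step, here is a minimal Python sketch. It uses only the standard library rather than the Natural Language Toolkit, and it does not reproduce the post's Pig script or the Elasticsearch loading step. The function names, the tokenization rule, and the output field names (`bigram`, `count`) are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs in their original order."""
    return list(zip(tokens, tokens[1:]))


def bigram_documents(text):
    """Count bigrams and shape them as dictionaries ready for indexing."""
    counts = Counter(bigrams(tokenize(text)))
    return [
        {"bigram": " ".join(pair), "count": n}
        for pair, n in sorted(counts.items())
    ]


docs = bigram_documents("Pig loves Hadoop and Hadoop loves Pig")
```

Each resulting dictionary is the kind of small record that a storage function could write out as one search document.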

John Russell has written a book on Cloudera Impala. Russell writes the documentation for Impala, but he comes from a non-Hadoop background. In his book, he attempts to take a different approach to documenting Impala that is aimed at a broad audience rather than early adopters. In a blog post on the O’Reilly site, he introduces the book, and there’s a link to download a free copy of the eBook.


Sqrrl has raised $5.2M in Series A financing. Sqrrl develops Sqrrl Enterprise, which is an Apache Accumulo and Hadoop-based data storage and processing platform. It’s focused on security via Accumulo’s cell-based authorization. Based in Cambridge, MA, Sqrrl has customers in government, financial services, healthcare, and more. Alongside the Series A news, Sqrrl Enterprise 1.2 was released.

Hortonworks, SAS, and Teradata have partnered to introduce “Analytics Advantage Program with Hadoop.” The partnership adds Hortonworks to the SAS-Teradata partnership, which was offering “Analytic Advantage.” As best I can tell, this is bundling SAS with the Teradata Appliance for Hadoop, which is powered by Hortonworks’ distribution.

MicroStrategy announced the MicroStrategy Analytics Platform, which is an update to their Enterprise data offering. As part of this release, they’ve added support for a number of Hadoop distributions: Intel’s Distribution for Apache Hadoop, Hortonworks Data Platform 1.3, and Pivotal HAWQ (from PivotalHD).


On the heels of the Apache Hadoop, HBase, Hive, and Pig releases, Hortonworks has announced the General Availability (GA) release of the Hortonworks Data Platform (HDP) 2.0. The release includes a number of recent Hadoop-ecosystem releases, including Apache Hive 0.12, Apache HBase 0.96.0, Apache Pig 0.12, and Apache Ambari 1.4.1. The post has a full list of software components, as well as some more background on the distribution.

GigaOm recaps several announcements this week in the lead-up to Hadoop World. Coverage includes releases and partnership announcements by Savvis, Virtustream, Splice Machine, Pivotal, Skytree, 0xdata, and Platfora. Those announcements cover everything from Hadoop in the cloud to BI tools.

Phoenix, the SQL-over-HBase framework, released version 2.1 this week. The release contains a number of new features such as Row Value Constructors and a MapReduce-based CSV Bulk Loader. The biggest new feature, though, is secondary indexing of a Phoenix table. The indexing supports multiple columns and is used automatically by the query optimizer. It has two flavors: a server-side system for mutable data and a client-side index for immutable, append-only use cases. The blogspot post has full details, and the Apache blog post has a tutorial for getting started with Phoenix.
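To see why mutable and append-only data call for different index maintenance, consider the toy Python sketch below. It is a general illustration of secondary indexing, not Phoenix's design or API: the `IndexedTable` class and its methods are hypothetical. The point is that an update to an existing row must also remove the stale index entry, while an append-only workload only ever adds entries.

```python
from collections import defaultdict

_MISSING = object()  # sentinel meaning "row did not exist before"


class IndexedTable:
    """Toy key-value table with one secondary index on the stored value.

    Hypothetical illustration only; not how Phoenix implements indexing.
    """

    def __init__(self):
        self.rows = {}                 # primary key -> indexed column value
        self.index = defaultdict(set)  # column value -> set of primary keys

    def upsert(self, key, value):
        old = self.rows.get(key, _MISSING)
        if old is not _MISSING and old != value:
            # Mutable case: drop the stale index entry for the old value.
            self.index[old].discard(key)
            if not self.index[old]:
                del self.index[old]
        self.rows[key] = value
        self.index[value].add(key)

    def lookup(self, value):
        """Return the primary keys whose indexed column equals value."""
        return sorted(self.index.get(value, ()))


table = IndexedTable()
table.upsert("r1", "NY")
table.upsert("r2", "NY")
table.upsert("r1", "MA")  # update: "r1" must leave the "NY" index entry
```

If rows are never updated, the cleanup branch in `upsert` never runs, which is one reason an append-only index can be kept simpler.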

The Hortonworks blog has a post on Apache Ambari 1.4.1. The main feature of this release is full support for Hadoop 2 including HA NameNode, YARN, and MapReduce on YARN. In addition, the new version adds support for Kerberos on Hadoop 2, SSL enabled Hadoop daemons, web authentication for Hadoop daemons, and support for JDK 7.

rbhive is a Ruby gem for executing Hive queries via the Thrift interface from HiveServer or HiveServer2. It released version 0.5.0, which includes support for Hive 0.12.

Hivemall, the collection of machine learning algorithms for Hive, released version 0.1. This is the first release since the project was announced, and it includes improved Hive compatibility, many bug fixes, and new classifiers (Confidence Weighted, AROW, Soft Confidence Weighted).

Version 0.8.1 of the Cloudera Development Kit was released. It includes a change to the Morphlines Library to make the query and xslt commands compatible with woodstox-3.2.7.


Curated by Mortar Data

Monday, October 28

Strata NY Fall 2013 Conference Meetup (New York, NY)

Strata + Hadoop World NYC 2013 (New York, NY)

Druid: An Open Source Real-time Analytical Data Store (New York, NY)

Strata/Hadoop World HBase Meetup (New York, NY)

Strata/Hadoop World NYC HUG (New York, NY)

Tuesday, October 29

Apache Flume Meetup @ Strata + Hadoop World 2013 (New York, NY)

Getting Started: Installing Spark/Shark on EC2 & Executing Jobs and Queries (Boston, MA)

Strata/Hadoop World YARN Meetup (New York, NY)

Pig 0.12 + Pig on Tez (@ Strata + Hadoop World) (New York, NY)

Wednesday, October 30

Sentry Meetup at Strata + Hadoop World 2013 (New York, NY)

Cassandra Disk Drive Performance Benchmarks at Gnip and Cassandra v2.0 (Boulder, CO)

Thursday, October 31

De-duplicating the Facebook object graph with Hive & Interactive Programming with Hive (London, England)