Data Eng Weekly

Hadoop Weekly Issue #19

26 May 2013

There were a lot of exciting announcements this week, including Hortonworks announcing General Availability of HDP for Windows, and Concurrent announcing its new Pattern framework for machine learning on Hadoop. There are also a bunch of interesting technical articles about recent releases -- Phoenix, HUE, Kiji, CQL, and more. Hope you enjoy!


Phoenix is a SQL layer atop Apache HBase from Salesforce. The latest release includes support for skip scans, which increase performance 3x-20x over a batched get. Skip scans use information about the query's key range to perform server-side skips over uninteresting parts of the key range (the exact details are a bit more complex, and there's a good overview in this post). In addition to an overview, they have a performance analysis given a few different dataset characteristics.
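To make the idea concrete, here is a minimal sketch of a skip scan over an in-memory sorted list of row keys. It is not Phoenix's implementation: the function names (`skip_scan`, `full_scan`), the half-open range format, and the "rows touched" counter are all assumptions made for illustration. The point is only that seeking to the start of the next range of interest avoids reading the rows in between.

```python
from bisect import bisect_left


def skip_scan(sorted_keys, ranges):
    """Scan only the keys inside `ranges`.

    `sorted_keys` is a sorted list of row keys. `ranges` is a sorted,
    non-overlapping list of half-open (start, stop) key ranges.
    Returns (matching_keys, rows_touched).
    """
    matches = []
    touched = 0
    pos = 0
    for start, stop in ranges:
        # Seek straight to the first key >= start, skipping the gap
        # between the previous range and this one.
        pos = bisect_left(sorted_keys, start, pos)
        while pos < len(sorted_keys) and sorted_keys[pos] < stop:
            matches.append(sorted_keys[pos])
            touched += 1
            pos += 1
    return matches, touched


def full_scan(sorted_keys, ranges):
    """Baseline for comparison: read every row and filter it."""
    matches = []
    touched = 0
    for key in sorted_keys:
        touched += 1
        if any(start <= key < stop for start, stop in ranges):
            matches.append(key)
    return matches, touched


if __name__ == "__main__":
    keys = list(range(1000))
    wanted = [(10, 13), (500, 502), (990, 991)]
    print(skip_scan(keys, wanted))     # same matches, few rows touched
    print(full_scan(keys, wanted)[1])  # every row touched
```

Both functions return the same keys; the difference is how many rows each one has to read to find them, which is where the benefit of skipping comes from when the ranges of interest are sparse.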

Syncsort has joined the club of those publishing sort benchmarks based upon TeraSort -- they claim to have improved per-disk throughput with their custom DMX-h Sort solution. I'm unsure of the technical details (aside from those mentioned in MAPREDUCE-2454), but the claims seem interesting. I'm curious why they chose not to compress map output, though, since that's a well-understood perf gain.

There's been a lot of tech press coverage of Apache Hadoop's YARN project over the past week. These articles include some technical details as well as background on the kinds of problems that YARN is solving, which makes them a good introduction if you're not familiar with the new Hadoop 2.0 framework.

HUE 2.3 has a new component for writing and running Apache Pig scripts. This post gives an overview of the features, such as syntax highlighting and autocomplete. This is the first web-based Pig editor/runner that I've seen, and it seems to be pretty full-featured (with even more features coming soon).

A common pattern in a MapReduce job, which is typically easier said than done, is to write to multiple output directories. Chock-full of diagrams and code samples, this article has an in-depth introduction to using Hadoop's MultipleOutputs.

DataStax recently introduced a new Java driver for Apache Cassandra. From a distributed systems perspective, one of the most interesting parts of the software is its support for multiple load balancing strategies, some of which are aware of data center and hashing strategies. In this post, the author details the testing that was done to verify the functionality of these strategies.
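As a rough illustration of what data-center-aware and token-aware load balancing mean, the sketch below builds an ordered list of hosts to try for each request. It is written in Python for brevity and is not the driver's code: the class names, the `query_plan` method, and the `replicas_for` callback are all hypothetical. One deliberate simplification is that the token-aware wrapper promotes any replica in the plan, including remote ones.

```python
from itertools import count


class DCAwareRoundRobin:
    """Rotate through hosts in the local data center, then fall back to remote hosts."""

    def __init__(self, hosts, local_dc):
        # hosts is a list of (host_name, data_center) pairs.
        self.local = [h for h, dc in hosts if dc == local_dc]
        self.remote = [h for h, dc in hosts if dc != local_dc]
        self._counter = count()

    def query_plan(self):
        i = next(self._counter)
        n = len(self.local)
        # Start one position later on each call so load spreads evenly.
        rotated = [self.local[(i + k) % n] for k in range(n)]
        return rotated + self.remote


class TokenAware:
    """Wrap another policy and move replicas for the request's key to the front."""

    def __init__(self, child, replicas_for):
        self.child = child
        # replicas_for maps a partition key to the set of hosts owning it
        # (a stand-in for the cluster's real token/hash ring).
        self.replicas_for = replicas_for

    def query_plan(self, key):
        plan = self.child.query_plan()
        replicas = self.replicas_for(key)
        owners = [h for h in plan if h in replicas]
        others = [h for h in plan if h not in replicas]
        return owners + others


if __name__ == "__main__":
    hosts = [("a", "dc1"), ("b", "dc1"), ("c", "dc2")]
    policy = TokenAware(DCAwareRoundRobin(hosts, "dc1"), lambda key: {"b"})
    print(policy.query_plan("some-key"))
```

Testing strategies like these mostly comes down to checking the order of the returned plan under different cluster layouts, which is the kind of verification the post describes.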

As mentioned last week, KijiREST is a new REST API framework for interacting with KijiSchema. This post is an overview of the structure of the REST API, which is a really important component for exposing Kiji (and HBase) to non-JVM languages and frameworks.

Syncsort announced two new data products this week, DMX-h Sort and DMX-h ETL. In this article, they talk a bit about the ETL product, which provides tools for solving problems which are typically really difficult -- such as joins in which both datasets are large and identifying changes (deltas) between datasets. They have a pre-release test drive available on their website.


WibiData, which employs (and was founded by) several former Cloudera employees, just raised $15 million. WibiData develops the Kiji framework for HBase and provides tools for enterprises to build big data applications.

Cloudera and VMware announced that Cloudera Manager Enterprise is certified to run on VMware's vSphere. Customers can use Project Serengeti (the open-source project started by VMware) to bootstrap a Hadoop cluster running on vSphere.

Actian, the maker of the ParAccel Analytical Database, announced a partnership with Hortonworks that includes tools to exchange data between Hadoop and ParAccel, with HCatalog for managing metadata. ParAccel is the software powering Amazon's Redshift, so it should be interesting to see if this framework makes its way there, too.


Hortonworks has announced that the Hortonworks Data Platform (HDP) for Windows has reached general availability (GA). The Hortonworks blog mentions how this is opening up Hadoop to a whole new set of companies and organizations, and they highlight some of the types of organizations that they've seen adopting it.

Concurrent released Pattern, an open-source framework for running Machine Learning algorithms on Hadoop by leveraging Predictive Model Markup Language (PMML). Pattern runs within Cascading, and its support for PMML makes it compatible with numeric frameworks like R.

Cloudera Manager 4.5.3 was released -- it has a few bug fixes.

HUE includes a new Oozie UI (it appears that it can be used stand-alone, outside of HUE) for viewing and rerunning Oozie jobs. Anyone who has ever used Oozie knows how bad the UI is, and this is another step in the right direction. Hopefully someone will contribute a new UI to Apache Oozie itself soon -- it's in dire need of updating.

Apache Accumulo 1.5.0 was released. Accumulo is a key-value store with a similar data model to HBase/BigTable, but it provides cell-level access control. This release includes support for Hadoop 2.0, Pig, and Kerberos-enabled HDFS as well as a large number of improvements and bug fixes.

I missed this last week -- Phoenix 1.2, the SQL on HBase implementation, was released. This release includes performance improvements, an optimized top-n query, support for generating Phoenix HFiles from Pig and MapReduce, and a bunch more.

The Cassandra Query Language (CQL) is an SQL-like language for interacting with an Apache Cassandra cluster. DataStax announced that the CQL driver for Cassandra 1.2+ has hit general availability (GA). This is important because CQL provides a much better interface than Thrift (and also outperforms it), without the need to tune things like thread pool sizes. DataStax is predicting that it'll become the de facto standard API for interacting with Cassandra (which would be good, since there is a lot of fragmentation here).


Curated by Mortar Data

Monday, May 27 An Introduction to Impala – Low Latency Queries for Apache Hadoop (Madison, WI)

Wednesday, May 29 Big Data Debate: Does privacy exist in a Big Data World? (London, UK)

Wednesday, May 29 Distributed Random Forest (San Francisco, CA)

Wednesday, May 29 18th Big Data London Meetup (London, UK)

Wednesday, May 29 Hands on Mahout - Recommendation Engines (Toronto, CA)

Wednesday, May 29 Real-World Machine Learning on Big Data: Which Method(s) Should You Use? (San Jose, CA)

Thursday, May 30 Big Data for Business (Englewood, CO)

Saturday, June 1 Interactive meetup on managing large data set on AWS (Amazon EMR) (Pune, India)