Data Eng Weekly

Issue #6

24 February 2013

StrataConf and ApacheCon NA are both coming up this week, but we've already seen quite a few announcements in advance of those conferences. In fact, I think this is the busiest week of Hadoop news since I started the newsletter, and there's so much high-quality material that this issue is larger than normal. Even so, there were still a few high-quality articles that didn't make the cut. Hope you enjoy what I've decided to highlight!


Axemblr provisioner provides services for starting Hadoop-related systems in the cloud. They've published a tutorial for starting an HDFS cluster on EC2.

Greenplum is hosting an interesting experiment: a publicly accessible, free-of-charge Hadoop cluster. And the cluster is nothing to joke about, with 1,000 nodes and 12,000 disks. Organizations such as NASA are already using the cluster to perform analysis. This post has a good overview of the cluster and some information on Greenplum's motivation for running this system.

Are you looking to get started with Apache Hadoop? Or like me, are people constantly asking you how to learn Hadoop? In either case, George Trujillo has a good list of resources and advice for anyone getting started with Hadoop.

Apache HCatalog (incubating) has made a lot of progress in the past year (see more details about the latest release below). This post highlights the past, present, and future of HCatalog. It seems like momentum is building, and HCatalog is going to be a big deal.

Kiji is an open-source table and schema management system for Apache HBase. It has a number of features to make HBase schema management much easier, including the ability to add a hash prefix to keys, which can help to prevent hotspots. In this post, Jon Natkins gives a good overview of the HBase hotspot problem, potential solutions, and how this is done via Kiji.
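Kiji's hashing is its own implementation, but the general key-salting idea is easy to sketch in Python. The bucket count and key format below are illustrative, not Kiji's actual scheme:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; real deployments tune this to the region count

def salted_row_key(user_id: str) -> str:
    """Prefix the row key with a stable hash bucket so that sequential
    ids spread across regions instead of hotspotting one region server."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}|{user_id}"
```

The trade-off is that range scans over the original key now require one scan per bucket, which is exactly the kind of bookkeeping a layer like Kiji can hide.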

RedHat is rebranding Gluster File System as "Red Hat Storage" and has written a "Red Hat Storage Hadoop" plugin. They plan on open-sourcing this and are betting that it will become a viable replacement for HDFS. Rather than releasing their own Hadoop distribution, they plan on working with various Hadoop vendors to bundle distributions with RHEL. The Register has a quite in-depth and technical overview of the strategy and all the pieces involved.

Hortonworks has announced a number of initiatives with really catchy names. They seem to be reiterating their early statements about going all-in on Apache Hive, which is the centerpiece of their new proposals. Lots of coverage (and more about each piece below):

The first project, called "the Stinger initiative," is focused on speeding up Apache Hive -- with an end goal of being 100x faster. Pieces include adding SQL analytics functions (e.g. OVER, NTILE, RANK), improving the query optimizer and planner, a new columnar format (ORCFile), and Tez (more below).
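For readers who haven't used these analytic functions, RANK assigns a position within an ordered partition, with ties sharing a rank. A minimal Python sketch of the semantics (the behavior of `RANK() OVER (PARTITION BY ... ORDER BY ... DESC)`, not Hive's implementation):

```python
from itertools import groupby
from operator import itemgetter

def rank(rows, partition_key, order_key):
    """Emulate SQL RANK(): within each partition, order rows descending
    by order_key; ties share a rank, and the next rank skips ahead."""
    out = []
    rows = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    for _, part in groupby(rows, key=itemgetter(partition_key)):
        prev, r = None, 0
        for i, row in enumerate(part, 1):
            if row[order_key] != prev:
                r, prev = i, row[order_key]
            out.append({**row, "rank": r})
    return out
```

Having this run inside Hive means one declarative clause instead of a hand-rolled secondary sort in MapReduce.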

Tez is a new computation framework built atop YARN, Hadoop's next-gen scheduling and resource allocation system. Since MapReduce has inherent sync barriers and other overhead, the idea is that a new framework is needed to offer true speed. Hortonworks anticipates Tez powering both Pig and Hive, among other things. Tez has been submitted as an Apache Incubator project, and the code will be part of the next alpha version of the Hortonworks Data Platform.

The final part of the Hortonworks trifecta is Knox Gateway, which is a security gateway that sits between Hadoop clusters and clients. Hortonworks anticipates a release by late March.

As you've probably noticed from the last few weeks' worth of links, lots of organizations are climbing on the SQL-on-Hadoop bandwagon. I don't think we've seen the last of the announcements in this area, but GigaOm has a good summary of what's out there so far.

Choosing a system for a write-heavy analytics solution can be a major challenge, particularly if you don't have a system that's easily extensible to this new use case. The folks at MarkedUp eventually chose Apache Cassandra (they were migrating from RavenDB), but evaluated MongoDB, Riak, and HBase as well. The analysis has some interesting points -- e.g. Riak's MapReduce limitations and HBase's difficult setup and configuration -- and they also share some interesting data points regarding writes on SSD instances in EC2.

In contrast to implementing an outer join in MapReduce, writing an outer join in Hive is a piece of cake. This is the second in a series of posts demonstrating common MapReduce techniques in different frameworks.
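As a rough illustration of what Hive's join syntax saves you from writing by hand, here is a full outer join on a single key in plain Python -- a toy sketch that ignores the partitioning and sort-merge machinery a real MapReduce join must implement:

```python
def full_outer_join(left, right, key):
    """Join two lists of dicts on `key`, emitting (left_row, right_row)
    pairs with None for the side that has no match, like a SQL
    FULL OUTER JOIN."""
    left_idx, right_idx = {}, {}
    for row in left:
        left_idx.setdefault(row[key], []).append(row)
    for row in right:
        right_idx.setdefault(row[key], []).append(row)
    out = []
    for k in sorted(set(left_idx) | set(right_idx)):
        for l in left_idx.get(k, [None]):
            for r in right_idx.get(k, [None]):
                out.append((l, r))
    return out
```

In Hive this whole function collapses to a single `FULL OUTER JOIN ... ON` clause, which is the point the post is making.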


Apache HCatalog 0.5.0-incubating was released. HCatalog brings the Hive metastore, InputFormats, and SerDes to frameworks like Pig, MapReduce, and Hadoop Streaming. This version of HCatalog uses Hive 0.10 and has a number of new features, such as Maven integration and WebHCat, a REST API for HCatalog.

CitusDB has added support for HDFS to their PostgreSQL-based analytics platform, joining the ranks of Cloudera's Impala, Platfora, Drawn to Scale, Amazon Redshift, et al. in the low-latency SQL-on-Hadoop space. They claim to offer a 3-40x speedup over Hive, via PostgreSQL's foreign data wrappers. This essentially means that they have an instance of PostgreSQL running on each node in the cluster, and they have a master/scheduler node to distribute the work (somewhat similar to Hadapt). Their blog digs into the technical details while GigaOm gives some background info.

Concurrent, the folks behind Cascading, have released Lingual. Lingual is a JDBC driver and SQL parser that sits atop Cascading. Unlike many other SQL-on-Hadoop implementations, Lingual supports ANSI-standard SQL.

Apache Pig 0.11.0 was released, and it includes a lot of exciting new features. From the release notes: "This release includes DateType datatype, RANK, CUBE and ROLLUP operators, Groovy udfs, custom reducer estimation, schema-based tuples and HCatalog DDL integration". I'm quite excited about the analytics functions (RANK/CUBE/ROLLUP) -- I've had coworkers asking for these, and it could simplify a lot of computation.
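For those new to ROLLUP: it computes aggregates at successively coarser grouping levels in a single pass over the data. A small Python sketch of the semantics (not Pig's implementation -- dimension names and the sum aggregate here are just for illustration):

```python
from collections import defaultdict

def rollup_sum(rows, dims, measure):
    """Sum `measure` for every ROLLUP grouping set of `dims`:
    (d1, ..., dn), (d1, ..., d(n-1)), ..., (d1,), and the grand total ()."""
    totals = defaultdict(int)
    for row in rows:
        for level in range(len(dims), -1, -1):
            group = tuple(row[d] for d in dims[:level])
            totals[group] += row[measure]
    return dict(totals)
```

In Pig 0.11 the equivalent is a single `CUBE rel BY ROLLUP(...)` statement instead of running one GROUP BY job per level.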


The call for speakers is open for the Apache Cassandra Summit, which takes place June 11th and 12th in San Francisco.

The call for speakers is open for HBaseCon 2013. Abstract submissions are due April 15, and the conference takes place June 13 in San Francisco.