Data Eng Weekly

Hadoop Weekly Issue #27

21 July 2013

This issue is a little light on technical content, but there is a lot of interesting news and a number of releases. Headlining this week are the release of Pivotal HD GA (the distribution from EMC's spinoff Pivotal Labs) and Cloudera's acquisition of London-based Myrrix.


Apache Bigtop provides, among other things, a suite of smoke tests to verify that the Hadoop components (HDFS, MapReduce, Hive, Pig, etc.) all work together correctly. Often you only care about particular components, and this post explains how to customize Bigtop to run tests for specific components and how to customize the test run list.

Apache Hive 0.11 includes a rewrite of the server process, called HiveServer2. The main goals of the new system are to support concurrency and security, which weren't available in the original HiveServer. This post covers the HiveServer2 architecture, client system (which includes a new command line interface), authentication, and gateway features.


In the second part of his recap from Hadoop Summit, Merv Adrian discusses the state of SQL on Hadoop, which was a big theme at the summit. He covers everything from Apache Drill (pre-alpha) to Impala (released last year) to Hadapt (which was announced in 2011). The article acts as a good map and brief history of the landscape, which includes dozens of companies and projects.

Cloudera has acquired London-based Myrrix, maker of tools for large-scale machine learning built with Apache Hadoop and Apache Mahout. Myrrix's Sean Owen joins Cloudera as Director of Data Science in London. It should be interesting to see what happens with some of Myrrix's technology, such as their Myrrix recommender engine platform (portions of which are open-source).

Apache Hadoop originally relied upon trust-based authentication (and many users still use this mode), although Kerberos-based authentication was added during the 0.20.x series. Other forms of security, such as auditing and encryption, are still under development and will probably take years to reach the entire ecosystem. GigaOm covers the need for better security in Hadoop and some of the efforts underway to improve security offerings.

O'Reilly is offering 50% off Hadoop eBooks through July 26th. I own a number of these titles and have found them to be of high quality and very useful.

Cloudera announced that nominations are open for the 2013 Data Impact Awards to be presented at Hadoop World in October. The awards focus on achievements by users of CDH in a few areas: business impact, social impact, community contributions, pervasive user adoption, and integration with existing IT.


Pivotal announced the GA of their distribution, Pivotal HD. It is built upon Hadoop 2.0.2-alpha and supports all the common pieces of the Hadoop stack (Hive, Pig, Flume, etc.). In addition, it comes with Spring Hadoop 1.0.0 and has the Hadoop Virtualization Extensions built in (Spring is part of Pivotal, which shares a parent company, EMC, with VMware). Pivotal HD also provides the platform for Pivotal's SQL for Hadoop solution, HAWQ. A single-node VM containing both is available for download.

Hue, the web-based front-end for Hadoop, released version 2.5 this week. The release includes a new app, the HBase Browser, which supports CRUD operations for tables and individual cell entries. This release also includes fixes for Pig, Impala, Oozie, and more.

Apache Hadoop WebHDFS is a Perl implementation of a WebHDFS client. Version 0.03, which includes support for a number of previously unsupported APIs (e.g. gethomedirectory, settimes, setpermission) as well as bug fixes and doc updates, was released this week.
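
For readers who haven't used WebHDFS directly, here is a rough Python sketch of how the three operations named above map onto the underlying REST interface. This is not the Perl client's API. The helper name webhdfs_url, the host, the port (50070, a common NameNode HTTP port at the time), and the file path are all assumptions for illustration. The sketch only builds URLs and sends nothing over the network.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one operation (hypothetical helper).

    No HTTP request is made; the caller decides how to send it.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Assumed cluster details, purely illustrative.
HOST, PORT = "namenode.example.com", 50070
FILE = "/user/alice/data.txt"

# Each entry pairs the HTTP method WebHDFS expects with the URL to call.
requests_to_send = [
    # Ask the NameNode for the caller's home directory.
    ("GET", webhdfs_url(HOST, PORT, "/", "GETHOMEDIRECTORY")),
    # Set modification and access times (milliseconds since the epoch).
    ("PUT", webhdfs_url(HOST, PORT, FILE, "SETTIMES",
                        modificationtime=1374364800000,
                        accesstime=1374364800000)),
    # Set the permission bits as an octal string.
    ("PUT", webhdfs_url(HOST, PORT, FILE, "SETPERMISSION", permission="644")),
]

for method, url in requests_to_send:
    print(method, url)
```

On a real cluster, authentication parameters would also be needed, and those depend on how the cluster is secured.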

Oracle distributes "Oracle R Enterprise", which provides support for using R on Oracle databases. The Oracle R Connector for Hadoop (ORCH), as far as I can tell, is a similar project for data stored in HDFS. The latest version, ORCH 2.2.0, adds support for CDH 4.3, HDP 1.2, and Apache Hadoop 1.0 as well as an HDFS cache to improve performance for interactive file system navigation. ORCH is available as a free download.

Actually open-sourced a month ago (but I missed it), hdfs2cass is a system for moving data from HDFS to Cassandra using MapReduce. It can build SSTables as part of the job and implements a custom partitioner that is aware of the Cassandra topology.

RainStor announced version 5.5 of their database system, which they're calling the 'first-ever enterprise grade database for Hadoop'. The RainStor system provides lots of security features, SQL, and free-text search over data in HDFS. The system can be deployed alongside Hadoop and has support for many different distributions.

Version 0.2 of the Savanna project, which focuses on running Hadoop on OpenStack, was released. There are a bunch of new features in this release, such as pluggable provisioning (Apache Ambari, Cloudera Manager, etc.), cluster resizing capabilities, Cinder block storage, and anti-affinity (to make sure that only one datanode is run on each physical machine).
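
To make the anti-affinity idea concrete, here is a toy Python sketch of the constraint itself: no two datanodes may land on the same physical host. It is not how the project implements the feature. The function names, instance names, and host names are all hypothetical, and a real scheduler would weigh many more factors.

```python
from collections import defaultdict


def place_with_anti_affinity(instances, hosts):
    """Assign each instance to its own host (hypothetical helper).

    Raises ValueError when there are fewer hosts than instances,
    because the constraint cannot be satisfied in that case.
    """
    if len(instances) > len(hosts):
        raise ValueError("not enough hosts to keep instances apart")
    return dict(zip(instances, hosts))


def anti_affinity_violations(placement):
    """Return the hosts that run more than one instance, with their instances."""
    by_host = defaultdict(list)
    for instance, host in placement.items():
        by_host[host].append(instance)
    return {host: names for host, names in by_host.items() if len(names) > 1}


# Two datanodes, three physical hosts: each datanode gets its own host.
placement = place_with_anti_affinity(["dn-1", "dn-2"],
                                     ["host-a", "host-b", "host-c"])
print(placement)

# A placement that breaks the rule is reported by the checker.
print(anti_affinity_violations({"dn-1": "host-a", "dn-2": "host-a"}))
```

The point of the rule is that losing one physical machine then takes out at most one datanode, so HDFS block replicas are less likely to disappear together.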