Data Eng Weekly

Hadoop Weekly Issue #18

19 May 2013

Both Apache Hadoop and Apache Hive shipped new releases this week, and there are a number of interesting technical articles covering YARN, NFS access to HDFS, and Apache Flume. With so much happening so quickly in the Hadoop ecosystem, it can be difficult to keep up -- so please let me know if I missed anything, and I'll include it next week.


Apache HDFS is getting support for the Network File System (NFS) protocol. This is an exciting new feature, and one of its authors details the what, why, how, and when of Hadoop's NFS support, which is being developed in trunk.

Cloudera's blog has the second in their "meet the founders" series. This post features Roman Shaposhnik, who founded and works on Apache Bigtop. Aside from having one of the best project names in the Hadoop ecosystem, Bigtop is beginning to have a lot of influence in making sure that components in the stack are compatible with one another when released.

If you've ever tried to put together a patch for Hadoop, you know that just configuring your development environment can be intimidating (and slow). This post provides an overview of setting up Eclipse for developing Hadoop -- covering all the major versions and flavors under development.

You might remember the "Stinger Initiative", which Hortonworks introduced a while back with the goal of making Hive 100x faster. With the release of Hive 0.11 (more below), they summarize some of the work that's already been done towards this goal, as well as some of the new features in Hive 0.11 (such as RANK and other analytic functions).
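For readers who haven't used analytic functions, here is a minimal Python sketch of what RANK computes. It mimics a query along the lines of SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM employees. The table, column names, and sample rows are invented for illustration, and this is not how Hive implements the function.

```python
from itertools import groupby
from operator import itemgetter


def rank_over(rows, partition_key, order_key, descending=True):
    """Mimic RANK() OVER (PARTITION BY partition_key ORDER BY order_key).

    Ties share a rank, and the rank after a tie skips ahead (1, 1, 3, ...).
    """
    ranked = []
    # Group rows by partition; groupby needs its input sorted on the same key.
    by_partition = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(by_partition, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=descending)
        previous_value, current_rank = object(), 0
        for position, row in enumerate(ordered, start=1):
            if row[order_key] != previous_value:
                current_rank = position  # new value: rank jumps to the row position
                previous_value = row[order_key]
            ranked.append({**row, "rank": current_rank})
    return ranked


# Hypothetical sample data, invented for illustration.
employees = [
    {"name": "ann", "dept": "eng", "salary": 120},
    {"name": "bob", "dept": "eng", "salary": 120},
    {"name": "cat", "dept": "eng", "salary": 100},
    {"name": "dan", "dept": "ops", "salary": 90},
]

for row in rank_over(employees, "dept", "salary"):
    print(row["dept"], row["name"], row["salary"], row["rank"])
```

Note the gap after the tie: the two top salaries in "eng" both get rank 1, and the next row gets rank 3 rather than 2.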

The Manning Early Access Program (MEAP) is now available for the new book, "Pig in Action", by M. Tim Jones. With MEAP, you pre-order the book but get access to the content as the author is writing and uploading it.

Apache Flume is a system for transferring data from application servers or other event-generators to HDFS or HBase. In this post, the author gives an overview of the Flume architecture -- both at the component and system scale.
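As a rough mental model of the component-level picture, a Flume agent moves events from a source, through a buffering channel, to a sink. The toy Python sketch below imitates that flow. It is not the Flume API: every class and function name here is hypothetical, and a plain list stands in for HDFS or HBase.

```python
import queue


class Channel:
    """Bounded buffer that decouples the source from the sink."""

    def __init__(self, capacity=100):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._events.put_nowait(event)  # raises queue.Full when the buffer is full

    def take(self, max_events):
        batch = []
        while len(batch) < max_events and not self._events.empty():
            batch.append(self._events.get_nowait())
        return batch


def run_source(lines, channel):
    """Source: turn raw input lines into events and hand them to the channel."""
    for line in lines:
        channel.put({"body": line.rstrip("\n")})


def run_sink(channel, destination, batch_size=2):
    """Sink: drain the channel in batches and write each batch to the destination."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.append([event["body"] for event in batch])


channel = Channel()
store = []  # stand-in for HDFS or HBase
run_source(["GET /a\n", "GET /b\n", "GET /c\n"], channel)
run_sink(channel, store)
print(store)
```

The point of the channel is that the source and sink never talk to each other directly, so a slow destination only fills the buffer instead of blocking the application server.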

The Natural Language Toolkit (NLTK) is a set of Python libraries for natural language processing. This post describes how to tie them to Hadoop MapReduce for parallel processing using Hadoop Streaming.
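To give a flavor of the approach, here is a minimal token-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary scripts that read stdin and write tab-separated records to stdout. It is not the code from the linked post. The script layout and the "map"/"reduce" arguments are assumptions, and it falls back to a plain regex when NLTK isn't installed so that it stays runnable.

```python
import re
import sys
from itertools import groupby

try:
    # Regex-based tokenizer; unlike word_tokenize it needs no downloaded models.
    from nltk.tokenize import wordpunct_tokenize as tokenize
except ImportError:
    # Assumption: fall back to a similar regex so the sketch runs without NLTK.
    def tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def map_lines(lines):
    """Mapper: emit one 'token<TAB>1' record per alphabetic token."""
    for line in lines:
        for token in tokenize(line.lower()):
            if token.isalpha():
                yield f"{token}\t1"


def reduce_lines(lines):
    """Reducer: sum counts per token; Streaming delivers input sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved as, say, wordcount.py (a hypothetical name), the same file could be passed to the streaming jar with -mapper "python wordcount.py map" and -reducer "python wordcount.py reduce", shipping the script to the cluster with -file. NLTK itself would need to be installed on every task node.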

Arun C. Murthy, one of the leads on the Apache Hadoop YARN project, gives an update on the progress of the project plus background on what YARN can enable. In particular, YARN turns Hadoop into a multi-application system, allowing more than just MapReduce to run on Hadoop. Arun highlights that we'll be able to run SQL in Hadoop rather than SQL on Hadoop (via MapReduce).

Hortonworks has compiled a list of links for the Hadoop on Windows developer. The list includes the .NET SDK, the Microsoft Hive ODBC driver, and the HDInsight Preview (Hadoop on Azure).

Platfora's product combines a UI and a low-latency data store to do interactive analysis on data stored in S3 or HDFS. If the data isn't already in Platfora's store, the system can generate a MapReduce job to load the data. This article gives a good overview of how all of the technology components in the Platfora system work together.

HUE provides a UI for interacting with Hadoop, Hive, Pig, and more. This post describes how to leverage HUE's Python API to execute queries against Hive (via HiveServer2) or Impala (which must implement the same Thrift API).

Storm is sometimes called the real-time version of MapReduce. With a lot of interest in getting Storm running on YARN, now's a good time to get familiar with the system. The inaugural London Storm Meetup featured an overview of Storm as well as a discussion of the presenter's use case. This post has a summary of the event, including links to the presentation and code examples.


Contexti and MapR have joined forces to provide training, consulting, and professional services for MapR's distribution in Asia-Pacific.

Drawn to Scale, a SQL-on-Hadoop vendor, has announced that they're closing their doors. They had an interesting system, which was built to perform well on many types of SQL operations, and they even had a compatibility layer for MongoDB. It should be interesting to see what happens to that team and their technology.

Concurrent and MapR announced that Concurrent's Cascading framework is now certified to run on MapR's distribution.


Hadoop 1.2.0 was released, featuring a backport of DistCp v2, web services for the JobTracker, the offline image viewer, and a bunch of other enhancements.

WibiData announced Albacore/BentoBox v1.0.4. This version has some new features, including a whole new component -- KijiREST, which provides a REST interface to KijiSchema.

Hive 0.11 was released with over 350 Jira issues closed. This is the first release since HCatalog was integrated as a subproject of Hive, and it has a bunch of new features such as HiveServer2, ORCFile, and analytic functions.

Talend's Open Studio was updated to version 5.3 a few weeks ago. This post has a quick overview of the new features, which include a new integration with Apache Pig, as well as support for Amazon's Elastic MapReduce and Redshift.


Curated by Mortar Data

Monday, May 20 MySQL to Cassandra: Big Data, High Scale, Data Migration... Oh My! (New York, NY)

Monday, May 20 Automating the Hadoop Stack (Los Angeles, CA)

Tuesday, May 21 Data & Drinks - Member Networking Meetup (New York, NY)

Tuesday, May 21 Recommendation Engines & Accumulo (Denver, CO)

Tuesday, May 21 Thoughts on machine learning (New York, NY)

Tuesday, May 21 How we use Scala on Hadoop @ eBay (New York, NY)

Wednesday, May 22 Cloudera Impala: An Open Source Real-Time Query Engine for Apache Hadoop (Boulder, CO)

Wednesday, May 22 Big Data, NoSQL, Now What? (New York, NY)

Saturday, May 25 Big Data Science Meetup Event (Fremont, CA)