Data Eng Weekly

Hadoop Weekly Issue #30

11 August 2013

There was quite a kerfuffle this week over some management changes at Hortonworks as well as a controversial post on HBase at InformationWeek. It'll be interesting to see if the news on these two topics dies down or bleeds into this coming week. In less controversial news, Mortar announced a really cool new project to accelerate development of Pig workflows, and Infochimps announced an acquisition by CSC. All in all, it was quite a busy week in the Hadoop ecosystem.


KijiScoring is a project for generating real-time scores (e.g. for recommendation engines or fraud-detection) for single entities. It complements the KijiMR project, which can build scores for entire datasets in batch, by providing a mechanism to refresh scores for individual entities at run-time based upon a freshness policy. This post walks through the key concepts of KijiScoring and gives a simple overview of the ShelfLife FreshnessPolicy that comes with KijiScoring.
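
To make the freshness idea concrete, here is a minimal Python sketch of a shelf-life style policy: a cached score is served as long as it is younger than a configured shelf life, and recomputed otherwise. This is not the KijiScoring API (which is Java); ShelfLifeCache, CachedScore, score_fn and clock are hypothetical names made up for illustration, and the real ShelfLife FreshnessPolicy may differ in its details.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedScore:
    value: float
    computed_at: float  # seconds, as reported by the clock function


class ShelfLifeCache:
    """Hypothetical illustration only; not part of KijiScoring.

    Serves a cached score while it is younger than the shelf life and
    recomputes it for that single entity once it has gone stale.
    """

    def __init__(
        self,
        shelf_life_seconds: float,
        score_fn: Callable[[str], float],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._shelf_life = shelf_life_seconds
        self._score_fn = score_fn  # assumed to compute a score for one entity
        self._clock = clock        # injectable so the policy can be tested
        self._scores: Dict[str, CachedScore] = {}

    def get(self, entity_id: str) -> float:
        now = self._clock()
        cached = self._scores.get(entity_id)
        # Fresh: the stored score is still within its shelf life.
        if cached is not None and now - cached.computed_at < self._shelf_life:
            return cached.value
        # Stale or missing: recompute for this entity only and store it.
        value = self._score_fn(entity_id)
        self._scores[entity_id] = CachedScore(value, now)
        return value
```

In this sketch the batch job's role (the KijiMR side) would be to populate the stored scores for every entity, while get() covers the read-time refresh for one entity at a time.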

One of the main features of Hoya (HBase on YARN) is support for temporary or short-lived clusters, e.g. for the duration of a MapReduce job. Hoya supports this use case by persisting the HBase data and the metadata describing the cluster in HDFS, allowing for automated recreation of a halted cluster at a later time. This post describes how this feature is implemented in Hoya 0.1.

HUE, the web-frontend to Hadoop, has supported querying data stored in Hive for quite some time. Recently, it also added a Pig editor for writing and running Pig queries. This tutorial covers using HUE's Pig support to query data stored in Hive by accessing the data via HCatalog. HCatalog is a system for exposing data stored in Hive to Pig, MapReduce and other frameworks. It's great to see this workflow in action, because just a few months ago getting all of these components to talk to one another was a herculean effort.

Arun from Hortonworks posted a new GitHub project that shows how to build a barebones YARN application written in Java. The application runs the same Linux command n times across the cluster, performing the YARN equivalent of Hello World (particularly if you set the command to be 'echo Hello World'). This is the simplest Java YARN application that I've seen, and it makes authoring YARN applications quite a bit less intimidating.

Mortar announced Watchtower for Apache Pig, which provides instant feedback and error handling while authoring Pig workflows. To do this, Watchtower pulls a sample of your dataset to a local server and recomputes all intermediate steps (and shows you the values) in near real-time, exposing the output via a website. Watchtower is a plugin for the mortar gem and is also open-source on GitHub.

The LinkedIn booth at Hadoop Summit featured a Raspberry Pi Hadoop cluster, which was a big hit. This walkthrough covers all the steps necessary to build your own Raspberry Pi Hadoop cluster from scratch, including setting up the right version of Java, upgrading the operating system, and overclocking the Raspberry Pi.

Version 1.1 of the Mongo-Hadoop adapter was released a few weeks ago. This post highlights some of the features of the adapter, such as support for Pig and Hive, as well as background on how the adapter reads data out of Mongo.


InformationWeek published a two-party debate about HBase's role in the NoSQL market. On the pro-HBase side is Michael Hausenblas of MapR, who argues that HBase is already winning over other popular NoSQL stores such as Cassandra and MongoDB. On the anti-HBase side, Jonathan Ellis of DataStax argues that HBase's design includes fundamental flaws from which other systems (namely Cassandra) don't suffer.

In response to the above InformationWeek article, the Apache blog featured a post by three Apache HBase committers: Lars Hofhansl, Andrew Purtell, and Michael Stack. The response includes a defense of the HBase project (which was attacked as fragmented), a rebuttal of some of the claims (e.g. that RegionServer failover takes 10 to 15 minutes), and a reproof of the arguments that Hausenblas of MapR made for HBase and for MapR's "next version" of HBase, which is a proprietary system with a different architecture from Apache HBase.

Eric Baldeschwieler, aka Eric14, who co-founded and was the first CEO of Hortonworks, has left the company. No news on the details surrounding the departure (there's been speculation it was an unclean break), but the Hortonworks blog featured a post entitled "Wishing Eric Well."

Mike Olson, Chief Strategy Officer of Cloudera, was recently interviewed by LinuxInsider about big data and open-source. There are some interesting topics discussed in the Q&A -- Mike previously worked in the relational database industry, and he draws parallels between the maturity of Hadoop and RDBMSs of the 80s. He also discusses why the enterprise market can't out-innovate open-source.

The DBMS2 blog has a post with interesting details about Hortonworks, Stinger, and Hive. Some highlights include: Hortonworks is ~250 employees with 70-75 paying customers, HBase and Hive are seeing >50% adoption with their users, and more specifics on Hortonworks' Hive projects -- Tez, the ORC file format, and the Hive query optimizer. There are also some interesting details on what kinds of hardware are becoming common for Hadoop deploys.

CSC, the large consulting and professional services company (98,000 employees), has bought Infochimps for an undisclosed amount. Infochimps offers cloud-based solutions for batch (with Hadoop) and streaming (with Storm) processing. They've also published a lot of useful open-source projects over the past several years, such as wukong (Ruby libraries for Hadoop) and wonderdog (for interfacing between Elasticsearch and Apache Pig). It sounds like Infochimps will continue to operate independently inside of CSC.