Data Eng Weekly

Hadoop Weekly Issue #24

30 June 2013

Hadoop Summit was this week, and this issue covers a lot of the announcements and presentations from the summit. Thanks to everyone who presented -- it was great to hear about all the different parts of the Hadoop ecosystem! I've gathered major news and technical presentations (I'm sure more will be posted online soon), but I wasn't able to include everything due to the sheer volume of announcements made this past week. I'm sure you'll enjoy what I've found, though!


Hoya is a new project out of Hortonworks to deploy Apache HBase on YARN. This is an interesting idea, and the blog post outlines a number of potential use cases, for instance running two versions of HBase on the same YARN cluster or spinning up an HBase cluster for the duration of a MapReduce pipeline (to use HBase as a cache). Code hasn't been released yet, but Hortonworks plans to post it on GitHub and to write more blog posts about Hoya in the future.

If you're using Apache Pig, then you should be using (if you're not already) the DataFu collection of UDFs and UDF-libraries from LinkedIn. This presentation provides an overview of DataFu including both concrete UDFs and a bit of background on some of the extensible library APIs.

Cloudera Search uses Apache Solr to provide search on data in HDFS and HBase. It supports both bulk indexing and real-time indexing with Flume. This post is a detailed look at the features and architecture of Cloudera Search, which is currently in beta.

The second post in the series on KijiRest covers HTTP PUT (to replace or add a row) and POST (to add a new row) in addition to GET operations. The post has a bunch of examples, which is probably all you need to get started with Kiji in your favorite non-JVM language.

Apache Oozie has a REST API for CRUD operations -- in fact, both the UI and the command-line client are powered by it. This tutorial walks through how to use the API, including in a secure installation.
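
To give a feel for the kind of requests such a tutorial deals with, here is a minimal sketch that only builds request URLs and does not contact a server. The host name is hypothetical, and the `v1/admin/status` and `v1/jobs` paths and the `filter`/`len` parameters are written from memory of Oozie's web services documentation, so check them against your Oozie version. A secure installation would also need authentication, which this sketch leaves out.

```python
from urllib.parse import urlencode

# Hypothetical Oozie server; 11000 is the port Oozie commonly listens on.
BASE = "http://oozie.example.com:11000/oozie"


def oozie_url(base, resource, **params):
    """Build a URL for an Oozie web-services resource (assumed v1 layout)."""
    url = f"{base.rstrip('/')}/v1/{resource}"
    if params:
        # urlencode percent-escapes values, e.g. "user=alice" -> "user%3Dalice"
        url += "?" + urlencode(params)
    return url


# System status, and the ten most recent jobs for one (made-up) user.
status_url = oozie_url(BASE, "admin/status")
jobs_url = oozie_url(BASE, "jobs", filter="user=alice", len=10)

print(status_url)
print(jobs_url)
```

Either URL could then be fetched with any HTTP client, which is why the API is convenient from non-JVM languages.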

ORC File or "Optimized RC File" is a new file format in Hive 0.11. ORC improves query speed and storage efficiency by storing data in column-major order, maintaining statistics about columns, and doing light-weight (run-length, dictionary, and bit-packed) encoding. This talk explains the file format as well as future work for Hive 0.12, including vectorization, which will operate on many values of a single column at once to improve query speed.
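
The ideas named above (column-major layout, per-column statistics, run-length and dictionary encoding) can be illustrated with a toy sketch. This is not ORC's actual on-disk encoding, which is considerably more elaborate; the function names and sample rows are made up for illustration.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def dictionary_encode(values):
    """Replace each distinct value with a small integer id."""
    ids = {}
    encoded = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        encoded.append(ids[v])
    # dict preserves insertion order, so list position == assigned id
    return list(ids), encoded


def column_stats(values):
    """Simple per-column statistics a reader could use to skip data."""
    return {"min": min(values), "max": max(values), "count": len(values)}


# Row-major input (made-up data), transposed into one list per column.
rows = [("us", 3), ("us", 3), ("us", 7), ("de", 7), ("de", 7)]
country, clicks = (list(col) for col in zip(*rows))

print(dictionary_encode(country))
print(run_length_encode(clicks))
print(column_stats(clicks))
```

Because values in a single column tend to repeat or sit close together, these encodings shrink far better on columns than on interleaved rows, and the min/max statistics let a query skip chunks that cannot match a filter.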

Parquet is another file format with similar design goals to ORC. Cloudera and Twitter are both throwing weight behind it, with support in Impala and Pig (Hive support isn't merged to master yet). These slides contain an overview of the format, and also touch on the differences between ORC and Parquet.

Hortonworks and Pearson Education have posted a two-chapter preview of the upcoming book on Apache Hadoop YARN. The first chapter is a YARN quick start, while the second chapter covers the components of YARN and how it fits into the Hadoop ecosystem.


Hortonworks announced a $50 million round of funding. The round included all existing investors and two new ones. This brings Hortonworks' total funding to at least $70 million.

The Hortonworks blog's "Week in Review" includes highlights from Hortonworks-related announcements this week as well as links to recaps of each day of Hadoop Summit. Videos of the keynotes for Day 1 and Day 2 are also available. It links out to a bunch of great content, so don't forget to click around!

WANdisco announced that their distribution will support Spark, the in-memory processing framework, and Shark, the Hive compatibility layer for Spark. As far as I know, they're the first Hadoop distro to include Spark/Shark, which is a big vote of confidence in a crowded field of SQL-on-Hadoop solutions.

CMSWire has a recap of Hadoop Summit, including some notable open-source contributions from InMobi and Netflix, details on the discussions surrounding YARN, and some of the partnership news.

An overview of how the enterprise database industry is being shaken up by both NoSQL and Hadoop. There are some details in here that I hadn't heard before -- from IBM pushing MongoDB to rumors of Microsoft and Intel trying to buy Hortonworks.

Silicon Angle covers an interview with Arun C. Murthy about YARN from Hadoop Summit. The post embeds a video interview with Arun, in which he admits that Hadoop succeeded in coming up with "the lamest name ever" (referring to YARN). All jokes aside, YARN is the future of Hadoop, and it's well worth watching the interview.

MapR is claiming a 25x speedup in HBase for read-intensive loads with FusionIO. MapR's distribution uses their proprietary file system, which has a different architecture than HDFS. This might give it an advantage with systems like FusionIO -- a little over a year ago, a post concluded that the HDFS read path had bottlenecks that would prevent HBase from seeing a speedup with SSDs. A lot has happened with HDFS in the past 13 months, so this might no longer be the case (I haven't heard of anyone running HDFS on SSDs, though).

Cloudera celebrated a big day this past week -- five years since they incorporated the company. Hortonworks spun out of Yahoo just two years ago, so this is a reminder of just how much catch-up the rest of the Hadoop vendors are playing.


The 1.1.0 BentoBox "Buri" release contains updates to all components in the Kiji stack. Of note, KijiMR hit its 1.0 release, signifying API compatibility with future releases in the 1.x line.

Netflix released Lipstick, its system for visualizing and monitoring Pig workflows. The code is up on GitHub, and the blog post highlights two neat UI features -- viewing sample data at each step of a workflow and toggling between the optimized and un-optimized versions of the Pig query plan.

Datameer, whose product provides a UI to analyze data stored in Hadoop, announced version 3.0. This version has some compelling new features such as decision trees, clustering, and collaborative filtering. I saw a demo of the new version at Hadoop Summit, and I was impressed with the user experience of these new features.

Cloudera announced version 0.4.0 of the Cloudera Developer Kit (CDK). The most notable change is the addition of the Morphlines library, which is a system for doing ETLs with Hadoop components by defining an ETL chain in a configuration file. Morphlines is used by Cloudera Search to ingest data from Flume.

Twitter released and presented on hRaven at Hadoop Summit. hRaven is a tool for analyzing job tracker history and is aware of higher-level DAGs (i.e. Pig or Scalding flows) that spawn multiple MapReduce jobs. It makes all of the information available via a REST API and uses HBase as a backend. Twitter is doing some cool stuff with it, such as a Pig parallelism estimator that uses historic data to set the number of reduce tasks during a Pig job.

Hortonworks has released a community preview of Hortonworks Data Platform (HDP) 2.0. The preview is based upon Hadoop 2.0.0, which means the main execution framework is YARN. It also includes Tez, which can be used by Hive as the execution engine instead of MapReduce.

WANdisco announced "S3-Enabled HDFS", which provides a compatibility layer with the Amazon S3 API. Their press release claims "Open Source Availability" -- although I wasn't able to track down the source code for this feature after searching their website for a few minutes.

Splunk announced a beta version of "Hunk", which bridges the gap between Splunk and data stored in HDFS. Hunk allows Splunk to query and interact with data stored in HDFS -- a more enhanced version of the SolrCloud-based solutions from MapR and Cloudera, if I understand correctly.

VMware announced a beta of its Big Data Extensions for vSphere, which aim to support Hadoop within vSphere. Through its Project Serengeti, VMware has been working on making Hadoop VM-aware (one example is a smarter block placement strategy that ensures multiple copies of a block aren't stored on virtual nodes residing on the same physical machine).
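
The block placement idea mentioned above can be sketched in a few lines. This is an illustrative toy, not the actual HDFS or Serengeti policy; the function name, node names, and host names are all made up.

```python
def place_replicas(nodes, replication=3):
    """Pick virtual nodes for a block's replicas, at most one per physical host.

    nodes: list of (virtual_node, physical_host) pairs, in preference order.
    """
    chosen = []
    used_hosts = set()
    for vnode, host in nodes:
        if host in used_hosts:
            # Another replica already lives on this physical machine; skip it,
            # since losing that machine would take out both copies.
            continue
        chosen.append(vnode)
        used_hosts.add(host)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct physical hosts for the requested replication")


# Four VMs (made-up names), two of which share a physical machine.
cluster = [("vm1", "hostA"), ("vm2", "hostA"), ("vm3", "hostB"), ("vm4", "hostC")]
print(place_replicas(cluster))
```

A placement policy that only sees virtual nodes could happily pick vm1 and vm2, leaving two of three replicas exposed to a single hardware failure, which is the situation VM-awareness is meant to avoid.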