Data Eng Weekly

Hadoop Weekly Issue #23

23 June 2013

Hadoop Summit is coming up this week, so expect a ton of interesting content for next week's issue. Lots of folks are getting a jump start on the summit, though, and there's tons of great content from Hive to YARN to HBase… and also some interesting industry news. Enjoy!


YARN introduces fundamental changes to the Hadoop application life cycle, and the Fair Scheduler was rewritten to function as part of the YARN Resource Manager. In particular, rather than allocating slots as in MapReduce, the scheduler allocates slices of memory, which lets it support all kinds of applications. This post has an in-depth discussion of the new fair scheduler, its new and upcoming features, and also touches on the YARN architecture in the context of a scheduler.
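To make the idea of sharing memory rather than slots concrete, here is a minimal sketch of textbook max-min fair sharing. It is not the Fair Scheduler's actual code or configuration; the function name, the queue names, and the memory figures are all made up for illustration.

```python
def max_min_fair_share(total_mb, demands_mb):
    """Split total_mb across queues using max-min fairness (illustrative only).

    Queues are visited from smallest to largest demand. Each one gets either
    its full demand or an equal split of what is left, whichever is smaller.
    """
    shares = {}
    remaining = float(total_mb)
    pending = sorted(demands_mb.items(), key=lambda kv: kv[1])
    for i, (name, demand) in enumerate(pending):
        equal_split = remaining / (len(pending) - i)
        shares[name] = min(demand, equal_split)
        remaining -= shares[name]
    return shares


# Hypothetical cluster with 12 GB of memory and three queues.
demands = {"etl": 2048, "adhoc": 8192, "ml": 8192}
shares = max_min_fair_share(12288, demands)
for queue, mb in shares.items():
    print(queue, mb)
```

The small queue gets everything it asked for, and the two larger queues split the rest evenly. A real scheduler also has to handle weights, minimum shares, and preemption, none of which is modeled here.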

HBase now provides a number of snapshot features -- offline/online snapshots, archives, table cloning/restoring -- that support a number of use cases from backups to providing parallel tables for development. This post details the design and implementation of the various forms of snapshots and snapshot recovery. It'll be very valuable for anyone working with or considering using HBase snapshots.

Cloudera has a blog post highlighting the new Cloudera Search features in the latest version of their QuickStart VM. There are some cool, easy-to-use features like real-time indexing of tweets ingested via Flume, and using the HUE front-end to run Cloudera Search queries.

HBase is targeting support for Hadoop 2.0 in its 0.96 release, but there are still a number of open issues. This tutorial describes how to get started with building and running tests for HBase against Hadoop 2.0. It's a great intro if you've always wanted to contribute but didn't know how to get started.

In a follow-up to last week's post about the Kiji Hive SerDe, this post describes using Kiji+Hive for ad hoc analysis. The post covers a bunch of Hive features that are useful even if you're using Hive without Kiji -- e.g. Views and LATERAL VIEW explode. The tutorial covers a non-trivial calculation (computing user similarity) but still manages to have very readable SQL.
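For readers who haven't used LATERAL VIEW explode, the sketch below mimics its row-level behavior in plain Python: each input row is repeated once per element of an array column, with the element exposed under a new column name. The function name and the sample data are hypothetical, and this only imitates the semantics; it is not Hive or Kiji code.

```python
def lateral_view_explode(rows, array_column, alias):
    """Yield one output row per element of row[array_column].

    Rows whose array is empty produce no output, which matches a plain
    (non-OUTER) lateral view.
    """
    for row in rows:
        for item in row[array_column]:
            out = dict(row)   # keep the original columns
            out[alias] = item  # add the exploded element
            yield out


# Hypothetical input: one row per user with an array of rated items.
users = [
    {"user": "alice", "items": ["x", "y"]},
    {"user": "bob", "items": []},
]
exploded = list(lateral_view_explode(users, "items", "item"))
for row in exploded:
    print(row["user"], row["item"])
```

Once rows are in this flattened shape, per-item joins and aggregations (the building blocks of a similarity calculation) become ordinary GROUP BY work.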

Twitter presented Summingbird, its streaming MapReduce system. Written in Scala, Summingbird lets you implement an algorithm of a certain class once and run it both in batch with MapReduce and in realtime with Storm. The system provides eventual consistency -- inaccuracies in the real-time stream will be corrected/overwritten by the batch jobs when they run. Twitter plans to open-source Summingbird this July.
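The "write once, run in both modes" idea and the batch-overwrites-realtime behavior can be sketched in a few lines. This is a toy illustration in Python, not Summingbird's API (which is Scala and was not yet public at the time of writing); the function names, the hourly bucketing, and the sample data are all assumptions.

```python
from collections import Counter


def count_words(events):
    """The shared logic: sum word counts over a collection of text events."""
    counts = Counter()
    for text in events:
        counts.update(text.split())
    return counts


def merged_view(batch_by_hour, realtime_by_hour):
    """Use the batch result for any hour the batch job has covered,
    and fall back to the realtime result for hours it has not reached."""
    total = Counter()
    for hour in set(batch_by_hour) | set(realtime_by_hour):
        source = batch_by_hour if hour in batch_by_hour else realtime_by_hour
        total.update(source[hour])
    return total


# Hypothetical data: the realtime path dropped one event ("a") in hour 0.
realtime = {0: count_words(["a b"]), 1: count_words(["b"])}
before_batch = merged_view({}, realtime)

# Later the batch job processes the complete log for hour 0.
batch = {0: count_words(["a b", "a"])}
after_batch = merged_view(batch, realtime)

print(dict(before_batch))
print(dict(after_batch))
```

The same `count_words` function feeds both paths, and the merged view becomes accurate for an hour as soon as the batch job covers it. Addition of counts is what makes this safe to split across time buckets.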


GE has been building sensor networks to do analytics in industrial plants, and they're turning to Hadoop to build their latest predictive analytics tool. The tool, Proficy Historian HD, is used to process lots of data and predict failures months in advance. It's really interesting to see a big company like GE not only adopt Hadoop, but use it to power another product.

Wired has a big article about the folks and software from Berkeley's AMPLab. In particular, the article highlights Spark, which supports many of the same types of operations as Hadoop MapReduce (in fact, "Shark" provides compatibility with Hive) but caches data in RAM. The article covers some of the use cases and companies that have been deploying or investigating Spark -- in particular Yahoo and Amazon.

Cloudera announced Tom Reilly as their new CEO. Former CEO Mike Olson has become Chairman of the Board of Directors and Chief Strategy Officer. The blog post mentions that Tom led his previous company through its IPO, so Cloudera could be positioning for its own. The Register has some more coverage/speculation.

Hortonworks and MicroStrategy announced a partnership. As part of the announcement, MicroStrategy has certified MicroStrategy 9.3.1 to work with the Hortonworks Data Platform (HDP) 1.3.

Hortonworks is live-streaming the Hadoop Summit Keynotes Wednesday and Thursday starting at 8:30am PDT.

Scott Gnau of Teradata Labs raises an important point regarding fragmentation in the Hadoop ecosystem -- it's in everyone's best interest if Hadoop goes the way of Linux rather than Unix.

There's recently been a lot of coverage about the NSA's usage of Hadoop. This article looks at what compels the NSA and other government agencies to use Hadoop (spoiler: it's the same reason as every other organization -- Hadoop can handle massive amounts of data more cheaply than commercial offerings). Also interesting: the article discusses how the CIA is an investor in Cloudera.

Hortonworks and Concurrent also announced a partnership, along with the news that Concurrent's Cascading is now certified to run on HDP.


Scoobi, a Scala framework for Hadoop, hit version 0.7.0. This release uses Scala 2.10, provides the ability to run MapReduce jobs from the REPL, provides better support for counters, and contains a number of bug fixes.

Apache Bigtop 0.6.0 was released with support for ZooKeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, Solr, Crunch, DataFu, Hue, and new additions Giraph and HCatalog. Hadoop 2.0.5-alpha provides the base platform for all of the components. The release is tested and provides binaries for a bunch of Linux distros -- RHEL, Fedora, SLES, OpenSUSE, and Ubuntu.

Netflix has open-sourced their Hadoop Platform as a Service, "Genie." Genie is not a workflow engine or software to manage cluster lifecycle in the cloud, but rather a system for allocating queries to clusters. The system allows Netflix to spin up "bonus" clusters during the parts of the day when demand is extra high without worrying about which cluster should run a query.


Curated by Mortar Data

Tuesday, June 25
Hadoop Summit 2013 Pig Meetup (San Jose, CA)

Tuesday, June 25
YARN Meetup at Hadoop Summit (San Jose, CA)

Tuesday, June 25
HBase User Group Meetup at Hadoop Summit (San Jose, CA)

Tuesday, June 25
Big Data Science Meetup at Hadoop Summit (San Jose, CA)

Wednesday, June 26 and Thursday, June 27
Hadoop Summit North America (San Jose, CA)

Wednesday, June 26
Seattle Scalability Meetup: Google Compute Engine + Concurix

Thursday, June 27
Big Data & Cloud Computing - Help, Educate & Demystify

Thursday, June 27
Impala and Big Query (Herzliya, Israel)

Friday, June 28
MapR & Qubole: Apache Drill and Hive as a Service (London, UK)