Data Eng Weekly

Hadoop Weekly Issue #22

16 June 2013

HBaseCon and Cassandra Summit were both this week, so a few presentations made it into this week's newsletter. As more presentations make their way online, please be sure to let me know by tweeting @joecrobak. In the meantime, this week's newsletter is chock-full of interesting articles and news.


Apache HBase is a powerful system for storing and querying key-value data, but writing applications for HBase can be a challenge. HBase's APIs all operate on arrays of bytes, and there are a number of important tricks for building a scalable schema. New frameworks like Kiji and the Cloudera Developer Kit aim to make writing HBase applications easier. Michael Stack, the chair of the HBase PMC, shares some thoughts on the future of high-level frameworks for HBase.
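To make the "arrays of bytes" point concrete, here's a sketch of one of the classic schema tricks: building a composite row key by hand (salt prefix + user id + reversed timestamp) so rows for one user cluster together while the newest events sort first. It uses only the JDK; the field layout, widths, and names are illustrative, not what Kiji or the CDK actually do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowKeySketch {
    static byte[] rowKey(String userId, long timestampMillis) {
        byte[] user = userId.getBytes(StandardCharsets.UTF_8);
        // A one-byte salt spreads sequentially-created users across regions.
        byte salt = (byte) ((userId.hashCode() & 0x7fffffff) % 16);
        // Reversing the timestamp makes newer rows sort before older ones,
        // since HBase orders rows lexicographically by key bytes.
        long reversed = Long.MAX_VALUE - timestampMillis;
        ByteBuffer buf = ByteBuffer.allocate(1 + user.length + Long.BYTES);
        buf.put(salt).put(user).putLong(reversed);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] a = rowKey("alice", 1371340800000L);
        byte[] b = rowKey("alice", 1371427200000L); // one day later
        // Both keys share the same 6-byte salt + user prefix.
        System.out.println(Arrays.equals(Arrays.copyOf(a, 6), Arrays.copyOf(b, 6)));
    }
}
```

Frameworks like Kiji exist largely so application authors don't have to hand-roll this kind of byte-level encoding (and its matching decoder) for every table.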

At Berlin Buzzwords, Ted Dunning presented on using Apache Mahout with Apache Solr to build a recommendation engine. Many recommendation engines are built on co-occurrence (i.e. users who watch video A also watch video B), but a lot of extra information (e.g. search queries, item descriptions) is usually ignored in such a framework. To take advantage of this extra information, Ted suggests a new architecture in which the output of a MapReduce job is used to generate Solr queries that retrieve recommendations. It's an interesting idea, particularly since MapR and Cloudera have both started bundling Solr with their distributions.
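The co-occurrence counting at the heart of this style of recommender is simple to sketch, independent of Mahout or Solr: for each pair of items in the same user's history, bump a counter. In the real pipeline a MapReduce job does this at scale and the strongest indicators are indexed into Solr; here plain in-memory maps stand in, and the item names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceSketch {
    // counts.get(a).get(b) = number of user histories containing both a and b.
    static Map<String, Map<String, Integer>> cooccurrences(List<List<String>> histories) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (List<String> history : histories) {
            for (String a : history) {
                for (String b : history) {
                    if (!a.equals(b)) {
                        counts.computeIfAbsent(a, k -> new HashMap<>())
                              .merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> histories = Arrays.asList(
                Arrays.asList("videoA", "videoB"),
                Arrays.asList("videoA", "videoB", "videoC"));
        // videoA and videoB co-occur in both histories.
        System.out.println(cooccurrences(histories).get("videoA").get("videoB")); // prints 2
    }
}
```

Ted's point is that the same machinery generalizes: if "items" include search queries or description terms, not just watched videos, the co-occurrence matrix captures cross-signal indicators, and Solr's query engine becomes the serving layer.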

Cloudera Impala daemons run alongside TaskTrackers on the same physical hardware, and running Impala and MapReduce simultaneously can cause resource contention for CPU, RAM, and disk access. Cloudera Manager can use Linux control groups (cgroups) to assign priority weightings to the two components in the case of contention (e.g. allocate 25% of CPU to Impala and 75% to MapReduce). Cloudera released data from experiments with multi-tenant performance, and it shows that observed performance is often better than expected -- most likely because MapReduce and Impala don't share the same resource limitations (at least in this experiment).

Adam Ilardi presented on how eBay uses Scala with Hadoop at the NY Scala meetup. The talk is accessible to folks that don't know Scala, and it covers a bunch of Hadoop frameworks (Pig, Cascading, and Scoobi), how they've been used at eBay, and why they're so bullish on Scalding.

Yahoo! has been using YARN from the 0.23 branch of Apache Hadoop in production over the past few months. They've written and open-sourced "Storm-YARN" to run Storm and MapReduce on the same YARN cluster. This is pretty big news since it's the first mainstream YARN application aside from MapReduce (that I've heard of, at least). The blog post has more information about usage (it seems pretty straightforward) and links to the code on GitHub.

I always forget that Microsoft's cloud, Azure, supports Linux VMs as well as Windows. This post details setting up a CDH cluster with Cloudera Manager. There are more low-level details to set up than in other cloud environments -- particularly DNS -- but that also gives you much more control than you might have otherwise.

Honeycomb is a new storage backend for MySQL that uses Apache HBase. Based upon their performance tests, it seems to have some overhead vs. raw HBase, but it makes HBase accessible to anyone who is familiar with MySQL. It supports a number of important features from MySQL (such as joins, stored procedures, and views) as well as HBase (automatic sharding, replication, and integration with MapReduce). This is certainly an interesting twist on the SQL-on-Hadoop fad.

Kiji now has a Hive SerDe, which allows users to run HiveQL queries over data stored in Kiji. The HBase data model doesn't map directly to Hive's (e.g. HBase stores multiple versions of each value), so the data mapping can be controlled with SerDe properties.

Hortonworks Sandbox 1.3 includes some interesting new features, such as a visualizer that graphs the results of Hive queries as bar, scatter, pie, and other chart types, and improved virtualization support (Hyper-V support added, 32-bit OS support added, VirtualBox support improved).


Supercomputer vendor Cray announced a Hadoop solution based upon their Cray CS300 line and Intel's distribution. There are a few interesting things here. First, the supercomputer community (well Cray, at least) is accepting/acknowledging Hadoop, and we could see some interesting mixes of traditional HPC and Hadoop. Second, Cray's choice of distributions puts the Intel distribution on the map.

Cassandra Summit 2013 was this week. I saw a number of tweets throughout the conference mentioning some great talks -- I can't wait for all the slides and presentations to show up online. In the meantime, DataStax has a recap on their website.

HBaseCon 2013 was also this week, and based on some tweets that I saw, it also had a number of great presentations. I'm looking forward to the talks that will be posted online over the next few weeks.

Red Hat Summit was this week, and Hortonworks and Red Hat announced a partnership to integrate Red Hat Storage with the Hortonworks Data Platform. The partnership consists of joint work on an implementation of the FileSystem API for Red Hat Storage (GlusterFS) and enhancements to Apache Ambari to configure the file system for Hadoop.

Last week, Cloudera made news with their "unaccept the status quo" announcement. While Cloudera's position on the future of data platforms is clear, do other folks in the industry agree? CMSWire asked a number of folks from Cloudera, Microsoft, EMC, SAP, Teradata, and WANdisco about Hadoop's significance and future. While not everyone agrees 100% on how big that role will be, everyone does agree that Hadoop will be an important part of data warehousing.

Altiscale announced $12M in Series A funding. The company employs a number of ex-Yahoo, LinkedIn, and Google engineers and is building out a Hadoop-as-a-service product. Unlike other Hadoop-in-the-cloud systems, you add your data to their cluster and pay for a monthly 'baseline' amount of usage with the ability to go over if needed (somewhat analogous to a minute allotment in a monthly cell-phone contract). There are certainly a number of companies that would be happy to outsource Hadoop infrastructure and focus on their business problems instead. It will be interesting to see how Altiscale's market share picks up.


Curated by Mortar Data

Monday, June 17
Cleveland Big Data and Hadoop User Group (Mayfield Village, OH)

Tuesday, June 18
High Performance Computing (HPC) in the Cloud - By Amazon AWS (Tel Aviv-Yafo, Israel)

Tuesday, June 18
Data Scenarios for Hadoop (Cambridge, MA)

Tuesday, June 18
St. Louis Hadoop Users Group Meetup (St. Louis, MO)

Tuesday, June 18
Storm in web analytics / Storm Q&A with Nathan Marz (London, UK)

Tuesday, June 18
Dating with Models: A How-to Guide for Programmers and Architects (Santa Monica, CA)

Wednesday, June 19
Big Data & Analytics Meetup (Chicago, IL)

Wednesday, June 19
Hortonworks, the future of Hadoop, and Agile Big Data (Denver, CO)

Wednesday, June 19
Solving analytic problems in the real world (Cambridge, MA)

Wednesday, June 19
Montreal Storm Meetup (Montreal, QC)

Wednesday, June 19
Real-time Big Data (San Francisco, CA)

Wednesday, June 19
Live demo of Pig and Hive running on a MapR Hadoop cluster (Poway, CA)

Wednesday, June 19
Upgrading HBase by Vaibhav Puranik of GumGum (Santa Monica, CA)

Wednesday, June 19
Talk Big Data (Cambridge, MA)

Thursday, June 20
Real-time in a Big Data World - Cloudera Impala (Melbourne, Australia)

Thursday, June 20
Implement A Distributed Math algorithm in 2hrs: K-Means (Mountain View, CA)