Data Eng Weekly

Hadoop Weekly Issue #8

10 March 2013

I missed a few interesting tidbits of news last week, so there are a few carry-overs. As always, if you think I missed some important news, let me know, and I'll be sure to add it to next week's newsletter if appropriate.

After all of last week's announcements, there are a number of blog posts providing technical background on the new products. In addition, this week's newsletter contains quite a few other interesting releases and technical posts.


GigaOm has compiled top-10 lists of Hadoop-related coverage over the past 10 years. They link to some fun, historic articles such as when Cloudera got their Series A and Hortonworks was spun out of Yahoo. If you're newer to Hadoop and haven't been following for the past several years, this is a really great article.

Apache Mesos (incubating) is a resource scheduler with some features in common with Apache Hadoop's YARN, and it can also be used to provision and manage a Hadoop cluster. This article has an interesting summary of how Mesos came to be -- from a sorta accidental project at Berkeley's AMP Lab, to an open-source project with many contributions from Twitter, to being run in production in Twitter's data centers.

Many of the recently announced SQL-on-Hadoop systems run outside of MapReduce and store data in a columnar format in order to improve query performance. With Amazon Redshift as a proxy for many other systems, this Quora answer contains a good overview of the reasons such systems can be much faster than Apache Hive.
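To make the columnar point concrete, here's a toy sketch in Python (not taken from the Quora answer or from any of these systems; the data and function names are made up for illustration). It models a row layout, where reading a row pulls in every field, against a column layout, where a query touches only the column it needs.

```python
# Toy data: three rows with three fields each (hypothetical values).
rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_major(rows, field):
    """Sum one field in a row layout; count every field as read."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # model: a row is read as a whole
        total += row[field]
    return total, values_read


def sum_column_major(columns, field):
    """Sum one field in a column layout; only that column is read."""
    values = columns[field]
    return sum(values), len(values)


if __name__ == "__main__":
    print(sum_row_major(rows, "bytes"))
    print(sum_column_major(to_columns(rows), "bytes"))
```

Both functions return the same sum, but the column version reads a third as many values on this data. Real systems typically add compression and other tricks on top of this, which the sketch ignores.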

Not a new release, but I saw mention of JobTracker for the first time. It's a Mac menubar application with Growl notifications to tell you when your jobs start and finish -- I've found it pretty handy so far. Since the Hadoop jobtracker doesn't expose a Java or REST API, the app scrapes the job details page to detect completed jobs, so YMMV depending on your Hadoop version.

It seems like every few months we hear about a new MapReduce sorting record. Since most applications are more complex than a simple sort, it can be hard to translate these speedups to other use cases. A lot of smart folks from industry and academia are getting together to spec a new big data benchmark. The goal is to give industry and academics a level playing field to announce and test results for new systems and improvements.

We know that Salesforce are using Apache HBase due to their recent open-sourcing of Phoenix to run SQL queries on HBase, but this blog post reveals that they're Apache Pig users, too. In an interesting twist, they're also auto-generating Pig scripts from Java.

There have been a lot of product X vs. Hive bake-offs, and here's one for the latest release of Impala. It shows that the new RCFile support helps performance a bit, and that Impala comes out 10x faster than Hive in this test.

A Cascading-centric presentation about functional programming for Big Data. It also includes some interesting historical perspective, case studies from industry, examples in Cascalog (Clojure) & Scalding (Scala), and more.

Voting is open in the community choice for Hadoop Summit, which is being held this June in San Jose. The topic areas are: Applications and Data Science, Deployments and Operations, Enterprise Data Architecture, Future of Apache Hadoop, Hadoop (Disruptive) Economics, Hadoop Driven Business/Business Intelligence, and Reference Architectures. You can get up to 3 votes per track.

Axemblr's Provisionr has been voted into the Apache Incubator. Provisionr is a service for starting up cloud-based clusters and is a common solution for spinning up Hadoop clusters in EC2.

CDH4.2's HBase has an exciting new feature from trunk -- snapshots. Taking a snapshot is a fast operation because it copies only metadata, not the underlying table data. Cloudera has more details on their blog.
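As a rough illustration of why a metadata-only snapshot is cheap, here's a toy model in Python. It is not HBase's API; the class and method names are hypothetical. It assumes that a table's data lives in immutable files, so a snapshot only needs to record which files the table referenced at that moment.

```python
class ToyTable:
    """Toy table whose data lives in immutable, named files."""

    def __init__(self):
        self.files = []      # names of the data files currently in use
        self.snapshots = {}  # snapshot name -> tuple of file names

    def flush(self, file_name):
        """Add a new immutable data file to the table."""
        self.files.append(file_name)

    def snapshot(self, name):
        """Record the current file references; no file contents are copied."""
        self.snapshots[name] = tuple(self.files)

    def restore(self, name):
        """Point the table back at the files recorded in a snapshot."""
        self.files = list(self.snapshots[name])


if __name__ == "__main__":
    table = ToyTable()
    table.flush("file-1")
    table.flush("file-2")
    table.snapshot("before-load")
    table.flush("file-3")
    table.restore("before-load")
    print(table.files)
```

The cost of `snapshot` here depends on the number of files, not on how much data they hold, which is the intuition behind the speed. The Cloudera post is the place to look for how the real feature behaves.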

During the conference, MapR powered a display of tweets referencing #strataconf using MapR-FS and Storm. Probably overkill, but interesting nonetheless.

While Hadoop has had security via Kerberos for a few years, the network layer has historically been unencrypted. Apache Hadoop 2.0-alpha added support for network-layer encryption in addition to Kerberos. Cloudera has an article describing this new feature, including a walkthrough for setting it up.


I forgot to mention some exciting news from the Cloudera folks last week. They released CDH4.2 and Cloudera Manager 4.5. One of the highlights of CDH4.2 is the addition of HCatalog to their Hadoop stack. Cloudera Manager 4.5 has been in beta for a while, and the release has some exciting new features, including the ability to do rolling upgrades when installing packages via "parcels" rather than RPMs/DEBs, support for Hive as a service, and the removal of the 50-node limit for the free version.

Apache Avro 1.7.4 was released. It's mostly a bug-fix release, but it has some improvements to support the Trevni file format from MapReduce.

Impala 0.6 was released with support for RCFiles as well as a number of fixes. Impala 0.6 requires CDH4.2 and Cloudera Manager 4.5.

Spark is a framework for in-memory computation on data read out of HDFS or S3. The latest release, Spark 0.7, includes a Python API, a streaming API, performance improvements, and bug fixes. Spark really seems to be taking off -- the latest release had contributions from 31 different developers, 20 of whom were outside of Berkeley.

Apache Hadoop 1.1.2 was released. This release includes a number of bug fixes, including compatibility with IBM's JDK and fixes for Kerberos security.

The 0.96 release of Voldemort, the distributed key-value store from LinkedIn, was announced this week. This release has improvements to performance monitoring, rebalancing, and the read-only pipeline. They're also working on improving performance on SSDs and are soliciting feedback from the open-source community about how to make the project more successful.