Data Eng Weekly

Hadoop Weekly Issue #9

17 March 2013

The Hadoop news stream has calmed down significantly since the numerous announcements surrounding Strata and Hadoop Summit, but we have a lot of interesting articles and two exciting project announcements this week. There were also a number of impassioned discussions about the role of enterprise in Apache Hadoop, which could probably fill an entire issue. Rather than cataloging those links, though, I've included a single link to a compilation of articles and kept the bulk of this issue technical.

Thanks to everyone who has been spreading the word about Hadoop Weekly -- it reached 400 subscribers over the past week. As always, feel free to send me feedback about the content of the newsletter via Twitter @joecrobak or by reaching out to |redacted|. You might notice that this issue doesn't contain any event announcements -- I've found it difficult to keep on top of them -- but please let me know if you miss that section.


Hortonworks has partnered with Microsoft to help run Apache Hadoop on Windows, and they recently released a beta version of Hortonworks Data Platform with Windows support. In the process of building this support, an impressive number of features and improvements were added to Hadoop. Details, including a list of the most notable work, are featured on the Hortonworks blog. They have also published a blog post with a tutorial for running Hadoop on Windows.

Quantcast has used Hadoop to process data for the past several years. After identifying some inefficiencies in HDFS and MapReduce, they adopted Kosmos FS (KFS), evolved the codebase to produce Quantcast FS (QFS), and implemented a new sort algorithm, QuantSort. In this interview, the leader of Quantcast's R&D team and the creator of KFS describe the motivation and history of these features.

There have been a lot of conversations about proprietary vs. open source Hadoop stacks. Alex Popescu highlights a key paragraph from the ReadWrite article that spurred much of the discussion, and links to a number of articles expressing viewpoints on both sides of the debate.

Monash Research has an update on YARN, including information on when it should become stable. The article also talks about Tez (the new framework being developed to speed up Hive and Pig execution), comparing its features to those of Spark.

"Running the Largest Hadoop DFS Cluster" is a presentation by Hairong Kuang of Facebook about operating their 100 PB HDFS cluster. In the talk, she covers a number of interesting topics, such as Facebook's highly-available NameNode, their use of NameNode federation, and RAID encoding for more efficient data storage.

WANdisco is offering free online training for Hadoop -- classes cover HDFS, MapReduce and HBase.

Amazon and other cloud providers offer services for running Hadoop in the cloud, but there are often advantages to running and configuring your own cluster. In this post, the author details a number of interesting tricks for running Hadoop in AWS, including generating "rack" information based upon availability zone and using spot instances for task trackers. Worth a read if you're running or thinking of running Hadoop in the cloud.
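
To make the availability-zone "rack" trick concrete, here is a minimal sketch of a Hadoop topology script. Hadoop invokes the script named by the topology script property (topology.script.file.name in Hadoop 1.x, net.topology.script.file.name in 2.x) with one or more IP addresses or hostnames as arguments and reads one rack path per argument from standard output. The subnet-to-zone table, the zone names, and the rack_for helper below are assumptions made up for illustration; the linked post may derive the zone in a different way.

```python
#!/usr/bin/env python3
"""Illustrative topology script: report each host's availability zone as its rack."""
import ipaddress
import sys

# Hypothetical mapping from subnet to availability zone (an assumption,
# not taken from the linked post). Replace with your own layout.
SUBNET_TO_ZONE = {
    "10.0.1.0/24": "us-east-1a",
    "10.0.2.0/24": "us-east-1b",
}

# Rack reported for anything that cannot be mapped to a zone.
DEFAULT_RACK = "/default-rack"


def rack_for(host):
    """Return a rack path such as '/us-east-1a' for an IP address string."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames and malformed input fall back to the default rack.
        return DEFAULT_RACK
    for subnet, zone in SUBNET_TO_ZONE.items():
        if address in ipaddress.ip_network(subnet):
            return "/" + zone
    return DEFAULT_RACK


if __name__ == "__main__":
    # Print one rack path per argument, in the order the arguments were given.
    for host in sys.argv[1:]:
        print(rack_for(host))
```

With racks defined this way, HDFS block placement spreads replicas across more than one "rack", which here means more than one availability zone.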

Many Hadoop clusters use gigabit Ethernet, and we're starting to see more adoption of 10-gigabit Ethernet. D.K. Panda from Ohio State has been evaluating the effect of InfiniBand and other high-performance networks on HDFS, MapReduce, and HBase workloads. He shares some preliminary results on small (<25 node) clusters in this presentation.

Tony Baer of Ovum provides a detailed look at Intel's Hadoop strategy. This is the first information I've heard about their software suite, which includes a deployment and configuration system to set up Intel's optimizations. In addition to optimizations, Intel is bringing hardware-based encryption to Hadoop (and they seem to hope to integrate this into Hadoop; see e.g. HADOOP-9331).


Parquet is a new columnar format that's been jointly developed by Cloudera and Twitter (and Criteo). Unlike Trevni, the columnar storage format supported by Apache Avro, Parquet is not tied to a single serialization framework, and it will support Protobuf, Thrift, and Avro. Parquet also has a number of advanced features such as per-column encoding (run-length, delta, etc.) and a file metadata footer. At this time, Parquet has MapReduce and Apache Pig bindings. Currently under development are Apache Hive SerDes, Cascading taps, and new types of data encoding (run-length, delta, etc.).
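
For readers unfamiliar with the encodings mentioned above, here is a small sketch of why they suit columnar data: values within a single column tend to repeat or change slowly, so runs and differences compress well. The function names and sample columns are made up for illustration, and this is not Parquet's actual on-disk format, only the general idea behind run-length and delta encoding.

```python
from itertools import accumulate, groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Store the first value, then the difference between each neighbor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum of the deltas."""
    return list(accumulate(deltas))


# A low-cardinality column shrinks to a few (value, count) pairs.
countries = ["US", "US", "US", "FR", "FR"]
print(run_length_encode(countries))

# A slowly increasing column becomes one base value plus small deltas.
timestamps = [1000, 1003, 1004, 1010]
print(delta_encode(timestamps))
```

Because each column is encoded independently, a format can pick whichever scheme fits that column's data best, which is the point of per-column encoding.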

Kiji is an API and set of tools for writing applications atop HBase. The first component, released a few months ago, was KijiSchema, for creating tables in Kiji. This week, KijiMR was released, which is a set of MapReduce libraries for Kiji. KijiMR includes bulk importers, Gatherers and Producers for accessing and populating Kiji data via MapReduce, command-line tools, and several other features.