Data Eng Weekly

Hadoop Weekly Issue #53

19 January 2014

I really appreciate the positive response from everyone as I highlighted the one-year anniversary of Hadoop Weekly last week—thanks! I expect the next 12 months to be just as busy and exciting! Speaking of which, this week’s issue features an interesting story about Hike’s Hadoop infrastructure, a detailed benchmark of Cloudera’s Impala, and details on this year’s HBaseCon. There were also a number of interesting product releases, including a Hadoop FileSystem implementation for the Google Cloud Storage service.


Doug Cutting, creator of Apache Hadoop, recently spoke at The Hive (not to be confused with Apache Hive) about “The Future of Data.” At the talk, Doug made a number of predictions about the future of big data rooted in Hadoop. The predictions—e.g. that Hadoop will support OLTP workloads and challenge interactive analytic data warehouse systems—aren’t necessarily new (Cloudera has been touting the Enterprise Data Hub since Strata NYC). It’s worthwhile to read through some of Doug’s ideas, though—he’s a visionary in big data.

There are a few challenges in deploying a Hive UDF (User Defined Function) library—from packaging to uploading to a DFS to adding the jar to the classpath. This tutorial walks through those steps and more with a simple pattern-matching UDF, focusing on deployment and provisioning in Windows Azure HDInsight.
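To make the shape of such a UDF concrete, here is a minimal, illustrative sketch of the core logic a pattern-matching UDF might contain. The class and method names are hypothetical; a real Hive UDF would wrap this logic in the evaluate() method of a class extending Hive’s UDF base class, be packaged into a jar, uploaded to the cluster’s DFS, and registered in Hive with ADD JAR and CREATE FUNCTION—the steps the tutorial covers.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the core logic of a pattern-matching Hive UDF.
// In a real deployment this would live in the evaluate() method of a
// class extending org.apache.hadoop.hive.ql.exec.UDF.
public class PatternMatchUdf {

    // Returns true when the input text matches the given regex.
    // Null-safe in the way Hive expects UDFs to be: null in, null out.
    public static Boolean matches(String text, String regex) {
        if (text == null || regex == null) {
            return null;
        }
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("error: disk full", ".*error.*")); // true
        System.out.println(matches("all good", ".*error.*"));         // false
    }
}
```

Once the jar containing a class like this is on the cluster, the Hive-side registration is a one-time ADD JAR plus CREATE FUNCTION, after which the function can be used in queries like any built-in.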

The Hike engineering blog has a post on their experience with various analytics software systems. They started with Cloudera’s CDH on EC2 with Pig scripts, evaluated a few alternatives including Amazon Redshift, and ultimately decided to move to Hive 0.11 with ORC running on Hortonworks’ HDP 1.2. Since then, they’ve also tried out Hive 0.12 on Tez with HDP 2. The article has a bunch of interesting anecdotes, such as their positive experience with map joins and distinct queries on Tez, as well as some limitations, such as Hive on Tez not yet working with HiveServer.

Alex Popescu’s myNoSQL blog has a post about a paper from Microsoft Research asking if we should revisit the scale-out architecture that Hadoop has helped popularize. The reasoning is that most queries in MapReduce use only a (relatively) small amount of data, and thus computation could be sped up by funneling that data to a faster node. The post has an embedded copy of the full paper.

The Hortonworks blog features a post by Apache Hive committer Eric Hanson on performance improvements to Hive over the past 15 months. He summarizes that work (some of which isn’t yet part of a formal release), including ORC, a vectorized query engine, Tez, and container re-use. The main point, though, is that Hive is improving so quickly that many benchmarks don’t evaluate the fastest version of Hive.

Cloudera has posted benchmark comparisons showing that its Impala SQL engine outperforms Hive 0.12 and a proprietary analytics engine (which cannot be named due to license restrictions). The benchmark uses 25 queries on a 30TB dataset generated by a modified TPC-DS (which can be found on Cloudera’s GitHub). In addition to outperforming the other engines, they also show that Impala scales linearly as the number of nodes or users increases.

S3, the object store in Amazon Web Services, provides eventual consistency for writes. Most folks use it as persistent storage when running Hadoop in AWS, which can be troublesome since the Hadoop framework assumes that anything written can be read immediately. The folks at Netflix have built a metadata caching layer atop S3 using AWS’ DynamoDB. They’ve written about the design of the system as well as some of their findings on inconsistencies between the S3 API and the metadata in DynamoDB over the course of several weeks.
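The core idea can be sketched in a few lines. This is a hypothetical, in-memory illustration—plain Sets stand in for DynamoDB and S3, and all names are invented—but it shows the invariant such a layer enforces: every write is recorded in a strongly consistent metadata store, and an object-store listing is trusted only if it contains everything the metadata store says should exist.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (hypothetical names): a consistency check layered
// over an eventually consistent object store. A plain Set stands in for
// the strongly consistent metadata store (DynamoDB in Netflix's design)
// and another for the object store's listing (S3).
public class ConsistentListing {
    private final Set<String> metadataStore = new HashSet<>(); // stand-in for DynamoDB
    private final Set<String> objectStore = new HashSet<>();   // stand-in for S3 listing

    // Record the write in the consistent metadata store. The object store's
    // listing may lag behind, which is exactly the problem being detected.
    public void write(String path) {
        metadataStore.add(path);
    }

    // Simulates the object store eventually making the write visible.
    public void becomeVisible(String path) {
        objectStore.add(path);
    }

    // A listing is consistent only if every path the metadata store
    // recorded is actually visible in the object store's listing.
    public boolean listingIsConsistent() {
        return objectStore.containsAll(metadataStore);
    }
}
```

With this check in place, a Hadoop job can fail fast (or retry the listing) when the object store has not yet caught up, instead of silently processing an incomplete file set.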

Pluralsight has a post about the Hadoop distribution landscape. It covers a lot of the trade-offs between three of the top distros—MapR, Cloudera, and Hortonworks. If you’re trying to decide on a distribution (or are thinking about switching), this is a good place to get a lay of the Hadoop vendor land.


Cloudera has announced HBaseCon 2014, which will take place May 5th in San Francisco. The Call for Papers ends on February 14th, and there’s a 15% early-bird registration discount through February 10th.

This update is two weeks old at this point, but DataFu has been accepted into the Apache Incubator. DataFu is a collection of Pig UDFs (including PageRank, sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn.

ReadWrite has a post on the hype surrounding Hadoop. In particular, it talks about the challenges (and lack of motivation for some organizations) in deploying Hadoop. The article also touches on Hadoop vendors over-promising Hadoop—for instance speculating that it will one day support online transaction processing (OLTP)—which can lead to confusion or unmet expectations.

Cloudera and Koverse have announced a partnership. Together, they will continue to jointly develop Apache Accumulo for Cloudera Enterprise. In addition, they will certify that their products work together without modification.

InfoWorld has awarded Hadoop one of its 35 Technology of the Year awards. The award highlights the recent release of Hadoop 2.0 and YARN.


Pydoop 0.11.1 was released. This release includes support for Apache Hadoop 2.2.0 and 1.2.1. Pydoop takes a different approach to MapReduce and HDFS support than most other Python libraries—it uses the C (libhdfs) and C++ (MapReduce pipes) libraries to access them.

Version 0.10.1 of the Kite (formerly Cloudera Developer Kit) SDK has been released. This version resolves a handful of bugs, including issues related to partitions with dates and updating the Hive MetaStore with new partitions.

Zettaset, makers of management software for Hadoop, have announced support for encryption at rest as part of their Zettaset Orchestrator tool. The system uses 256-bit AES encryption, and they claim low overhead for processing encrypted data. Orchestrator is compatible with several distributions, although it’s not clear if encryption at rest is supported with all of them.

Google has announced the Google Cloud Storage connector for Hadoop. Google Cloud Storage, which is part of the Google Cloud Platform, is an object store built atop Colossus, the latest version of the Google File System (which inspired HDFS). The announcement introducing the new library (which implements the Hadoop FileSystem API) includes a number of benefits of doing so. Data locality is lost, but Google claims that performance is comparable to HDFS.


Curated by Mortar Data

Monday, January 20

Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop (New York, NY)

Tuesday, January 21

Maria Gullickson & Robert Fehrmann: Hadoop + Recommendation Engine (Richmond, VA)

Savanna: Elastic Hadoop on OpenStack (Rocky Hill, CT)

Munich Datageeks Meetup - January Edition (Munich, Germany)

St. Louis Hadoop Users Group Meetup (St. Louis, MO)

Wednesday, January 22

January Kiji User Group Meetup (San Francisco, CA)

Scaling HBase (nosql store) to handle massive loads at Pinterest (Santa Clara, CA)

Koverse + Samza | special Eastside Meetup (Seattle, WA)

Impala Talk - Ricky Saltzer of Cloudera (Orlando, FL)

Using Lenses to Structure State and Conquering Hadoop with Haskell (New York, NY)

Advanced Hadoop Based Machine Learning (Austin, TX)

Thursday, January 23

Get started on Hadoop (Free Hands-on Session) (San Jose, CA)

Deploying enterprise grade security for Hadoop (Saint Paul, MN)