Data Eng Weekly

Hadoop Weekly Issue #40

20 October 2013

Apache Hadoop 2.2.0 was released this week, the General Availability (GA) release. GA means that the APIs are considered stable, although there is some discussion about just how production-ready the release is. In addition, this week saw new releases of Apache HBase, Apache Hive, and Apache Pig, as well as a commitment from Hortonworks to begin supporting Apache Storm in the Hortonworks Data Platform. As expected, this week's newsletter contains a bunch of content about Hadoop 2, but there are also several non-Hadoop 2 related articles.

Congrats to the teams who worked hard on all of the releases that went out this week!


Vagrant is a tool for automating local VirtualBox or VMware VMs. The folks at Concurrent have been building Vagrant configs to automate launching a development Hadoop cluster for running the Concurrent stack. A new branch of this project contains support for Hadoop 2.2.0, which means you can have a local, fully functional Hadoop 2 cluster in about 5 steps (make sure to see

Folks from eBay, Scaled Risk, and Hortonworks partnered to improve HBase's Mean Time to Recovery (MTTR) in HBase 0.95+. To verify and validate their improvements, they've built a new test framework to simulate various failure scenarios in an HBase cluster. In particular, this post covers MTTR improvements in 8 scenarios spanning 3 properties -- read/write, node/regionserver failure, and activity level (hot/cold). In these scenarios, they find that the MTTR has improved significantly (to as little as 15 seconds) over the baseline value of 180 seconds from HBase 0.92.
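The count of 8 scenarios follows from three two-valued properties (2 x 2 x 2). Here is a minimal Python sketch that enumerates the combinations. The labels are taken from the summary above, and the scenario names actually used in the original post are an assumption.

```python
from itertools import product

# Property values as described in the summary; the exact labels used by the
# test framework in the original post are assumed, not confirmed.
OPERATIONS = ("read", "write")
FAILURES = ("node", "regionserver")
ACTIVITY = ("hot", "cold")


def enumerate_scenarios():
    """Return every (operation, failure, activity) combination."""
    return list(product(OPERATIONS, FAILURES, ACTIVITY))


scenarios = enumerate_scenarios()
for operation, failure, activity in scenarios:
    print(f"{operation:5} | {failure:12} failure | {activity}")
```

Each printed row corresponds to one failure scenario that an MTTR test would need to cover.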

Hue, the front-end for Hadoop, has a new single sign-on authentication feature. It is implemented as a new backend that's SAML 2.0-compliant. The post covers how SAML is integrated into Hue, how to configure Hue for SAML, and the various configuration options in the hue.ini config file. This new feature is scheduled to ship in the next version of the Cloudera platform.

The Hortonworks blog has a technical deep dive into HDFS 2.x's NameNode High Availability. It covers all of the components of the system -- the active and hot standby NameNodes, the JournalNodes (to which the edit log is written), and the ZooKeeper Failover Controllers. The architecture has a lot of pieces, so it helps to see the diagram and to have it explained concisely.

One of the big changes to MapReduce in YARN is that the JobTracker doesn't exist any more. Rather, each MR job creates an Application Master which lives for the duration of the job. Thus, job histories are now stored in HDFS, and there is a new web UI daemon responsible for serving up the job history. The Hortonworks blog has more details about the Job History server and how it's integrated into Ambari in HDP 2.0.

Microsoft SQL Server has a component called SQL Server Integration Services (SSIS) for data import/export. It doesn't yet have support for HDFS, but SSISHDFS is a new library that uses the Microsoft Hadoop .NET SDK to load data into HDFS. This article walks through how to set up and run SSISHDFS to load data from SQL Server to HDFS.

At the recent Apache Hadoop YARN meetup, there were a number of interesting talks covering applications on YARN and the future of YARN. Specifically, talks covered the Application History Server, ResourceManager reliability, Apache Tez, YARN at LinkedIn, Llama, and Go Hadoop. This post has a short summary of each presentation and links to more details.

The Cloudera blog has an overview of Hadoop API compatibility across versions. Cloudera plans to align its compatibility rules with those of Apache Hadoop releases (the article links to the Apache Hadoop compatibility guide, which is an interesting read). The quick summary is that you should only use APIs marked public, and if you are using a public-stable API then you are guaranteed only compatible changes for at least one more major version.


Hortonworks announced that they're adding support for Apache Storm to the Hortonworks Data Platform (HDP). By Q1 2014, they plan to have integrated Storm with Ambari, which will deploy via YARN and support monitoring in Ganglia and Nagios. They plan to add new features in subsequent releases, including integration with HDFS, HBase, & Hive, LDAP authentication, high availability, and an advanced scheduler. Storm will be integrated into a preview release of HDP 2 this fall.

Hortonworks and SAS announced a "strategic alliance" to integrate SAS's business intelligence tools with the Hortonworks Data Platform. SAS already has partnerships with Cloudera, Intel, and Pivotal, so adding Hortonworks helps round out partnerships with the key Hadoop distributions.

The SD Times has an interview with Hadoop founder Doug Cutting. In the interview, Doug talks about how Hadoop gained market share by implementing the most important features first and putting off as many advanced features as possible until the platform matured (which it's now doing, as Hadoop 2.x shows). Doug also speculates that Hadoop will continue to gain market share for the next 10 years and that Hadoop 3.0 might focus on transactions and further improving multi-tenancy. Click through to the article for even more interesting insights from Doug.

Zettaset, makers of orchestration software to automate Hadoop deployment, high availability, and security, are suing Intel. GigaOm has more details on the case, including a summary of the complaint.

DataStax has announced a trifecta of new initiatives -- "DataStax Enterprise for Startups," a new free online Cassandra training and certification course, and "DataStax DevCenter." DataStax Enterprise for Startups provides free enterprise software to startups who have raised less than $20 million and aren't generating too much revenue. DataStax DevCenter provides an IDE for writing CQL queries, navigating data schemas, and managing connections to Cassandra clusters.


Apache Pig 0.12.0 was released. It includes several new features and improvements, such as Streaming UDFs (adding interop with e.g. CPython), a new ASSERT operator (which does what you'd expect), various improvements to AvroStorage, a BigInteger data type, and compatibility improvements on Windows.

Apache Hive 0.12 was released. The Hortonworks blog has some more details on the new features (which span over 400 tickets!). Major improvements include VARCHAR and DATE types, predicate pushdown for ORCFiles, and optimizations for several common deficiencies (e.g. LIMIT in mappers, parallel order by, group by plus join). This release marks phase two (of three) of the Hortonworks Stinger Initiative to bring 100x speedups to Hive.

The highly anticipated Hadoop general availability (GA), Hadoop 2.2.0, was released. The GA label for this release means that APIs are stable, and it is out of beta. The release has tons of new features. A few of the most noteworthy are YARN, NameNode High Availability, and HDFS snapshots. Both the Cloudera and Hortonworks blogs have recaps of the new features in Hadoop 2. Of note, Cloudera affirms that CDH5 will be entirely based on YARN, and Hortonworks talks about some of the future work to improve YARN.

Shark, the platform from the folks at Berkeley's AMPLab that implements Hive on Spark, released 0.8.0. The new version includes a number of new features, including improved shuffle performance for larger aggregations/joins, in-memory columnar compression (run-length and dictionary), and partition pruning. In addition, it's based on Spark 0.8.0, which adds a web-based monitoring UI and new scheduler support.
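For readers unfamiliar with the two compression schemes mentioned above, here is a minimal Python sketch of run-length and dictionary encoding applied to a single column of values. It illustrates the general techniques only and is not Shark's implementation. All function names and the sample column are hypothetical.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(column):
    """Replace each value with a small integer code into a value dictionary."""
    dictionary = []   # code -> value, in order of first appearance
    index = {}        # value -> code
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Map integer codes back to their original values."""
    return [dictionary[code] for code in codes]


# Hypothetical sample column with few distinct values and long runs.
column = ["US", "US", "US", "DE", "DE", "US"]
rle = run_length_encode(column)
dictionary, codes = dictionary_encode(column)
```

Run-length encoding pays off when equal values sit next to each other, while dictionary encoding helps whenever a column has few distinct values regardless of their order.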

Apache incubator project Blur released version 0.2.0-incubating. Blur provides a search engine with sub-second latency for data stored in HDFS. It accomplishes this by combining a number of Apache projects -- Hadoop, Lucene, Thrift and ZooKeeper. Version 0.2.0 is the first release inside of the Apache incubator, and it has some exciting new features like index snapshots, support for GIS data types, and an upgrade to Lucene 4.2. In all, over 100 tickets were resolved, including many bug fixes.

Apache HBase 0.96.0 was released. The 0.95 series of HBase releases (in fact all odd-numbered releases) was a development series, so this release series supersedes the 0.94.x series, which was first released in May 2012. It has a ton of new features including protobuf-based RPC, improved Mean Time to Recovery, client-side types, and more (see the announcement for a more thorough list). The 0.96 release is compatible with the 1.x and 2.x Hadoop releases, but it is not binary compatible with the 0.94.x HBase series (clients have to be upgraded). It also requires a full-cluster restart to upgrade.


Curated by Mortar Data

Monday, October 21

Monday Lunches at Jefferson Tap (Chicago, IL)

Hadoop Machine Learning Project (Austin, TX)

Chicago Big Data Analytics Group Kickoff (Chicago, IL)

Tuesday, October 22

Hands-on Programming using Hadoop, MapReduce and Hive (Madison, WI)

R & Hadoop and Panel on "Resolved: Traditional Statistics is Dead" (Denver, CO)

Divide and Recombine (D&R) for Large Complex Data (New York, NY)

Introduction to HD Insight - Harin Sandhoo (Vienna, VA)

October Hadoop Meetup: Streaming analytics and approximation (London, England)

Wednesday, October 23

Spring Seminar Series: Big Data Infrastructure (Tacoma, WA)

Logging Solutions (Chicago, IL)

San Francisco Kiji Meetup (San Francisco, CA)

CLDS Tech Talk: YARN (Yet Another Resource Negotiator) (La Jolla, CA)

So You Added ZooKeeper To Your Stack, Now What? (New York, NY)

Cloudera Search + RDS (MySQL) (Seattle, WA)

Hadoop Based Machine Learning

Thursday, October 24

Developer Meetup: What's after 0.96? 1.0? (Palo Alto, CA)

Blizzard's Hadoop Program (Irvine, CA)

Ted Dunning Presents "Real-time on Hadoop: A New Approach Makes it Easier" (Boulder, CO)

Solr + Hadoop = Big Data Search (New York, NY)

HBase Meetup at HortonWorks in Palo Alto (Palo Alto, CA)

MongoDB Madrid MUG - "MongoDB + Hadoop integration". (Madrid, Spain)

Saturday, October 26

EXPERTALKS: Oct 2013 - Fire it up with Ember (Pune, India)