Data Eng Weekly

Hadoop Weekly Issue #37

29 September 2013

I've had a few conversations with folks recently about the massive amount of Hadoop-related content that's published on a weekly basis. When I first started Hadoop Weekly, I basically included every post that I could find. But now there are too many articles, and I end up choosing the ones that I think have the most appeal and widest applications. There are inevitably a bunch of high-quality articles that I end up cutting, and it's always a really hard decision.

With that said, if you ever think that I've missed something important, let me know and I'll try to include it in next week's issue.


Hortonworks and Qubole have partnered on a new Hive Cheat Sheet focused on Hive Functions (both built-in and user-defined). In addition to the cheat sheet, the blog post contains a great introduction to Hive -- everything from metadata operations (e.g. SHOW TABLES and SHOW FUNCTIONS) to querying to writing UDFs.

Slides and a recap from last week's Apache Ambari Meetup event are on the Hortonworks blog. There are three slide decks -- the first covers the past few releases of Ambari, the second covers support for YARN, and the third covers support for NameNode HA. It's pretty impressive to see how far Ambari has come in the past several months -- it's gained a number of features previously only in commercial management tools.

Mortar's Hadoop offering has supported CPython for Pig for a while now, but Apache Pig's Python support was limited to Jython and streaming. As of this week, Pig trunk and the 0.12 branch include native CPython UDF support. The new compatibility means that a lot of previously unavailable libraries (such as numpy and scipy) are now available for Pig-Python users. To be clear, this support isn't yet part of an Apache Pig release, but will most likely be in the upcoming 0.12 release.
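As a rough illustration of what a CPython UDF for Pig might look like, here is a minimal sketch. It assumes the script is registered through Pig's streaming_python support and that Pig supplies the pig_util.outputSchema decorator; the function name stddev and the shape of the input bag are made up for this example, and the fallback decorator is only there so the file can be run outside Pig.

```python
import math

try:
    # Assumed to be supplied by Pig when the script is registered as a
    # streaming_python UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback so the module can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("stddev:double")
def stddev(bag):
    """Population standard deviation of the first field of each tuple.

    A Pig bag is assumed to arrive as a list of tuples (hypothetical
    input shape for this sketch).
    """
    values = [float(row[0]) for row in bag if row[0] is not None]
    if not values:
        return None
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)
```

In a Pig script, a file like this (say, a hypothetical stats_udf.py) would be pulled in with something along the lines of REGISTER 'stats_udf.py' USING streaming_python AS stats; and then called as stats.stddev(...) on a grouped bag. The same function could just as well import numpy once the interpreter running it is plain CPython.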

The Cloudera blog has the first in a multi-part series on the HBase Thrift interface. This post covers what's needed on the HBase side to run an HBase Thrift server, as well as sample code in Python to connect to the server. It also has some examples for converting between HBase Thrift wire data and Python types, and describes what to expect in terms of RPC error responses.
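To give a flavor of the wire-data conversion problem (this is not the code from the Cloudera post), here is a small sketch. HBase treats row keys and cell values as raw bytes, so a Python client has to pack and unpack them itself. The sketch assumes numeric values were written as 8-byte big-endian signed longs, which is what Java's Bytes.toBytes(long) produces; the helper names are hypothetical.

```python
import struct


def encode_long(value):
    """Pack a Python int as an 8-byte big-endian signed long."""
    return struct.pack(">q", value)


def decode_long(raw):
    """Unpack an 8-byte big-endian signed long into a Python int."""
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    return struct.unpack(">q", raw)[0]


def encode_str(text):
    """Encode text as UTF-8 bytes for use as a row key or cell value."""
    return text.encode("utf-8")


def decode_str(raw):
    """Decode UTF-8 bytes read from HBase back into text."""
    return raw.decode("utf-8")
```

The length check matters in practice: a cell written as a 4-byte int or as a decimal string will not decode as a long, and failing loudly beats silently returning a wrong number.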

Given the changes to and generalization of the computational model in Hadoop 2.0, there's a need for a new, more general scheduling framework. To that end, YARN contains a revamped capacity scheduler that manages the resources of the NodeManagers in the cluster (as opposed to Map/Reduce tasks in MRv1). The Hortonworks blog has an overview of the capacity scheduler's features, configuration options, and access controls.
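One detail of the capacity scheduler's configuration that is easy to trip over: each queue's capacity (the yarn.scheduler.capacity.<queue-path>.capacity property) is a percentage of its parent queue, not of the whole cluster, and sibling queues are expected to add up to 100. The sketch below works through that arithmetic. The queue names and percentages are invented, and the helper functions are illustrative only, not part of YARN.

```python
def absolute_capacities(capacities):
    """Map each queue path to its guaranteed fraction of the whole cluster.

    `capacities` maps a dotted queue path below root (e.g. "root.prod.etl")
    to that queue's capacity as a percentage of its parent queue.
    """
    result = {"root": 1.0}
    # Sort by depth so every parent is resolved before its children.
    for path in sorted(capacities, key=lambda p: p.count(".")):
        parent = path.rsplit(".", 1)[0]
        if parent not in result:
            raise ValueError("missing parent queue for %s" % path)
        result[path] = result[parent] * capacities[path] / 100.0
    return result


def siblings_sum_to_100(capacities):
    """Check that the children of every parent queue add up to 100 percent."""
    totals = {}
    for path, percent in capacities.items():
        parent = path.rsplit(".", 1)[0]
        totals[parent] = totals.get(parent, 0) + percent
    return all(abs(total - 100) < 1e-9 for total in totals.values())


# Hypothetical queue layout, for illustration only.
queues = {
    "root.prod": 70,
    "root.dev": 30,
    "root.prod.etl": 50,
    "root.prod.reports": 50,
}
```

With this layout, root.prod.etl is guaranteed 50% of 70%, i.e. roughly 35% of the cluster.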

Cloudera is planning two big releases of Impala -- the 1.2 release by the end of 2013 and a 2.0 release in 2014. Their blog has some details on the features expected for both releases. The 1.2 release will be getting in-memory HDFS caching as well as Java and native UDFs. Impala 2.0 will be getting more analytic window functions and support for nested data types (as well as much more).

The Hortonworks blog has a third post in the series on Apache Tez, the data processing framework built on YARN. This post focuses on the Java API for Tez, which is composed of three main parts -- Input, Processor, and Output. Covering how all the pieces fit together, the post uses MapReduce and MapReduceReduce as concrete and familiar examples. It also contains an overview of some implementations of the main interfaces that are bundled with Tez.

Michael Brown, CTO of comScore, presented on their Hadoop stack at the Big Data Warehouse Meetup in New York. comScore ingests an enormous amount of data across billions of users, and there are some interesting lessons learned in scaling a MapReduce job to a dataset this large. In particular, they had issues with the speed of the shuffle phase, which they solved by partitioning the data into sorted chunks. In addition, the presentation covers how comScore uses Syncsort's DMX software.
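The general idea behind sorted chunks can be sketched in a few lines. To be clear, this is not comScore's implementation, just a toy illustration of why pre-sorted partitions help: once each chunk is sorted by key, a later stage only has to do a cheap streaming merge instead of a full sort. The function names and the hash-based partitioning are assumptions made for the example.

```python
import heapq
from operator import itemgetter


def sorted_chunks(records, num_chunks):
    """Partition (key, value) records by key hash and sort each partition."""
    chunks = [[] for _ in range(num_chunks)]
    for key, value in records:
        chunks[hash(key) % num_chunks].append((key, value))
    return [sorted(chunk, key=itemgetter(0)) for chunk in chunks]


def merge_chunks(chunks):
    """Lazily merge already-sorted chunks into one key-ordered stream."""
    return heapq.merge(*chunks, key=itemgetter(0))
```

Because heapq.merge is lazy, the merged stream never has to hold more than one record per chunk in memory at a time.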

OSSEC is a generic, open-source intrusion detection system. This post explains how to integrate OSSEC with a secure Hadoop deployment in order to generate security events related to Hadoop. In particular, it focuses on gathering authentication failure events from the NameNodes' logs. This is accomplished by configuring a decoder in an XML configuration file and adding some alert rules to a second configuration file.

Slides from the recent Toronto Hadoop User Group meetup have been posted online. The slides cover Apache HBase, providing a really excellent technical deep dive into all aspects of the system. Specifically, they cover the HBase architecture, the HBase data model, the read and write paths, operational aspects, SQL integration, and more.


Databricks is a stealth startup founded by the team behind Apache Spark and Shark. They made the news this week when it was discovered that they've secured $13.9 million in venture capital funding in a round led by Andreessen Horowitz. GigaOm has more coverage on Databricks as well as background on Apache Spark and Shark.

Datanami has coverage of Intel's partnerships with Oracle and SAP, and Intel's strategy of going after the enterprise Hadoop market. The post notes that there are around nine Hadoop distributions, and companies like Oracle and SAP would rather partner than start maintaining their own. The article also predicts that we'll soon start seeing a downward trend in the number of Hadoop distributions.


Pydoop 0.10.0 was released with fixes and improvements. In particular, it now has support for CDH 4.3.0 and a new walk() method for the HDFS API.

Apache Sentry, a fine-grained authorization system for Hadoop, released version 1.2.0-incubating. Over 40 issues were closed as part of this release, including a number of bug fixes and improvements to the testing infrastructure.

Apache Cassandra 1.2.10 and Cassandra 2.0.1 were released. The 1.2.10 maintenance release resolved over 20 issues, and the 2.0.1 release contains dozens of fixes and improvements -- too many to list here.

Amazon has open-sourced samples and the bootstrap action code for Elastic MapReduce on GitHub. The examples cover a number of different languages and frameworks such as Node.js and Spark. The bootstrap actions contain code for deploying and configuring systems such as Ganglia, HBase, and Accumulo.

gohadoop is a new framework for writing YARN applications using the Go language. As explained in a post on the Hortonworks blog, one of the goals of the framework was to validate the cross-language compatibility of the YARN API (which is built with protocol buffers). The gohadoop framework includes some low-level wrappers of the Hadoop RPC protocol as well as a higher-level 'yarn_client' API. The code, on GitHub, ships with a distributed shell application.

Llama, or "Low Latency Application MAster," is a new project from Cloudera to connect Cloudera's Impala with YARN. The project website has a lot of technical details that are dense and low-level, but there is also a good explanation of what it does and the motivation. In essence, Llama allows Impala to run outside of YARN but to follow resource allocations that are given out by YARN. Marrying two complex distributed systems like Impala and YARN is a tricky task, and the docs explain many possible solutions and why Llama's architecture was chosen, and give some background on how Llama is implemented.


Curated by Mortar Data

Monday, September 30

U-CHUG Kickoff Meeting: Deploying the Next Generation of Hadoop at Yahoo (Urbana, IL)

Spark 0.8 release and Spark at Bizo (San Francisco, CA)

Presenting MongoDB + Hadoop integration (Barcelona, ES)

Tuesday, October 1

Storm at TheLadders - Tuning IO Bound Storm Topologies (New York, NY)

The challenge of the on-line serving massive batch-computed data sets (Tel Aviv-Yafo, Israel)

Data Science - R and Hadoop (Use cases and Deep dive by Revolution Analytics) (Flemington, NJ)

Wednesday, October 2

John Foreman: Learning to do data science at MailChimp (Atlanta, GA)

Thursday, October 3

Hadoop / HBase (Paris, FR)

Building a Time Series Database for Financial Data with Cassandra (Boston, MA)

Friday, October 4

Les DataGeeks à l'OpenWorldForum (Paris, FR)

Saturday, October 5

Bangalore Baby Hadoop Group Next Meet-Up (Bangalore, India)