Data Eng Weekly

Hadoop Weekly Issue #43

10 November 2013

I was expecting to have a short and sweet newsletter this week as everyone recovered from the Strata/Hadoop World hangover. But it turns out there’s a lot of great content this week (and even more great stuff which I couldn’t squeeze in). If I had to pick a them for this week, it seems to be YARN — memcached on YARN, Tez on YARN, and guides to migrating to YARN. The folks at Facebook also open sourced (they’re decidedly not-YARN) SQL-on-Hadoop system, Presto. Enjoy the articles.


The Cloudera blog features a post by Concurrent (the makers of Cascading) on using Cascading Pattern with Hadoop. Cascading Pattern is library of machine-learning algorithms. Most interestingly, Pattern supports an industry standard file format for defining predictive models, Predictive Model Markup Language (PMML). Thus, you can do something like build a model on a small set of data in R and then use the same model on the full dataset in Hadoop with Cascading Pattern. The post includes a tutorial for doing just that.

The Hortonworks blog has a post on running memcached on YARN (so-called ‘moya’). The post describes adapting the distributed shell example to run the JMemcache daemon (which is a java implementation of the memcached protocol). In order to make the cluster useful, the nodes register themselves with Zookeeper (which a client can use to find the nodes in the cluster). This seems like it’ll be a fairly common approach to exposing dynamic resources created in YARN. The post includes the code for moya and an example of running a moya cluster.

In the latest in his series on MapReduce frameworks, Matthew Rathbone has a post on using Scoobi (which is a Scala library for MapReduce) to implement a left outer join. The Scoobi solution is a bit longer than the Pig and Hive solutions, but it’s much more succinct than the Java solution (and also offers static typing).

In the seventh article in their series on Apache Tez, the Hortonworks blog has a walkthrough of Tez’s support for sessions. A Tez session is analogous to a database session in which a user opens a shell and communicates with the DB. In Tez, this amounts to starting an Application Master which has reserved resources for the user. The post covers the motivation for Tez sessions, the session API, the performance benefits, and provides an example using the command-line to run a sessionized Tez query.

The Knewton blog has a post on MapReducing over data stored in SSTables, the storage format of Cassandra. In order to prevent overloading the production Cassandra cluster when running MapReduce jobs, they use backup data stored in Amazon’s S3 (which doesn’t require a running cluster). Knewton has published some libraries in the KassandraMRHelper project to assist in running jobs over Cassandra data without a Cassandra cluster.

Allen Wittenauer, who is an outspoken Hadoop system administrator, presented at the LISA conference on Hadoop for SysAdmins. His presentation has a good overview of some of the initial setup required for deploying Hadoop, and it also has some choice quotes like “Hadoop isn’t designed for system administrators and/or support staff.” If you’ve deployed Hadoop or are planning to deploy Hadoop, this talk has a lot of valuable information.

YARN and MRv2 introduce a lot of changes versus traditional (version 1.x) MapReduce. The Cloudera blog has two posts detailing the changes and what’s needed to migrate to YARN. The first post is focused on the migration for developers and the other is focussed on the migration for operators. I found that the latter post had a lot of interesting details, from migrating configurations to the changes in resource configuration options (RAM and vcpu slots rather than map and reduce slots) to the YARN scheduler to the upgrading (and more). Both posts also have details about the web UIs and the basic architecture changes.

Facebook has open-sourced its SQL-on-Hadoop solution, Presto. The post has some information on the architecture and roadmap. Presto supports many data sources, including HDFS, HBase, and Scribe. Interestingly, Presto is deployed in multiple Facebook data centers, and they have scaled it to 1,000 nodes. Presto is used by over a thousand users to process over 1 PB daily.

Apache Zookeeper PMC Member Camille Fournier gave an introductory talk on Zookeeper. Her talk covers the purpose and typical use-case of Zookeeper in a distributed system. It also covers the data model, zookeeper watches, and the client API. g33ktalk has posted the video and links to the slide.

Apache Drill committer Timothy Chen has written up an overview of the lifetime of a query in Drill. A query starts as SQL at a client, which is then parsed by the Optiq library, which generates a logical query plan. The logical plan is then converted into a physical plan, which is made up of fragments which are actually executed on the Drill cluster.

Dryad is a data processing framework that was built by Microsoft Research that shares some similarities to Apache Tez. This post talks through the origins of the two projects, the features and concepts that they share, and the overall architecture (YARN, resource management, etc). There is also a conclusion section with some key take-aways.


MapR announced new security features for their distribution. Their so-called “Native Authentication” adds a non-Kerberos-based authentication solution that leverages key-based security. Given the investment required in deploying and configuring Kerberos, this lowers the barrier to entry for securing MapR’s distro.

CloudTimes summarizes an IDC survey commissioned by Red Hat entitled “Trends in Enterprise Hadoop Deployments.” Highlights include that 32 percent of companies already have Hadoop deployed and 31 percent intend to deploy Hadoop in the next 12 months. There’s also information about how Hadoop is being used in conjunction with other databases, the types of problems being tackled with Hadoop, and information about the various filesystems being used. The article also has a link to the full 39 page report.


Snakebite 1.3.4 and 2.0.0 were released. The 1.3.x series supports protocol version 7 and 8. Version 2.0 proves compatibility with Apache Hadoop 2.2.0 and Hortonworks HDP 2.

Parkour is a new library for writing MapReduce jobs in idiomatic Clojure. From the announcement, “Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions.” The announcement has an example of the classic word count program, and a link to the code & documentation on github.!msg/clojure/fgAlkFU0guI/nPKH1VS5xZ8J

Amazon announced an upgrade to the Elastic MapReduce console. The new version features the ability to target a specific Availability Zone, clone a cluster, create clusters with IAM roles, and much more. The post announcing the new console includes a number of screenshots to give you an idea of the new features.

Apache Bigtop 0.7.0 was released. Per the release notes, this version adds support for SolrCloud 4.5, Phoenix (the SQL-on-HBase library), Apache Spark, and an improved HUE web UI. The release uses Hadoop 2.0.6-alpha, HBase 0.94.12, and Hive 0.11.0

Apache HBase 0.94.13 was released. This version includes 30 fixes, performance improvements, and more. Users of the 0.92.x and 0.94.x series of HBase can do a rolling upgrade to 0.94.13 without downtime.

Apache Curator, the Java library for interfacing with Apache Zookeeper, released version 2.3.0. This release resolves 20 issues, mostly bug fixes.

DataStax has open-sourced their C++ Driver for Apache Cassandra. The driver supports CQL, and it contains all the major features found in the Java, C# , and Python drivers. The first beta release is forthcoming, but source code has been posted on github.


Curated by Mortar Data ( )

Monday, November 11

Intro to Accumulo (Houston, TX)

Introduction to Hadoop and the Hadoop Ecosystem (Champaign, IL)

November Meetup: Strata special and panel discussion (London, England)

Tuesday, November 12

Security and Hadoop with Apache Knox (Cambridge, MA)

Wednesday, November 13

Spark, Scala, and the Berkeley Data Analytics Stack (London, England)

Hadoop Based Machine Learning (Austin, TX)

November Meetup - Revolution R Enterprise 7 (Portland, OR)

Thursday, November 14

November Meetup in Karlsruhe (Karlsruhe, Germany)

Developing Applications with Hadoop 2.0 and YARN (Dallas, TX)

BDNSHH November (Hamburg, Germany)

Saturday, November 16

Big Data Camp LA 2013 (Los Angeles, CA)