Data Eng Weekly

Hadoop Weekly Issue #59

02 March 2014

Apache Hadoop 2.3.0 was released this week. It’s the first release since Hadoop 2 was declared GA last October. This week has a number of technical articles from folks sharing details on their big data pipelines, which I always find interesting.


A post entitled “Analytics at Github” describes the evolution of the GitHub analytics stack from a Rails and Cassandra-based system to one that uses Kestrel, S3, and Hadoop to process data which is stored in Cassandra and served via Rails. The post follows the repository traffic graphs feature, but it explains that the system is general purpose.

Among other things, YARN helps improve cluster utilization because, unlike with MRv1, Map and Reduce slots aren’t statically allocated. This presentation discusses HDFS architecture, the motivation for YARN, Tez and YARN (including how YARN uses CGroups), and wraps up with a detailed overview of using the YARN Capacity Scheduler for building multi-tenant clusters.
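To make the utilization point concrete, here is a rough toy model in Python (not taken from the presentation). It assumes a cluster with 20 equally sized units of capacity and a map-heavy phase of a job; the function names and numbers are made up for illustration. Real YARN containers are sized by memory (and optionally CPU) rather than counted as identical units, so treat this as a sketch of the idea only.

```python
# Toy model only: compares MRv1-style fixed slots with a YARN-style shared pool.
# All names and numbers here are illustrative assumptions.

def running_with_static_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """MRv1-style: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def running_with_shared_pool(total_containers, pending_maps, pending_reduces):
    """YARN-style (simplified): any pending task may take any free container."""
    return min(total_containers, pending_maps + pending_reduces)


if __name__ == "__main__":
    # A map-heavy phase: 16 map tasks are waiting and no reduce tasks are ready.
    static = running_with_static_slots(10, 10, 16, 0)
    shared = running_with_shared_pool(20, 16, 0)
    print(f"static slots: {static} of 20 units busy")
    print(f"shared pool:  {shared} of 20 units busy")
```

In this sketch the fixed split leaves the ten reduce slots idle while map tasks queue, whereas the shared pool can hand that capacity to the waiting map tasks.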

Mortar has built a cheat-sheet for Apache Pig. The PDF gives a tour of Pig's data types and relational operators (like JOIN, GROUP BY, etc.). It also covers syntax tips, quick examples of writing and using Pig UDFs in Java/Python/Ruby, and a number of built-in functions, and it includes example translations from SQL to Pig for several common queries.

Former Etsy engineer Dan McKinley presented on Scalding at Etsy. The talk describes how Etsy is using Scalding to build products and answer unanticipated big data questions. It walks through the history of Etsy’s analytics platform, which started with Cascading.JRuby in 2009 and moved to Scalding in 2013. The talk shows some example code in the two frameworks and explores runtime differences between Cascading.JRuby and Scalding.

Mike Driscoll, CEO of Metamarkets, has written an article full of good advice for implementing data pipelines. In the post, he elaborates on five tips: “Stay Close To The Source,” “Avoid Processed Data,” “Embrace (And Enforce) Standards,” “Let Business Questions Drive Data Collection,” and “Less Data Extraction, More API Action.” It’s a really good read about several of the big-picture items important to anyone building data pipelines.

The Ooyala Engineering blog has a post with some tips for using Parquet and Scrooge with Apache Spark. The post walks through setting up an sbt project with the scrooge-sbt-plugin to generate Scrooge classes from Apache Thrift definitions, and then shows how to read Parquet files for Scrooge classes inside of Spark. Since Spark uses the Hadoop APIs, much of the post is applicable to MapReduce as well.


The community vote is open for Hadoop Summit North America taking place in San Jose in June. The vote is open until March 14th.

MapR seems to be aggressively expanding in the Asia Pacific region. This week, they announced the appointment of a new country manager for Australia and New Zealand.

The Apache Software Foundation formally announced this week that Apache Spark has graduated to become a top-level project. Per the release, Apache Spark has seen contributions from over 120 developers and is in use at Cloudera, IBM, Intel, Yahoo, and other companies.

GigaOm has coverage of Cloudera’s Oryx project, which is a relatively new open-source project for doing machine learning. Oryx is based upon code from Myrrix, which Cloudera acquired in June of last year. There is some interesting insight in the post, including the fact that Oryx’s backend is being rewritten to use Apache Spark instead of MapReduce. The article also notes, though, that Oryx probably won’t be included as part of CDH anytime soon.


Apache Hadoop 2.3.0 was released. Of note, this version includes some exciting features like HDFS caching and heterogeneous storage (e.g. to store certain data on SSD rather than SATA drives). There are also a number of fixes and enhancements in the release. The Hortonworks blog has some more details about the release and a look forward to the upcoming 2.4 release.

Cloudera announced new versions of CDH 4, Cloudera Manager, and Cloudera Search. CDH 4.6 includes fixes and improvements to Apache HBase, HDFS, Flume, Oozie, MapReduce and YARN. Cloudera Manager 4.8.2 includes a handful of bug fixes, and Cloudera Search 1.2.0 adds support for MapReduce jobs via YARN.

Apache HBase 0.94.17 was released this week. The new release contains 35 maintenance fixes and improvements.

Rubydoop, a Ruby library for Apache Hadoop, released version 1.1.3. This release includes two bug fixes.

DataStax, which offers enterprise software and support for Apache Cassandra, announced version 4.0 of DataStax Enterprise. The new version includes an in-memory database option and improvements for their Apache Solr-based search product. Alongside the release, DataStax also released new versions of their management software OpsCenter and the Java Driver for Cassandra.


Curated by Mortar Data



Frontier Big Crowd on Big Data Cloud Brainstorming South Bay Series (Palo Alto) - Monday, March 3

Frontier Peninsula Series: Real-time Hadoop - See the Elephant Fly (Palo Alto) - Wednesday, March 5

Customer Case Studies of Big Data Analytics on Hadoop, by Karen Hsu, Datameer (Mountain View) - Thursday, March 6

BigData/Machine Learning Working Group Sponsored by Pivotal (Mountain View) - Friday, March 7

Big Data Analytics (Irvine) - Friday, March 7


Winter 2014 Seminar Series: Big Data Infrastructure (Tacoma) - Tuesday, March 4

Rescheduled Elasticsearch Meetup (Seattle) - Thursday, March 6


Elasticsearch 1.0 - Whats new and how are people using it? (Portland) - Tuesday, March 4


Advanced Hadoop Based Machine Learning (Austin) - Wednesday, March 5


Hands-on Hadoop with MapReduce, Hive and Impala (Milwaukee) - Wednesday, March 5


The Future of Data by Doug Cutting (Herndon) - Monday, March 3

Securing Hadoop (Herndon) - Thursday, March 6


Doing Data with Go and Clojure (Columbia) - Wednesday, March 5

New Jersey

Cowerkshop - "Hadoop - Introduction to processing Big Data" (Asbury Park) - Saturday, March 8


Monthly Solution Architect Scrum (Toronto, ON) - Wednesday, March 5


MongoDB and Hadoop (Stockholm) - Monday, March 3


Bangalore Hadoop Meetup (Bangalore) - Saturday, March 8

Machine Learning Course (Bangalore) - Saturday, March 8