Data Eng Weekly

Hadoop Weekly Issue #34

08 September 2013

There are nine(!) Hadoop User Group meetups across four continents this week (not to mention the four other Hadoop-related events). If you had any doubt that Hadoop is taking off, then hopefully these events erase the disbelief.

There were quite a few interesting bits of news this week -- the release of Summingbird from Twitter, Cloudera Search hitting GA, and some updates from Hortonworks on the Stinger Initiative for Hive. And lots more great content - enjoy!


In August, the NYC Pig User Group hosted Donald Miner, CTO of ClearEdge IT Solutions, who gave a talk on when you should use Pig vs. MapReduce. The talk covers a lot of ground -- e.g. when/why you should use Pig, development productivity with Pig, Pig UDFs, and things that are hard to do with Pig. The Mortar blog has a quick summary, the video, and the slides.

The HDP 2.0 beta from Hortonworks has some new Hive features that are not yet in an Apache Hive release. The new features -- including vectorization and an improved query optimizer -- are part of the Stinger initiative, which aims to increase Hive performance by 100x. The post covers the progress of the Stinger initiative and highlights some of the new features and example performance gains.

In another post on Hive performance in HDP 2.0 beta, the Hortonworks blog talks about two benefits of the new ORCFile format first introduced in Hive 0.11. The first benefit is data size -- storing synthetic data from the TPC-DS dataset as ORCFile reduces file size from 585GB in Text to 131GB (a savings of 78%). The second benefit is the ability to skip large portions of data when only using certain columns (predicate pushdown). The post also covers how to use ORCFile in Hive.
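As a quick check on the size figures quoted above, here is a minimal sketch of the arithmetic. The function and variable names are my own for illustration; only the two sizes come from the post.

```python
# Sizes quoted in the summary above, in gigabytes.
text_size_gb = 585
orc_size_gb = 131


def percent_saved(before, after):
    """Return the size reduction as a percentage of the original size."""
    return (before - after) / before * 100


savings = percent_saved(text_size_gb, orc_size_gb)
print(f"{savings:.1f}% smaller")  # 77.6% smaller, which rounds to the quoted 78%
```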

Revolution Analytics provides an enterprise software suite for R, including a distributed computing framework (DistributeR) and a suite of parallel algorithms for R (ScaleR). The upcoming version 7 of Revolution R Enterprise will support running ScaleR on Hadoop MapReduce. These slides (along with a linked video of the presentation) provide some details on the Revolution R + Hadoop integration.

This is one of the better introductions to Hadoop that I've seen. The talk covers core Hadoop, the Hadoop Ecosystem, use cases, as well as the up and coming parts of the ecosystem -- YARN, Storm, Spark, and Apache Giraph. The talk has a number of detailed and useful architecture diagrams covering several parts of the ecosystem.

I recently gave a talk at the NYC Data Engineering Meetup where I spoke about workflow engines for Hadoop -- covering Apache Oozie, Azkaban (from LinkedIn), and Luigi (from Spotify). Having deployed each of these (in various versions) to production, I try to give an objective overview of their strengths and weaknesses -- and I give my opinion about what works best in the last few slides.

This video from the July Mesos Meetup covers Hadoop on Mesos at Airbnb. In addition to Hadoop itself, Airbnb's workflow engine, Chronos, runs on Mesos.


Merv Adrian of Gartner has an insightful post about Hadoop terminology -- in particular the use of the term "proprietary." The main point is that "proprietary" is an overloaded term in the Hadoop industry -- it's used when speaking about closed-source components, open-source components that aren't from the ASF, and also ASF projects that have been modified. Merv suggests using a new term -- "distribution-specific" -- which is more precise.

SyncSort, which builds an ETL product for Hadoop, has partnered with Tableau. The companies' tools integrate to provide a complete raw-data-to-visualization system.


DataFu, the Pig UDF toolchain, has reached version 1.0. It has gained a lot of new features since its initial release, and it is included in CDH as well as Apache BigTop. The LinkedIn engineering blog features a post celebrating the 1.0 release and walking through a number of use cases (with code and analysis). If you're using Pig, you should probably be using DataFu's UDFs or Java API to help you write better UDFs.

Summingbird is a new framework open-sourced this week by Twitter for running MapReduce jobs in batch via Scalding or in real-time via Storm. With Summingbird, the same code can be used for hybrid computation, allowing for real-time computation with Storm and course-correction via Scalding/MapReduce. The Twitter blog has a lot more details about the framework and future plans for it.
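To make the hybrid idea concrete, here is a rough conceptual sketch in plain Python. It is not Summingbird's API (Summingbird is a Scala library); the function name and the sample counts are hypothetical. It assumes the batch layer has counted events up to some cutoff and the real-time layer has counted events after it, so the two partial results can be combined by simple addition.

```python
from collections import Counter


def merge_counts(batch_counts, realtime_counts):
    """Combine two partial count maps by adding values key by key."""
    merged = Counter(batch_counts)
    merged.update(realtime_counts)  # Counter.update adds counts, it does not overwrite
    return dict(merged)


# Hypothetical data: the batch job covers events up to a cutoff,
# and the streaming job covers events that arrived after it.
batch = {"hadoop": 120, "pig": 45}
realtime = {"hadoop": 3, "storm": 7}

totals = merge_counts(batch, realtime)
print(totals)
```

Because addition is associative, a later batch run that recounts the same period can replace the real-time portion without changing how the results are merged.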

Oozie 4.0.0 was released with over 200 resolved JIRA issues. Among the new features in this release are SLA support (emails if a task takes longer than an expected amount of time) and integration with HCatalog (which publishes JMS messages when new partitions/datasets are created).

Cloudera announced Cloudera Search 1.0, Cloudera Manager 4.7, Cloudera Navigator 1.1, and CDH 4.4. In addition to marking the GA of Cloudera Search, the theme that stands out across the releases is that of security -- Cloudera Manager gets support for Sentry, Sentry is integrated into CDH 4.4, and Cloudera Navigator adds auditing support for Impala and tracking of "denial of privilege events" for Hive.

Spring for Hadoop 1.0.1 GA was released. It supports Apache Hadoop 1.2.1, Apache Hadoop 2.0.6-alpha and Apache Hadoop 2.1.0-beta (and CDH, HDP, and Pivotal). Also, a new milestone release of Spring for Hadoop 2.0, which adds support for YARN, was released.

CDK 0.7.0 was released with a more consistent Dataset API, several new features in the Morphlines library, and support for Java 7 and Avro 1.7.5.

The glusterfs-hadoop project released version 2.1. The project is Apache-licensed, and it provides a Hadoop FileSystem implementation for GlusterFS. The post includes information about configuring MapReduce to use GlusterFS (it is also being integrated with Apache Ambari for easier configuration).


Curated by Mortar Data

Monday, September 9
Cleveland Big Data and Hadoop User Group

Tuesday, September 10
HUG UK September Meetup (London, UK)

Tuesday, September 10
Real-time Analysis with Storm and Beacon (Durham, NC)

Tuesday, September 10
São Paulo HUG - Introduction to Data Science (São Paulo, Brazil)

Tuesday, September 10
How to Create a Big Data Business (Bangalore, India)

Wednesday, September 11
September Downtown Meetup with Arun Murthy from Hortonworks (Chicago, IL)

Wednesday, September 11
September SF Hadoop Users Meetup (San Francisco, CA)

Thursday, September 12
Apache Drill, Popular Hadoop MapReduce Tools (Stockholm, Sweden)

Thursday, September 12
Hadoop for Beginners (Atlanta, GA)

Thursday, September 12
Solr + Hadoop = Big Data Search (New York, NY)

Thursday, September 12
Solr+Hadoop: Adding Search to the Big Data Ecosystem (Los Angeles, CA)

Thursday, September 12
Strategic Career Positioning with Hadoop and Big Data (Vancouver, BC)

Sunday, September 15
Hive Deep Down (Sunnyvale, CA)