Data Eng Weekly

Hadoop Weekly Issue #70

18 May 2014

Yahoo announced their support for Hive and Tez this week in the widely contested SQL-on-Hadoop market. Meanwhile, there is an interesting overview of a real-world use-case at Allstate with Cloudera’s SQL-on-Hadoop system Impala. There are also plenty of interesting technical articles and exciting announcements—including the public availability of Splice Machine’s RDBMS on HBase product and a native implementation of MapReduce that’s been open-sourced by Intel.


The Cloudera blog has an article about a change in the way that Oozie manages its shared library directory in HDFS. The change adds support for multiple versions of the directory, which fixes a race condition. The post explains the change and the tooling around it.

There have recently been a lot of benchmarks in the SQL-on-Hadoop arms race. Those benchmarks tend to use synthetic data, though, so it’s interesting to hear about systems running on real-world datasets. In this case, data from Allstate Insurance is analyzed using Impala. What starts as a 2.3TB CSV file compresses to 106GB when encoded using Parquet, and the resulting performance improvements on a 35-node cluster are impressive. The article has full details, including some tips for someone looking to tackle a similar dataset.
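As a rough check on the sizes quoted above, the reduction works out to a little over 20x. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB); the summary doesn't say which convention the article uses, and the variable names are only illustrative.

```python
# Sizes as quoted in the summary above. Decimal units are an assumption.
GB = 10**9
TB = 10**12

csv_bytes = 2.3 * TB      # original CSV file
parquet_bytes = 106 * GB  # same data encoded as Parquet

# Ratio of original size to encoded size.
ratio = csv_bytes / parquet_bytes
print(f"Parquet is about {ratio:.1f}x smaller than the CSV")
```

With binary units the ratio comes out slightly higher, but the order of magnitude is the same.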

I once looked into writing a new converter for Parquet in order to write out a custom data type. I wish I’d had this excellent guide at the time. It walks through the various pieces of the Parquet write API. In addition to the written analysis, the post has excellent diagrams showing the different layers as well as how data is stored on disk.

A tutorial on the Cloudera blog shows how to process stock market data stored in an Apache Avro file using Apache Crunch. It covers how to implement a secondary sort to group by day and sort by stock symbol within the day.
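For readers unfamiliar with the pattern, here is the idea of a secondary sort in plain Python. It is not the Crunch API and not the tutorial's code, just a sketch of the technique: sort on a composite key (day, symbol), then group on the day alone. The record layout, the symbols, and the prices are all made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (day, stock symbol, closing price). Values are made up.
trades = [
    ("2014-05-16", "BBB", 2.0),
    ("2014-05-15", "BBB", 4.0),
    ("2014-05-16", "AAA", 1.0),
    ("2014-05-15", "AAA", 3.0),
]


def secondary_sort(records):
    """Group records by day, with each day's records ordered by symbol."""
    # Sort on the composite key (day, symbol) so that each day's records
    # are contiguous and already in symbol order.
    ordered = sorted(records, key=itemgetter(0, 1))
    # Group on the day only; the order within each group is preserved.
    return {
        day: [(symbol, price) for _, symbol, price in group]
        for day, group in groupby(ordered, key=itemgetter(0))
    }


by_day = secondary_sort(trades)
for day, rows in by_day.items():
    print(day, rows)
```

In a MapReduce setting the same effect comes from sorting on the composite key while partitioning and grouping on the day, so the framework does the ordering rather than the reducer.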

While both aim to crunch massive amounts of data, Hadoop and HPC overlap very little—both in terms of community and technology/deployments. This in-depth article explores why the HPC community doesn’t adopt Hadoop (it’s an invader, it looks funny, it’s a poor reinvention of HPC technologies, and more), and suggests some changes that could be made on both sides to break down the divide. Given the massive amounts of money HPC organizations spend, I wouldn’t be surprised if we start seeing vendors address the problems.

Yahoo’s Hadoop Platforms Team has written a post about Apache Hive, Tez, and YARN at Yahoo. The article covers the rise of Hive at Yahoo (it sounds like it’s gaining ground on Pig, especially in the past 6 months), why Yahoo is betting on Hive over various other SQL-on-Hadoop solutions, some of the query performance that they’ve seen (and how Shark wouldn’t work on a 100-node cluster in their tests), and more.

Episode 22 of the All Things Hadoop podcast features an interview with Patrick Hunt discussing Apache Solr. The episode covers backing Solr by HDFS, SolrCloud, Cloudera Morphlines, and more. The article has a great summary of some of the technical details.


IBM is shipping a new version of their distribution, InfoSphere BigInsights 3.0, in a few months. One of the most notable features of this upcoming release is Big SQL version 3.0, which aims to be a drop-in replacement for existing RDBMSs. The update will include SQL 2011 compatibility, stored procedures, data federation (i.e. pulling back data from other services like DB2), security enhancements, and better performance.

Concurrent and Databricks have announced a partnership to build an integration between Cascading and Apache Spark. Datanami has an interview with Concurrent founder and CTO Chris Wensel about the push to integrate Cascading with multiple backends (the first of which was Tez). He notes that choosing the right backend is all about trade-offs.

Since the Cloudera-Intel deal, IBM’s is one of the few (the only?) distributions from a hardware maker. While most Hadoop clusters run on standard Linux systems, IBM’s distro is optimized to run on both x64 Intel processors and 64-bit RISC Power architectures. Datanami has a story about IBM’s Power8 and Hadoop. It covers some of the advantages of the chip vs Intel Ivy Bridge (e.g. memory, I/O, and cache), which can ultimately lead to cost savings.

Altiscale offers a Hadoop as a Service that aims to be more efficient than other systems like Amazon’s Elastic MapReduce by not relying on virtualization. To do so, Altiscale runs bare-metal servers in its own data center. While performance is improved, you must go through the process of transferring your data out of the cloud. This article has more details on the trade-offs and how Altiscale’s offering is somewhere between EMR and running your own Hadoop cluster.

Hortonworks announced that they’ve acquired XA Secure, a provider of security tools for Hadoop. The XA Secure software includes centralized policy management, fine-grained access control, auditing, and encryption. Hortonworks plans to open source the software and incubate it at the Apache Software Foundation. They hope that the process will begin in the second half of the year, and they will offer it in binary form until then.

Zettaset’s flagship project, Zettaset Orchestrator, is management software and tools built for security. An interview with Zettaset’s CEO explains how the company thinks about enterprise-grade security for Hadoop, and it discusses some of the software that the company has built.

MapR has released a Sandbox VM with HP Vertica running atop MapR. A post on the MapR blog introduces the VM and talks about some of the use cases and technical details of the MapR-Vertica integration.


Splice Machine announced public availability of its RDBMS that runs atop Hadoop and HBase. Unlike other solutions, Splice Machine’s product includes support for online transaction processing and is ANSI SQL-99 compliant. An article on Datanami has more details about the product, which is set to hit version 1 later this year.

Version 0.14.0 of the Cloudera Kite SDK was released. This version includes improved documentation, view support for the MR and Crunch libraries, bug fixes, and several other enhancements.

Cloudera Enterprise (CDH and Cloudera Manager) version 5.0.1 was released this week. The 5.0.1 release includes a new version of Impala (1.3.1) and several bug fixes to Cloudera Manager and CDH components. In addition, version 4.8.3 of Cloudera Manager was released with several bug fixes and improvements.

Version 0.10.0 of Scalding, the Hadoop library written in Scala and powered by Cascading, was released. This version upgrades the Cascading dependencies and includes a handful of improvements.

A team at Intel has been working on a native implementation of the MapReduce Map Output Collector for several months. It’s called NativeTask and was open-sourced this week. In addition to being a drop-in replacement for many existing Hadoop jobs, the framework also supports native mappers and reducers written in C++. For the classic WordCount example, the framework provides a 2.6x speedup.


Curated by Mortar Data



What is the big idea with ZooKeeper? by Jan Gelin of Rubicon Project (Los Angeles) - Monday, May 19

Techtalk v3.0 - Analytics in Cassandra and Hadoop + InfiniDB (San Francisco) - Thursday, May 22


Big Data, from technology to business potential (Bellevue) - Wednesday, May 21


Learn what's new in Mahout with Ted Dunning (Boulder) - Wednesday, May 21


St. Louis Hadoop Users Group Meetup (St. Louis) - Tuesday, May 20


Got data? Join us for a tech talk from Josh Wills, Data Scientist - Cloudera (Southfield) - Thursday, May 22


Abacus presents Hortonworks Hadoop and RedHat (Atlanta) - Tuesday, May 20

New York

Intermediate Workshop II: Writing Spark Applications (New York City) - Thursday, May 22


The Data Operating System for Hadoop 2.0 - YARN Tech Talk (Cambridge) - Tuesday, May 20


Kick-off meetup: Intro to Apache Spark by Databricks (Vancouver, B.C.) - Thursday, May 22


HBase User Group London: Types in HBase & Apache Gora (London) - Monday, May 19


Let's Get Hands-on with Big Data! (Nairobi) - Monday, May 19


Big data: HBase for the Architect, by Nick Dimiduk (Paris) - Tuesday, May 20


Mind the Stack - Architecture stories by Israeli startups (Tel Aviv) - Tuesday, May 20


Charla/Taller de Big Data Open Source (Madrid) - Thursday, May 22