Data Eng Weekly

Hadoop Weekly Issue #73

08 June 2014

Hadoop Summit was this week in San Jose, so this week’s newsletter is full of interesting technical content and news. I tried to capture as much as I could, but there was an overwhelming amount. Enjoy!


Hivemall is a machine learning tool for Apache Hive. Implemented as Hive UDFs, it’s easy to test out (just add the jar to a Hive session), contains a number of machine learning algorithm implementations (including several not found in other Hadoop libraries), and can do iteration without multiple MapReduce jobs. These slides from a talk at Hadoop Summit provide many more details, including how it is implemented.

The Apache HBase community has recently implemented a lot of improvements to the mean time to recover (MTTR) for a failed region server. Facebook, who are large users of HBase, are looking to take this one step further by hosting each region in multiple region servers. A post on the Facebook blog describes the problem and the solution (called HydraBase) in more detail. Facebook plans to roll out the system internally soon, and hopefully it’ll inspire a similar solution in Apache HBase.

While few companies are running Hadoop clusters the size of Twitter (they have 100s of PBs in HDFS), it’s still really interesting to see and learn from their experiences. This post talks about the Hadoop 1 to Hadoop 2 migration at Twitter, including some of the scale and operational issues they ran into and solved as part of the migration. Overall, the migration seems to be a success—they’re seeing much better compute/memory utilization and are starting to see adoption of new frameworks on YARN like Spark.

The Kite SDK is a framework that aims to simplify a number of common Hadoop workflows through abstractions. For instance, there is a concept of a “dataset” which can be backed by HDFS or HBase and stored in various file formats (like Avro and Parquet). This post shows how to use some recently added command-line tools for the Kite SDK to load the contents of a delimited file into HDFS and HBase.

Spotify Data Engineer Adam Kawa tells the story of optimizing Hive in order to (quickly) find the answer to a question related to a bet with Spotify’s CEO (the stakes of which were meeting his favorite recording artist). The post walks through his experience with several of the latest Hive features, such as ORCFile, Hive-on-Tez, and vectorization. He also has details on sampling and writing unit tests for Hive queries.

There’s an interesting post on Quora comparing Apache Spark and Apache Stratosphere (incubating). The two projects aim to support a lot of similar use cases, but there are some major differences in the research and fundamentals of the systems. There are answers from Databricks’ CTO and a Stratosphere project member.

Hortonworks has posted a Hive benchmark that compares Hive version 0.10 to version 0.13. Version 0.10 was the last release before the start of the “Stinger Initiative,” which aims to speed up Hive by 100x. One of the most impressive stats from the benchmark is that the total time taken by all queries went from nearly 8 days to 9.3 hours (roughly a 20x speedup). The post has a lot of detail about the setup and configuration used.

Inserting data into Hive is always done at the partition (or the entire table) level, meaning you can’t do INSERT INTO .. VALUES .. or similar statements. A new project is working to bring ACID to Hive, which will add support for these types of insert, delete, and update statements. The implementation uses new Acid Input/Output formats that can read and write delta files describing updates to a table. Details on the implementation and how delta files are compacted are in the following slides.
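The idea of a base file plus delta files, with periodic compaction, can be illustrated with a small sketch. This is a conceptual model in plain Python and not Hive's actual file format or API: the `apply_deltas` and `compact` helpers, the row-id keys, and the operation names are all hypothetical and exist only for illustration.

```python
# Conceptual sketch only: this is NOT Hive's ACID file format or API.
# Rows are keyed by a hypothetical integer row id.
from typing import Dict, List, Optional, Tuple

# A delta is an ordered list of (operation, row_id, row) events.
Delta = List[Tuple[str, int, Optional[dict]]]


def apply_deltas(base: Dict[int, dict], deltas: List[Delta]) -> Dict[int, dict]:
    """Return the table state after replaying deltas, oldest first, over base."""
    merged = dict(base)  # copy, so the base itself is never modified
    for delta in deltas:
        for op, row_id, row in delta:
            if op in ("insert", "update"):
                merged[row_id] = row
            elif op == "delete":
                merged.pop(row_id, None)
            else:
                raise ValueError(f"unknown operation: {op}")
    return merged


def compact(base: Dict[int, dict], deltas: List[Delta]):
    """Fold every delta into a new base, leaving no deltas to replay."""
    return apply_deltas(base, deltas), []


base = {1: {"name": "a"}, 2: {"name": "b"}}
deltas = [
    [("insert", 3, {"name": "c"})],
    [("update", 1, {"name": "a2"}), ("delete", 2, None)],
]

snapshot = apply_deltas(base, deltas)        # what a reader would see
new_base, remaining = compact(base, deltas)  # what compaction would write
```

In this model a reader merges the base with every outstanding delta on each read, so compaction trades a one-time rewrite for cheaper reads afterwards.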

JethroData, makers of an analytics database built on Hadoop, have posted a benchmark in response to the recent benchmarking done by Cloudera on Impala. They’ve recreated the test on a smaller cluster running in Amazon EC2 with a much smaller scale factor (1TB vs. Cloudera’s 15TB), but the results are quite promising (with the same caveat as always about vendor benchmarks). In addition to the results, it’s interesting to see that someone has been able to use Cloudera’s Impala TPC-DS kit to run tests.

This post shows how to integrate Apache Flume and Apache Spark to do near-real-time processing of event data. It uses the Java API for Spark Streaming to compute top-10 queries on a rolling window.
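As a rough illustration of a top-N query over a rolling window, here is a minimal sketch in plain Python. It does not use the Spark or Flume APIs and is not the code from the post: the `RollingTopK` class and its parameters are hypothetical, and each call to `add_batch` stands in for one micro-batch of events.

```python
# Conceptual sketch in plain Python; it does not use the Spark or Flume APIs.
from collections import Counter, deque


class RollingTopK:
    """Track the k most frequent items over the last `window_batches` batches."""

    def __init__(self, window_batches: int, k: int = 10):
        self.window_batches = window_batches
        self.k = k
        self.window = deque()    # per-batch counts currently inside the window
        self.counts = Counter()  # running totals across the whole window

    def add_batch(self, events):
        batch = Counter(events)
        self.window.append(batch)
        self.counts.update(batch)
        if len(self.window) > self.window_batches:
            # Slide the window: subtract the oldest batch from the totals.
            expired = self.window.popleft()
            self.counts.subtract(expired)
            for key in expired:
                if self.counts[key] <= 0:
                    del self.counts[key]
        return self.counts.most_common(self.k)


top = RollingTopK(window_batches=2, k=2)
r1 = top.add_batch(["a", "a", "b"])
r2 = top.add_batch(["b", "b", "c"])
r3 = top.add_batch(["c", "c", "c"])  # the first batch falls out of the window
```

Keeping running totals and subtracting the expired batch avoids recounting the whole window on every slide, which is the same idea behind incremental windowed reductions in streaming systems.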

The Hortonworks blog has a post on the recently added Blueprints feature of Apache Ambari. Blueprints aim to implement best practices for cluster management and to make provisioning easier and repeatable. The post explains how blueprints work and some of the APIs available to inspect a cluster.

There were a lot of interesting talks at Hadoop Summit—way too many to cover in this newsletter. I’ve put together a bundle of all the slides that I’ve found from last week’s conference. There’s certainly a lot of interesting stuff happening in the ecosystem!


In a respite from the “Hadoop Wars,” big names from Cloudera and Hortonworks got on the stage at Hadoop Summit in a positive setting. It’s sometimes hard to remember that folks from these and other companies are collaborating on open-source software, so it’s great to see some positivity in public.

Concurrent, makers of Cascading, Lingual, Driven and other tools for Hadoop, announced a new round of funding totaling $10 million. VentureBeat has some quotes from CEO Gary Nakamura about how they’ll use the money and their business model.

This story has a lot of good observations about the Hadoop business and what it’ll take for Hadoop to gain enterprise adoption. While some companies got into Hadoop in order to offload ETL, SQL might attract even more. The article is also quick to point out that Hadoop has much more to offer than ETL and SQL.

On the heels of Hortonworks’ acquisition of XA Secure, Cloudera announced that they’ve acquired Gazzang. An article on GigaOm has some more details on the deal, Gazzang’s security products, how Cloudera will integrate Gazzang, and more.

Cloudera announced the end of their funding round and disclosed more details of the deal. Details include the financials (of the $900M, $530M was ‘primary capital’) and a new Cloudera board member, Kimberly Stevenson (who is the CIO of Intel).

InfoQ has coverage of Hadoop Summit by way of recaps of days one and two. There are details on some of the keynotes as well as various talks throughout the days.

Datanami has a story on Yahoo’s use of Hadoop, and its spinout of Hortonworks. There’s mention of several products built on Hadoop at Yahoo, the close relationship between Hortonworks and Yahoo, and more.

AT&T and Continuuity announced that they are collaborating on a new stream processing tool called jetStream that will combine functionality of Continuuity’s BigFlow and AT&T’s streaming analytics tool. It sounds like jetStream, which will be open-sourced in Q3 2014, will have a lot of interesting features.

Databricks announced a new series of Spark workshops. Now through August, they’ll be holding training workshops in New York, San Jose, San Francisco, Austin, and Chicago.

MapR and Syncsort announced a partnership to bring Syncsort’s DMX-h ETL framework to MapR’s distribution.


Microsoft's Hadoop-as-a-service offering, HDInsight, was updated this week to support Hadoop 2.4. Microsoft notes that the new version includes major speedups (up to 100x) when querying data stored in the Azure Blob store.

A new Kiji Bento Box SDK was released this week. The "Ebi" 2.0.3 SDK includes new versions of all components in the Kiji framework.

Hue 3.6 was released with a new Search application. The tool, which uses SolrCloud running on Hadoop, provides a number of fancy new interfaces for looking at data. The new release also improves support for Snappy compression and Impala.

DataTorrent announced general availability of their stream-processing framework for Hadoop. There are a number of open-source streaming tools, but DataTorrent seems to be going after the enterprise with their feature set (which includes SLAs, Alerts, and more).

Altiscale announced support for Apache Hive 0.13 as part of the Hadoop-as-a-Service platform.

Hortonworks announced a preview of Apache Slider for HDP2. Slider is a system for deploying applications inside of YARN. The preview includes sample applications for Apache HBase, Accumulo, and Storm, and there is an integration with Apache Ambari.


Curated by Mortar Data



June SF Hadoop Users Meetup (San Francisco) - Wednesday, June 11

SF: How to Integrate Hadoop with Systems, with Jemish Patel (San Francisco) - Thursday, June 12

AdvancedAWS: June Meetup (San Francisco) - Thursday, June 12

Big Data Camp LA (Los Angeles) - Saturday, June 14


Hey Hadoop, meet Apache Spark! (Salt Lake City) - Wednesday, June 11


Big Data Developer Day (Boulder) - Thursday, June 12


A Leap Forward for SQL on Hadoop (Coppell) - Tuesday, June 10

Houston Hadoop Meetup Series (Houston) - Wednesday, June 11


Doug Cutting, founder of Hadoop, on "The Future of Data" (Kansas City) - Tuesday, June 10


Chicago Apache Lucene/Solr User Group - Leveraging Solr for Big Data Insight (Chicago) - Tuesday, June 10

Introduction to Kiji & Real-Time Personalization with Hadoop, HBase & Cassandra (Chicago) - Thursday, June 12


Big Data Developer Day (Malvern) - Tuesday, June 10

Workshop for SQL on Hadoop (Pittsburgh) - Thursday, June 12

District of Columbia

The 1st Washington DC Area Apache Spark Interactive Meetup (Washington) - Thursday, June 12


Indexing Strategies on Apache Accumulo (Largo) - Wednesday, June 11

Conference: Accumulo Summit (East Hyattsville) - Thursday, June 12


Considerations for a Holistic Approach to Big Data Security and Privacy (Cambridge) - Tuesday, June 10

Advanced Analytics in Hadoop (Cambridge) - Wednesday, June 11


Impala Roadmap, with Eli Collins (Toronto) - Monday, June 9

Hands on Hadoop (Vancouver) - Tuesday, June 10


MapReduce (BigData) et Oracle Database 12c avec Kuassi Mensah (Paris) - Monday, June 9


Gary Short: From Zero to Hadoop (Cambridge) - Tuesday, June 10

Hands-on Clojure: Introducing Cascalog (Cambridge) - Wednesday, June 11


Introduction to Hive, HiveQL, Hive with Hadoop (Mumbai) - Wednesday, June 11