Data Eng Weekly

Hadoop Weekly Issue #71

25 May 2014

Articles in this week’s newsletter cover a couple of themes that have been emerging recently in the Hadoop ecosystem. First, Apache Storm continues to see adoption for production workloads (whereas I’ve yet to see many serious deployments of newer tools like Spark Streaming). Second, Hadoop in the cloud is starting to gain traction (and adoption will likely accelerate as lightweight virtualization and the cloud price wars take off). There are a lot of good articles covering these topics and more in this week’s issue.


kafka-storm-starter is a repository containing an example integration between Kafka and Storm for stream processing. It uses Avro for serialization, and the code base contains both Kafka and Storm standalone code examples, example unit tests, and example integration tests. The README has a lot of details on the implementation, on setting up a development environment, and much more.

“Compiled Python UDFs for Impala” explores the details of the LLVM-based Python UDFs for Cloudera Impala that can be built with the recently released impyla project. The talk details the LLVM intermediate output format, shows some example impyla code for building LLVM UDFs, and gives a performance comparison between impyla and PySpark.

Informit has posted the preface to the recently-released “Apache Hadoop YARN” by Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec, and Jeff Markham. The preface briefly tells the story of Hadoop, the Hadoop community, and the reasons for and goals of YARN.

Big Data & Brews has a “How It Works” episode that includes architecture overviews by folks from Concurrent, MapR, and Pivotal.

Flickr has recently migrated an image classification system from an OpenMPI-based system to a hybrid batch + online system using Apache Hadoop and Storm. Hadoop is used for building a classifier from a training data set, and it can also process the entire Flickr corpus using 10,000 mappers in about 1 week. For real-time updates, they use a 10-node Storm cluster pulling from Redis. The slides from the presentation have much more detail, and there’s a link to the video of the talk in the description.

HBase added support for a new DataType API as part of HBase 0.95.2. Prior to that, all interaction with HBase was using byte arrays. This talk goes through the motivation and API of the new DataType API. It also has some examples of using the TypedTuple, Structs, and Protobuf serialization APIs.
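For readers who haven’t worked with raw byte arrays, here is a minimal sketch of the kind of problem a typed encoding layer addresses. It is written in Python, is not HBase’s API, and does not reproduce HBase’s actual encodings; the function names are hypothetical. It shows one well-known trick: flipping the sign bit of a big-endian integer so that byte-wise sort order matches numeric order.

```python
import struct

# Hypothetical helpers for illustration only; not part of HBase.
SIGN_OFFSET = 0x80000000  # 2**31, shifts signed 32-bit ints into unsigned range


def encode_int32(value: int) -> bytes:
    """Encode a signed 32-bit int so byte order matches numeric order."""
    # Adding 2**31 maps [-2**31, 2**31 - 1] onto [0, 2**32 - 1];
    # struct raises struct.error if the value is out of range.
    return struct.pack(">I", value + SIGN_OFFSET)


def decode_int32(raw: bytes) -> int:
    """Reverse encode_int32."""
    (unsigned,) = struct.unpack(">I", raw)
    return unsigned - SIGN_OFFSET


values = [5, -3, 0, 2147483647, -2147483648]
encoded = sorted(encode_int32(v) for v in values)  # plain byte-wise sort
decoded = [decode_int32(b) for b in encoded]
print(decoded == sorted(values))
```

A naive two’s-complement encoding would sort negative numbers after positive ones, which is the sort of detail every application had to handle itself when working directly with bytes.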

GigaOm has an interview with Cloudera’s VP of Engineering Peter Cooper-Ellis about Cloudera’s datacenter footprint and deployment setup. They deploy on bare-metal (for benchmarking, PoC, etc.), internal clouds (for core products like building and packaging), and public clouds like Amazon (for product development, testing, and debugging). Cloudera has to test out lots of different configurations in order to certify and support third-party integrations. The article also notes that optimizing Cloudera products for cloud environments is one of the engineering team’s top initiatives for 2014.

Folks tend not to run Hadoop in cloud environments due to performance and cost reasons, but I’m starting to see more evidence of companies doing so, particularly when it comes to HBase. This talk explains the various details of Pinterest’s HBase deployment in the Amazon cloud, which serves terabytes of data. It talks about schema design, load testing, tools, monitoring, alerting, and more.

The Cloudera blog has a post showing how to convert data from Avro to Parquet. It includes two examples: the first uses the Java APIs to write out data in a single thread, and the second shows how to write a MapReduce job to do the conversion. It also has some details on tweaking compression and other configuration parameters.


Hortonworks and BMC announced a partnership to bring BMC’s Control-M for Hadoop to the Hortonworks Data Platform. Control-M is a workflow automation tool which supports HDFS, Pig, Sqoop, Hive, and MapReduce.

Computing has an interview with Cloudera Chief Strategy Officer Mike Olson on the recent Cloudera-Intel deal. He defends the deal, which has been criticized as positioning Cloudera to compete with established DW vendors like Teradata and Oracle, and accepts blame for not explaining their positioning and strategy better. He also defends Cloudera’s stance on open-source (their distribution contains proprietary components), and speaks more about the benefits of the Intel deal.

The Parquet project was accepted into the Apache Incubator. Co-founded by Twitter and Cloudera, Parquet is a columnar storage format built for Hadoop. The format is supported by a number of projects and commercial distributions. The Cloudera blog has more details and a list of posts written about Parquet.

Databricks and Pivotal announced a partnership to support Apache Spark on Pivotal HD 2.0.1. The news was announced on the Databricks blog in a post by Pivotal. At the end, the post links to downloads and documentation for getting started with Spark on Pivotal HD.

MongoDB and Hortonworks announced that MongoDB is now a Hortonworks Certified Technology Partner. In technology terms, the MongoDB Hadoop Connector has been certified for HDP 2.1. According to the announcement, extensive review and testing were done as part of the certification, and the post links to detailed documentation.

InformationWeek has an article on the recently-announced public beta of Splice Machine’s SQL-on-Hadoop product. The article has some additional details on the implementation, which marries Apache Derby and HBase (from the 0.95 series). There’s a testimonial from Harte Hanks about their plan to replace an Oracle RAC system with Splice Machine.

The GigaOm Structure podcast hosted Altiscale co-founder and CEO Raymie Stata, who was formerly CTO of Yahoo during the period when Hadoop was just getting started. The GigaOm website has a summary of the podcast, which covers Hadoop as a Service, some thoughts on the commercial Hadoop industry, Spark, search, and more.

The Optiq project was accepted into the Apache Incubator. Optiq is a system, including an advanced query optimizer, that allows SQL access to data stored in heterogeneous data stores. It’s in use by several Hadoop ecosystem projects including Apache Drill, Hive, and Cascading Lingual.


Apache Flume 1.5.0 was released. This release adds a number of new features including a SpillableChannel (for spilling to disk when the in-memory buffer fills up) and an Elasticsearch Sink. The release also contains a large number of documentation improvements, bug fixes, and general codebase improvements.
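To illustrate the general idea of spilling to disk, here is a rough Python sketch of a FIFO queue that keeps events in memory up to a fixed capacity and writes the overflow to a temporary file. This is not Flume’s implementation or configuration; the class name, the capacity parameter, and the length-prefixed file layout are all assumptions made for the example.

```python
import collections
import os
import tempfile


class SpillableQueue:
    """Hypothetical FIFO queue that overflows to a temp file (illustration only)."""

    def __init__(self, memory_capacity: int):
        self.memory_capacity = memory_capacity
        self.memory = collections.deque()
        self.spill_file = tempfile.TemporaryFile(mode="w+b")
        self.read_pos = 0   # offset of the next unread spilled event
        self.spilled = 0    # number of spilled events not yet taken

    def put(self, event: bytes) -> None:
        # Only use memory while nothing is on disk; otherwise a new event
        # could overtake older events that were already spilled.
        if self.spilled == 0 and len(self.memory) < self.memory_capacity:
            self.memory.append(event)
            return
        self.spill_file.seek(0, os.SEEK_END)
        # Length-prefix each event so it can be read back exactly.
        self.spill_file.write(len(event).to_bytes(4, "big") + event)
        self.spilled += 1

    def take(self) -> bytes:
        if self.memory:  # in-memory events are always the oldest
            return self.memory.popleft()
        if self.spilled:
            self.spill_file.seek(self.read_pos)
            size = int.from_bytes(self.spill_file.read(4), "big")
            event = self.spill_file.read(size)
            self.read_pos = self.spill_file.tell()
            self.spilled -= 1
            return event
        raise IndexError("take from an empty queue")


queue = SpillableQueue(memory_capacity=2)
for event in (b"a", b"b", b"c", b"d"):
    queue.put(event)  # "c" and "d" go to the temp file
print([queue.take() for _ in range(4)])
```

The sketch never reclaims disk space and has no transactions or durability guarantees, so it shows only the overflow behavior, not what a production channel needs.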

Version 0.14.1 of the Kite SDK was released this week. The release contains a number of bug fixes, including dataset example fixes and a fix to the Kite Maven plugin.

Amazon Elastic MapReduce released a new version of their Hadoop 2 AMI with support for Hadoop 2.4.0, Pig 0.12, Impala 1.2.4, HBase 0.94.18, and Mahout 0.9.

A new version of the Microsoft .NET API for Hadoop was released. The software uses the Hadoop HTTP APIs to communicate with Hadoop clusters, and it is available through the NuGet Package Manager Console.

Google has open-sourced the Google Cloud Storage Connector as part of the “bigdata-interop” Hadoop tools for the Google Cloud Platform. The code is Apache licensed.

Parquet 1.5.0 was released. The release adds statistics to Parquet pages and row groups, adds fixes for column pruning, bumps the protobuf dependency to version 2.5.0, and more.
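One common use for per-group statistics in columnar formats is letting a reader skip data that can’t match a filter. The Python sketch below shows that idea with min/max values per group. It is not Parquet’s file format or API; the `RowGroup` class, the function names, and the choice of min/max as the statistics are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of column values plus its statistics."""
    values: List[int]
    minimum: int
    maximum: int


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split values into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and how many groups actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.maximum <= threshold:
            continue  # statistics prove no value in this group can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = build_row_groups([1, 2, 3, 10, 11, 12, 4, 20, 5], group_size=3)
matches, groups_read = scan_greater_than(groups, threshold=9)
print(matches, groups_read)
```

Skipping works best when similar values are stored near each other; with randomly ordered data, most groups span a wide range and few can be ruled out.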

FullContact has open-sourced their SSTable InputFormat for Hadoop. The library provides offline access to Cassandra data for MapReduce. The SSTable InputFormat can split input data for better parallelism.


Curated by Mortar Data



A Deep Dive on Amazon Kinesis for Real-time Stream Processing (Irvine) - Wednesday, May 28

Pepperdata Meetup: Best Practices for Large-Scale Distributed Systems (Sunnyvale) - Wednesday, May 28

Leveraging Vertica for Analytics in Hadoop/NoSQL environment (San Ramon) - Wednesday, May 28

Beyond MapReduce [Clash of The Titans Series] (Sunnyvale) - Thursday, May 29

How LinkedIn Uses Scalding for Data Driven Product Development (Mountain View) - Thursday, May 29

Washington State

Fun Things You Can Do With Spark 1.0 with Paco Nathan [Special Event] (Seattle) - Tuesday, May 27

Spring 2014 Seminar Series: Big Data Infrastructure (Tacoma) - Wednesday, May 28


This Ain't Your Father's Search Engine (Saint Paul) - Thursday, May 29

North Carolina

May CHUG: Nikhil Kumar (SyncSort) on Converting SQL to MapReduce (Charlotte) - Wednesday, May 28

New York

Design Framework for Big Data Computing (New York) - Wednesday, May 28

Big Data Topic: Applying Testing Techniques to Hadoop Development (New York) - Thursday, May 29

Inside Look: How Next Big Sound Tamed the Hadoop Ecosystem for Music & Publishing (New York) - Thursday, May 29


Monthly Meetup - Open Sessions (Toronto) - Thursday, May 29


Hadoop vs Spark (Tel Aviv-Yafo) - Tuesday, May 27


23rd Brussels Datascience Meetup (Ghent) - Tuesday, May 27


Simplifying Application Development on Hadoop (Berlin) - Monday, May 26

Meetup #14 with Ted Dunning and Sebastian Schelter (Berlin) - Thursday, May 29


Big Data Conference (Mumbai) - Friday, May 30

How YARN Made Hadoop Better (Hyderabad) - Saturday, May 31