Data Eng Weekly

Hadoop Weekly Issue #96

16 November 2014

Big news this week out of Palo Alto as Hortonworks has filed paperwork for an initial public offering. There were also a number of notable releases this week, including Apache Hive 0.14.0. Technical posts cover a large number of ecosystem topics, including Apache Sqoop, Apache Drill, and Apache Pig. There’s a lot of breadth in this issue, so there should be something for everyone!


The Cloudera blog has a guest post from Cerner about integrating Apache Kafka with HBase and Storm for real-time processing. The post describes how adopting Kafka helped reduce load on HBase (which was previously used for queuing) and improve performance. This style of Kafka-based architecture seems to be more and more common, but it’s always interesting to hear how folks are putting together the pieces of the Hadoop ecosystem.

The MapR blog has a post on using the recently released Apache Drill 0.6.0-incubating to analyze Yelp’s public data set. The data, which is a JSON file, can be queried directly via SQL in Drill without first declaring the data’s schema (Drill auto-detects it). The post has a number of sample queries which you can use to get started analyzing this or any other data set.
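To make the idea of schema auto-detection concrete, here is a minimal Python sketch that discovers field names and types from newline-delimited JSON at read time. It only illustrates the concept and is not how Drill is implemented; the sample records and the `infer_schema` helper are hypothetical.

```python
import json

# Hypothetical sample records, loosely shaped like business listings.
SAMPLE = """\
{"name": "Cafe A", "stars": 4.5, "review_count": 120}
{"name": "Diner B", "stars": 3.0, "review_count": 45, "city": "Phoenix"}
"""

def infer_schema(lines):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

schema = infer_schema(SAMPLE.splitlines())
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

Note that the `city` field appears in only one record; a schema discovered at read time has to tolerate fields that come and go between records.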

The Cloudera blog has a second guest post, this time from Dell, on the new Oracle direct-mode in Sqoop 1.4.5. The post describes several of the implemented optimizations in the Oracle direct mode and includes an analysis of performance improvements the connector provides.

The Hortonworks blog has a post on using Apache Pig with the Python scikit-learn package in order to predict flight delays using logistic regression and random forests. The post is a bit light on details, but there is a linked IPython notebook which has a very detailed overview and description of the entire process. Given that Python is often a data scientist’s top choice for machine learning on small data sets, it’s useful to see how to extend it to larger data sets with Pig.
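For readers new to logistic regression, here is a dependency-free Python sketch of the technique, trained with plain gradient descent. It is not the code from the post (which uses scikit-learn and Pig); the features, the six training rows, and the helper names are all made up for illustration.

```python
import math

# Made-up training data: (departure hour / 24, distance in thousands of miles).
X = [(0.25, 0.3), (0.30, 0.5), (0.35, 1.0), (0.75, 0.4), (0.80, 1.2), (0.90, 0.8)]
y = [0, 0, 0, 1, 1, 1]  # 1 = delayed, 0 = on time

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Estimated probability that a flight with features x is delayed."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, lr=1.0, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

w, b = train(X, y)
print(predict_proba(w, b, (0.80, 1.2)))  # an evening flight from the training set
print(predict_proba(w, b, (0.30, 0.5)))  # a morning flight from the training set
```

On real data you would hold out a test set and use a library implementation; the point here is only the shape of the model: a weighted sum of features passed through a sigmoid.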

The blog has a post on Sqoop1 support for Parquet, which leverages the Kite SDK to generate Parquet files during import. The post serves as a good introduction to Sqoop1, which can both import data to HDFS and update the Hive metastore with information about the data. There are examples demonstrating how to use Parquet support.

Tephra is an open-source system that provides globally-consistent transactions for Apache HBase. Cask, the makers of Tephra, have written a blog post describing the requirements and design of Tephra. Tephra is designed in such a way that it can be used with systems other than HBase, and it is even designed to support transactions spanning multiple data stores.

This presentation focuses on Spark Streaming, the micro-batch component of Apache Spark. The slides give an introduction to both Spark and Spark Streaming, describe several use cases (claiming there are 40+ known production use cases), give an overview of several integrations (Cassandra, Kafka, Elasticsearch, and more), and look ahead to some upcoming features and improvements in the development pipeline.


Hortonworks filed paperwork for its initial public offering this week. The filing includes a number of details on the company, including financial numbers ($33.4M in revenue so far in 2014), an overview of key company milestones, and the number of employees (524 at the end of September). GigaOm has an analysis of some of these numbers and an overview of what the IPO means for the rest of the industry.

IBM’s Big Data for Social Good Challenge opened this week. The challenge includes $40k in prizes, which will be awarded by a panel composed of IBM and industry experts. IBM has a curated list of datasets which can be used as part of a challenge entry.


Apache Drill 0.6.0-incubating was recently released. 0.6.0 is the second beta release, primarily containing bug fixes. Notable new features include ANSI SQL support for MongoDB, partition pruning, and (alpha) window function support.

Cubert is a new open-source tool from LinkedIn for writing high-performance MapReduce jobs. It combines a new language on the same level as Pig or Hive (sharing some resemblance to Pig) with a novel storage format/layer called blocks. For statistical calculations, graph computations, and OLAP cubes, Cubert offers impressive performance improvements. There’s a lot more information in the introductory blog post.

Apache Hive 0.14.0 was released this week. The release resolves over 1,000 (!) Jira issues. I’m sure we’ll soon hear more details about the release in blog post form, but some quick highlights include: support for insert/update/delete with ACID support, a cost-based optimizer, support for data stored in Accumulo, support for HBase snapshots, and many improvements to ORCFile and HiveServer2.

Pivotal Cloud Foundry (CF) has added support for deploying Cassandra via DataStax Enterprise. The blog post introducing the feature has many more details as well as an example of setting up a cluster.

Version 0.4.1 of the Spark Job Server has been released. The new version supports Spark 1.1.0 and has improvements for deployment/configuration.

Microsoft released version 2.5 of the Azure SDK and a preview of Visual Studio 2015. The releases contain support for HDInsight (the Hadoop as a Service component of Azure) including a Hive query editor and job viewer.


Curated by Mortar Data



Data Exploration in Spark (San Francisco) - Tuesday, November 18

Getting Started with Spark and Scala, by Paul Snively of Verizon OnCue (El Segundo) - Tuesday, November 18

OCBigData Monthly Meetup #7 (Irvine) - Wednesday, November 19

49th Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, November 19

HBase Meetup @ WANdisco (San Ramon) - Thursday, November 20


Unlocking Your Hadoop Data with Apache Spark and CDH5 (Seattle) - Wednesday, November 19


MapR Presents Apache Drill: Self-Service Data Exploration (Portland) - Wednesday, November 19

Apache Spark: Setup, Overview, and Comparison (Portland) - Wednesday, November 19


Scalable In-Hadoop ETL Execution: Pentaho's Visual MapReduce (Overland Park) - Wednesday, November 19


Securing the Hadoop Cluster (Saint Louis) - Tuesday, November 18


Hadoop Like a Champion! (Austin) - Tuesday, November 18

Spark and Cassandra: Building and Deploying an Application (Austin) - Thursday, November 20


Hadoop Lunch at Adobe (Lehi) - Thursday, November 20


Hadoop Tutorial: Map-Reduce on YARN, Part 1 (Sterling) - Saturday, November 22


Understanding the Foundations of Hadoop (Philadelphia) - Tuesday, November 18

North Carolina

Triangle SQL Server UG Meeting (Raleigh) - Tuesday, November 18

Automating Customer Intelligence Management in Hadoop (Charlotte) - Wednesday, November 19

When to Use Pig instead of Hive (Winston Salem) - Thursday, November 20

New Jersey

YARN + Docker Containers: Integration and Privilege Isolation (Hamilton Township) - Wednesday, November 19

New York

Privilege Isolation in Docker Containers (New York) - Thursday, November 20


SQL on Hadoop: Hands-on (Boston) - Wednesday, November 19


November 2014 Hadoop Meetup (London) - Monday, November 17


Analyzing Real-World Data with Drill, Hadoop & MongoDB | Tomer Shiran, MapR (Singapore) - Monday, November 17


Apache Cassandra, Apache Spark, and Hadoop Meetup (Munich) - Tuesday, November 18

Patrick McFadin Talks C* & Spark for Time Series, plus A Leap Forward for SQL on Hadoop (Berlin) - Wednesday, November 19


Patrick McFadin Talks Cassandra, Spark, Tips and Tricks (Amsterdam) - Friday, November 21


Big Data Meetup, ApacheCon Edition (Budapest) - Tuesday, November 18


Drilling in on SQL and Hadoop (Melbourne) - Wednesday, November 19


Databricks Comes to Barcelona (Barcelona) - Thursday, November 20


Big Data Meetup (Bangalore) - Friday, November 21

Hadoop Workshop (Hyderabad) - Saturday, November 22