Data Eng Weekly

Hadoop Weekly Issue #86

07 September 2014

While last week’s issue had posts covering a few common themes, this week’s issue has content for a wide number of topics. Those topics include: Spork (Pig on Spark), Hive (specifically the new initiative), and Presto. There is also some interesting news from established enterprise companies—Teradata has acquired Think Big Analytics, and Cisco has released management and monitoring software for Hadoop.


A project called “Spork” has been working to build a Spark-based execution engine for Pig. This post gives an overview of how that implementation was built, describes how to get started (it’s as simple as pig -x spark), and what the current status is (passing 100% of tests, but not yet merged to Pig-trunk).

Hortonworks has announced “,” a continuation of the Stinger Initiative (which had a goal of speeding up Hive by 100x). The project has three goals—speed improvements to support sub-second queries, improved scalability, and improved SQL support for transactions and SQL-2011’s analytic functions. There are a few other enhancements planned, too, like materialized views and streaming ingestion of data. The project is split into three phases, which will deliver in 2H 2014, 1H 2015, and 2H 2015.

This post is a good introduction to Spark’s RRD APIs. It looks at how to translate some mapper and reducer functions from a traditional MapReduce implementation. Some concepts translate directly, but you’ll quickly see several new methods like reduceByKey, groupByKey, and flatMap. The post also shows that a simple flow becomes much more complex if you need to do some setup or teardown (a la the MapReduce APIs setup() or cleanup()).

This blog series is an excellent overview and introduction to Spark, Pig, Hive, and Cloudera Impala. For each, it gives a brief introduction to the computation model and features of the framework. The coverage of Hive includes a walkthrough of new features from the Stinger Initiative—ORCFile, query planner improvements, Tez as a backend, and vectorization (there’s quite a good technical overview of each). The coverage of Impala is also quite interesting—it enumerates several reasons that Impala tends to be faster than Hive (e.g. JVM GC overhead, Impala is better at pipelining, pull vs. push of intermediate data, and more).

This post describes integrating recommendations built by Mahout with a search engine to solve the cold-start problem (i.e. how to recommend to a new user). Using some preferences collected from during registration, the system queries item-similarity data stored in the search engine to create recommendations.—part-2

Qubole’s Presto-as-a-Service exposes Presto, which is a SQL-on-Hadoop system from Facebook, as a hosted service. It operates on data stored in S3, which means that data isn’t local to compute nodes. In order to get better speed, Qubole has architected a caching layer for Presto which supports both in-memory and SSD-based caches. This post explains the implementation, which uses consistent hashing and takes advantage of the kernel’s file caching (rather than building their own in-memory store). The post also has some experimental results, which show a 10-15x speedup by enabling caching and switching to ORCFile (from text).

With MapReduce on YARN, there’s no longer a long-lived JobTracker. For inspecting job histories, the MR Job History Server was introduced. Recently, a more general-purpose implementation called the Timeline Server was conceived. The Timeline Serve supports frameworks other than MapReduce, such as Tez (Spark and MR support are on the way). This post includes an introduction to the Timeline Server, including an overview of the data it exposes (which is much richer than the MR Job History Server).

This post serves as a good example of how easy it is to inadvertently write a poor MapReduce job, even when using a higher-level framework like Hive. Specifically, the post describes the steps taken to discover that a Hive query was inadvertently doing a full cross-product. It also mentions how you might identify the cross-product from the EXPLAIN query plan output.

SequenceIQ recently announced Periscope, a system for auto-scaling YARN clusters to meet SLAs. This post introduces some of the terminology for Periscope—alarms and metrics. It describes some of the metrics, e.g. PENDING_CONTAINERS and LOST_NODES, how to build an alarm as using a REST API call, and how to set a scaling policy using the REST API.

This post gives a high-level overview of the steps needed to deploy Hadoop for ETL. It discusses using Apache Flume and Apache Sqoop to bring data into Hadoop, introduces the concept of “schema-on-read,” provides suggestions for which frameworks to use to do ETL, and gives an intro to workflow management (suggesting that Apache Oozie is often insufficient).


Databricks, the company founded by the team originally behind Apache Spark, have produced an infographic highlighting some of the progress made by Spark in the past year. It highlights how far the project has come in a year, particularly in terms of community growth (e.g. Spark 0.7 had 31 contributors, Spark 1.0 had 117).

This post describe five influential Google projects that have spawned open-source equivalents. The five projects are Google MapReduce (Apache Hadoop MapReduce), Bigtable (HBase), Borg (Mesos), Chubby (Zookeeper), and Dremel (Drill).

Infobright and MapR announced a partnership for joint deployment of MapR’s distribution and Infobright’s analytics platform.

DataStax, who sells commercial software for Apache Cassandra, made news this week with a giant Series E round of financing. The deal brings in $106 million (bringing the total investment to $190 million) and values the company at $830 million.

Teradata has acquired Think Big Analytics, a big data enterprise services company. Notably, Think Big Analytics helps companies integrate Hadoop and NoSQL with existing technologies. Thus, industry pundits are seeing this acquisition as Teradata embracing the non-enterprise DW big data ecosystem.


Amazon Web Services announced support for Hive 0.13.1 for their Elastic MapReduce (EMR) Hadoop-as-a-Service offering. In a secondary announcement, EMR has removed the 256-step limit from EMR clusters as part of the AMI 3.1.1.

Hivemall version 0.2 was released. It’s a stable release of the machine learning library for Hive, which is built using Hive User-Defined Functions. The 0.2 release follows five pre-releases, and a new 0.3 beta is also available. Hivemall supports functions for Classification, Regression, item-similarity, k-nearest neighbor, and feature engineering. It requires Hive 0.11 or later.

Version 1.1.1 of hbase-client, the async HBase client for Node.js was released. The release supports HBase 0.94.x.

Cisco has announced UCS Director Express for Big Data, which is a automation and configuration tool for Hadoop services. It also provides monitoring of physical hardware alongside of Hadoop services.

The folks at SequenceIQ have published Docker images for running Apache Phoenix (if you’re not familiar, Phoenix is a SQL engine that runs atop of Apache HBase). The post describes how to launch a container running Phoenix 4.1 on HBase 0.98.5, create some tables, and connect to the database using both the sql shell and using JDBC.

WANdisco released version 1.9.6 of Non-Stop Hadoop. The new version adds zoning (virtual clusters or sub-clusters) and support for rolling upgrades.

Qubole, who offers Presto (the SQL-on-Hadoop) as a service, has added a new auto-scaling feature. By inspecting statistics kept by Presto, the system determines when the cluster us under- or over-provisioned and automatically adds/removes nodes.


Curated by Mortar Data ( )



SD Big Data Monthly Meetup #3 (San Diego) - Wednesday, September 10

September SF Hadoop Users Meetup (San Francisco) - Wednesday, September 10

Real-Time Analytics with Storm by Ron Bodkin of Think Big Analytics (Los Angeles) - Thursday, September 11


Hunk for Hadoop (Houston) - Wednesday, September 10


Monthly SEMOP meeting (Southfield) - Tuesday, September 9


Cleveland Big Data at the Great Lakes Science Center (Cleveland) - Monday, September 8


Sep 11: Igniting Data Analysis with Apache Spark by Ryan Gimmy (Reston) - Thursday, September 11

North Carolina

Nikhil Kumar (SyncSort) on Converting SQL to MapReduce (Durham) - Tuesday, September 9

New Jersey

RDBMS on Hadoop? Yes, talk and hands-on session from Splice Machine (Flemington) - Tuesday, September 9

New York

Get Hands-on with Big SQL on Hadoop (New York) - Wednesday, September 10

FREE EVENT! Hadoop and Mainframes: Crazy, or Crazy Like a Fox? (New York) - Wednesday, September 10

Machine Learning on the Azure Cloud Platform (New York) - Friday, September 12


A Detailed Look at R on Hadoop (Moscow) - Thursday, September 11


Apache Spark - In Memory Map-Reduce (Hyderabad) - Saturday, September 13