Data Eng Weekly

Hadoop Weekly Issue #91

12 October 2014

With Strata+Hadoop World taking place this week in New York, we can expect to see a lot of announcements. But a number of folks have jumped out ahead of the conference, and there are several partnership and technical announcements in this week’s issue. On the technical side, Databricks posted a benchmark for terasort on Spark, and eBay has open-sourced Kylin, their Hadoop OLAP system. If you’re in NYC for Strata+Hadoop World, be sure to check out some of the 14 meetups happening this week!


This tutorial walks through the steps necessary to configure a Shark (Hive on Spark) thrift server and use it to power Tableau over ODBC.

In an in-depth and interesting post, Netflix has described their use of Presto, an SQL-on-Hadoop (or in this case, SQL-on-S3) system open-sourced by Facebook, on AWS. Netflix has over 10 PB of data in S3, runs a Presto cluster consisting of 250 m2.4xlarge instances, and supports around 2500 queries per day. They’ve contributed a number of improvements to Presto, including better support for the Parquet file format and S3.

A paper presented at the OSDI conference this week focusses on testing in distributed systems. The paper considers reports of real-world failures of several distributed systems from the Hadoop ecosystem: Cassandra, HBase, HDFS, and MapReduce. The authors have a number of interesting findings, including: 98% of failures are guaranteed to manifest on <= 3 nodes, 77% of failures can be reproduced by a unit test, and 92% of catastrophic failures are due to incorrect handling of non-fatal errors. They introduce Aspirator, a system for statically analyzing software to find these types of errors.

The Los Angeles Spark User Group recently hosted a panel of data scientists from Cloudera, MapR, and Pivotal. The panelists discussed Spark’s conception and history, their vision for the future of Spark, and more. Inside Big Data has a video of the panel.

This post covers setting up the Google Cloud Storage Hadoop FileSystem integration with Apache Spark. It covers the installation and configuration steps as well as some simple smoke tests to ensure the system is set up correctly.

The SequenceIQ blog has a post describing a system they’ve built for Hadoop monitoring. The system consumes metric log files generated by the Hadoop metrics system using collect. From there, the metrics are sent via Logstash to an Elasticsearch cluster. Kibana is used for dashboarding and visualization. SequenceIQ has published a development preview of the client and server daemons, which are run as Docker containers.

Databricks has published results on using Apache Spark to sort 100 TB and 1 PB of data. The benchmark used 206 nodes in AWS EC2 and completed the sort of 100 TB in 23 minutes, which is just under 3x as fast as the previous record from 2013 on a 2,100-node Hadoop cluster. The post on the Databricks blog has details on the experiment as well as background on several of the recent improvements to Spark that helped them achieve the speedup.
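To put those figures in perspective, here is a rough back-of-the-envelope calculation of the sort throughput. It uses only the numbers quoted above; the decimal units (1 TB = 10^12 bytes) and the helper name are assumptions for illustration, and since the 23 minutes is a rounded figure the results are approximate.

```python
# Rough throughput estimate for the 100 TB sort, using only the figures
# quoted in the blurb. Assumes decimal units (1 TB = 10**12 bytes).
DATA_BYTES = 100 * 10**12      # 100 TB sorted
ELAPSED_SECONDS = 23 * 60      # 23 minutes (rounded in the source)
NODES = 206                    # EC2 nodes used


def throughput_gb_per_s(data_bytes, seconds):
    """Average throughput in GB/s (hypothetical helper for this sketch)."""
    return data_bytes / seconds / 10**9


cluster_gb_s = throughput_gb_per_s(DATA_BYTES, ELAPSED_SECONDS)
per_node_mb_s = cluster_gb_s * 1000 / NODES

print(f"cluster:  {cluster_gb_s:.1f} GB/s")
print(f"per node: {per_node_mb_s:.0f} MB/s")
```

Under these assumptions the cluster averaged roughly 72 GB/s overall, or about 350 MB/s of sorted data per node.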

The Cloudera blog has a guest post from Syncsort on their work to add support for importing data from a mainframe to Hadoop. The post gives a bit of background about mainframes (which expose their data via FTP), the design and implementation, and experiences going through the patch submission and review process.

In another guest blog post, Syncsort writes on the Hortonworks blog about integrating their DMX-h product with Apache Ambari. DMX-h adds a new Ambari Service definition, which is exposed via the REST API.


ZoomData, makers of big data analytics and visualization software, announced $17M in Series B funding. ZoomData’s software supports Hadoop, Spark, and several other connectors.

After last week’s announcement that Cloudera has acquired visual analytics startup DataPad, we’re hearing from DataPad’s CEO and co-founder about the acquisition. This post has some background on the founding of DataPad (including the types of problems the company is trying to solve) and a glimpse into the future of DataPad’s software inside of Cloudera.

In a post celebrating Storm’s graduation from the Apache Incubator, Storm founder Nathan Marz recounts the history of the project. The post covers the creation of Storm, the process of open-sourcing Storm, the marketing and support that went into the early project, Storm’s technical evolution, and Storm at Apache.

Cloudera has written about some of the work they’ve done for Apache Spark and some of their plans for the future of Spark. Examples of completed work include improving Spark-on-YARN, better support for HDFS caching, and integrating Spark Streaming and Apache Flume. Plans for the future include Hive-on-Spark, lossless Spark Streaming, and integrating Spark with the YARN timeline server.

Cloudera and O’Reilly have announced an expanded partnership around conferences. In addition to Strata + Hadoop World in New York, the Strata conferences in Barcelona, San Jose, and London have been rebranded “Strata + Hadoop World.”

Cloudera and Teradata announced an extended partnership as they work to optimize the integration between Cloudera’s enterprise data hub and Teradata’s data warehouse through the Teradata Unified Data Architecture.

Businessweek has an article marking Hadoop’s success at permeating industries outside of Silicon Valley. They cite the Detroit Crime Commission, agriculture enterprise Monsanto, and the Indian government’s national identity registry as examples. The article also includes a discussion about the merits of open-source.

The post walks through the “data lake” metaphor… and introduces a few new metaphors along the way. There’s a good discussion of semi-structured data and the importance of generating useful data to put into a lake.

This is a quick post offering some commentary on the Cloudera-Teradata partnership announced this week. It points out that the partnership highlights the fact that Hadoop isn’t replacing the data warehouse, as a lot of folks have predicted.


A few weeks ago, Apache Accumulo 1.5.2 was released. Accumulo is a distributed key/value store based on BigTable built atop HDFS and Zookeeper. The 1.5.2 release contains performance and bug fixes for the 1.5.x branch (version 1.6.1 is the latest).

Accumulo 1.6.1 was also recently released. The release contains several performance improvements including better write-ahead log sync performance (by avoiding multiple syncs). There are also several bug fixes including a fix for upgrading from 1.5.x to 1.6.1 and an updated Guava version dependency (to match Hadoop 2.x).

Sematext announced support for monitoring of Apache Spark jobs as part of their Performance Monitoring (SPM) product. The introductory blog post includes screenshots of the Spark integration, which provides metrics for Spark Workers, Executors, and more. SPM is available either as SaaS or as an on-premises deployment.

Hadoop SaaS vendor Altiscale announced a new SQL-on-Hadoop offering this week. The system is built on Hive 0.13 and Tez, offers a web-based SQL query tool, and leverages a partnership with Simba Technologies to offer ODBC access to the service.

Fluo is a new project to add a transaction layer atop Accumulo. The first alpha release was made alongside the announcement, and it uses Apache Twill for deploying into YARN.

Apache BigTop 0.8.0 was released. For those not familiar, BigTop is a project for integrating and testing a large number of ecosystem projects. This release is based on Hadoop 2.4.1, HBase 0.98.4, the latest version of Phoenix, and contains upgrades of several other ecosystem projects.

Cloudera Live is a zero-install demo of Hadoop available via a web browser. The demo has been updated to include an interactive tutorial that includes loading data into HDFS using Flume and Sqoop, creating and querying Hive/Impala tables, and indexing data into Cloudera Search.

eBay has open-sourced Kylin, their Hadoop OLAP engine. In addition to being another SQL-on-Hadoop system (supplying ANSI SQL), Kylin supports data cubes, approximate queries (using HyperLogLog), ACLs at the Cube/Project Level, and more. In comparison to other SQL-on-Hadoop systems, Kylin is a Multi-Dimensional OLAP whereas most others are closer to Relational-OLAP. The presentation below has many more details on the system, including information on the architecture and technical pieces.

Cascading 2.6 was released. The new version includes about 20 changes, including a new DecoratorTap and DistCacheTap to wrap existing classes.

Dataguise has announced that its DgSecure data governance software for securing Hadoop deployments now supports several Hadoop-as-a-Service offerings. Those include Altiscale, Qubole, and Amazon Web Services.

Trifacta v2 was released this week. The software, which focusses on data wrangling, includes visual data profiling tools, supports many common formats from JSON to Parquet, and uses both Spark and MapReduce. More details about each of these parts are in the announcement.

HUE 3.7 was released. The new version includes a new app for Sentry with tools for managing roles and privileges, improvements to the Search app including several new widgets, and improvements to Oozie, HBase, Hive/Impala and more.

Version 0.17.0 of the Kite SDK was released. The new version adds support for namespaces, improved examples, new tools for running against development mini clusters, and more.


Curated by Mortar Data



48th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) - Wednesday, October 15

DevOps Special: Deploying Hadoop Using Docker Containers (Santa Clara) - Thursday, October 16


3rd Thursday Huddle! Hadoop & NoSQL Joining Forces (Dallas) - Thursday, October 16


A Leap Forward for SQL on Hadoop (Milwaukee) - Tuesday, October 14


HUG Pittsburgh Meetup (Pittsburgh) - Wednesday, October 15


Big Data & Analytics Developer Day (Chattanooga) - Wednesday, October 15

New York

Strata Conference Big Data: Commercialized Hadoop+Spark+R Solution (New York) - Monday, October 13

Practical On-line Approximation Algorithms in Storm with Ted Dunning - Monday, October 13

2-for-1: Resource Management in Modern Hadoop + Hadoop Application Architecture (New York) - Tuesday, October 14

Sandy Ryza: Why Is My Spark Job failing? (New York) - Tuesday, October 14

Becoming a Scalable Data Scientist with GraphLab (New York) - Wednesday, October 15

Cloudera User Group Meetup at Strata + Hadoop World (New York) - Wednesday, October 15

The Past, Present and Future of Apache Kafka (New York) - Wednesday, October 15

Why Pig? + Pig on Spark Update during Strata (New York) - Wednesday, October 15

Going Beyond Hadoop: Faster Big Data (New York) - Wednesday, October 15

Elasticsearch Meetup at Twitter (New York) - Wednesday, October 15

Sqoop Meetup at Strata + Hadoop World (New York) - Wednesday, October 15

HBase Meetup on the Night before Strata/HW (New York) - Wednesday, October 15

Big Cybersecurity Analytics Meetup with Sqrrl (New York) - Thursday, October 16

Informal Hue Meetup at Strata + Hadoop World: Hue 3.7 (New York) - Thursday, October 16


Full-day Hadoop MapReduce Hands-On (Cambridge) - Saturday, October 18


Web-Scale Data Mining and Processing (Warsaw) - Wednesday, October 15


Big Data, Bases de Données Graph, Démo Hadoop et MapReduce (Casablanca) - Wednesday, October 15


Introduction to Apache Flink (Berlin) - Wednesday, October 15


Introduction to Big Data & Hadoop (Bangalore) - Thursday, October 16

Big Data/Hadoop Forum (Chennai) - Saturday, October 18