Data Eng Weekly

Hadoop Weekly Issue #89

28 September 2014

This week’s issue has a lot of great content. It includes new open-source projects from Netflix and LinkedIn, several articles about Apache Spark (including details from Hortonworks on their plans for it), and news on Cascading on Tez. There’s also coverage of news in the ecosystem and several additional releases.


This post is aimed at getting started with a non-trivial Spark cluster without any existing infrastructure. It leverages Apache Mesos via the free tier of Mesosphere for Google Cloud Platform. The tutorial explains how to launch a cluster, download VPN credentials in order to access the cluster, and access the Spark and Mesos consoles, and it details running the Spark shell to execute a simple distributed computation.

A few months back, Cloudera announced that they plan to adopt Apache Spark as a successor to MapReduce for many systems. In the time since then, a lot of work has gone into making that a reality. This post gives updates on the status of Spark integration for Apache Crunch, the Kite SDK, Apache Solr, Apache Pig, and Apache Hive.

This is a quick walkthrough of setting up Apache Drill with Pentaho Data Integration (PDI). There are instructions for starting an embedded Apache Drill and connecting PDI to Drill in order to execute a simple query against a JSON file.

Two folks from the Hadoop team at Yahoo have shared details on and recommendations for when to use Spark and Storm. Their presentation includes an introduction to both of these technologies (including example applications), and a detailed overview of the strengths and weaknesses of both.

The Databricks blog has details on two performance improvements in Spark 1.1: torrent broadcast and tree aggregation. Both improvements make better use of the network, which leads to 1.5-5x speed improvements in MLlib (ML algorithms tend to broadcast and aggregate lots of data across several iterations).
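
To make the idea of tree aggregation concrete, here is a minimal plain-Python sketch. It is not Spark's implementation; the function name `tree_aggregate` and the `fan_in` parameter are hypothetical, chosen only for illustration. It assumes each partition has already produced a partial result, and it merges those results in levels so that no single node has to receive every partial at once.

```python
from functools import reduce


def tree_aggregate(partials, combine, fan_in=2):
    """Merge partial results level by level (illustrative only, not Spark code).

    `fan_in` is a hypothetical knob: how many partial results are merged
    together at each node of the tree.
    """
    if not partials:
        raise ValueError("need at least one partial result")
    level = list(partials)
    rounds = 0
    while len(level) > 1:
        # Each group of `fan_in` neighbours is merged by one simulated worker,
        # so the next level has roughly len(level) / fan_in results.
        level = [
            reduce(combine, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        rounds += 1
    return level[0], rounds


# Eight simulated partitions, each holding the sum of ten consecutive integers.
partition_sums = [sum(range(i * 10, (i + 1) * 10)) for i in range(8)]
total, rounds = tree_aggregate(partition_sums, lambda a, b: a + b)
```

With eight partials and a fan-in of two, the merge takes three rounds, and the final step combines only two values instead of eight.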

Databricks has created two reference applications—for analyzing logs and classifying the language of a tweet stream. These references aim to show how to build a fuller-featured application than is included in a basic tutorial/walkthrough.

Most developers won’t need to use the Apache Tez API directly. It’s predominantly intended to be used by other frameworks (e.g. Hive, Pig, and Cascading have all been built atop Tez). But if you’re interested in what a standalone Tez application looks like, this post describes how to do a top-K calculation using Tez. It includes snippets (with descriptions), and the full code is available on GitHub.
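
For readers unfamiliar with top-K, the following plain-Python sketch shows the general shape of the computation. It does not use the Tez API and is not taken from the post. It assumes the goal is to find the K most frequent items, and the names `top_k` and `partitions` are hypothetical.

```python
import heapq
from collections import Counter


def top_k(partitions, k):
    """Return the k most frequent items across all partitions.

    Stage 1 counts items within each partition; stage 2 merges the
    per-partition counts and keeps the k largest. This is an illustrative
    sketch, not how the Tez example is written.
    """
    total = Counter()
    for part in partitions:
        total.update(Counter(part))  # merge this partition's counts
    return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])


# Two simulated input partitions.
partitions = [["tez", "hive", "tez"], ["pig", "hive", "tez"]]
result = top_k(partitions, 2)
```

Counting per partition first mirrors the usual distributed layout, where only the compact counts (not the raw records) need to be moved to the stage that picks the final K.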

Cloudera has posted new benchmarks for their SQL-on-Hadoop system, Impala (as always with a vendor benchmark, you might find different results with your own data). This time, they’ve compared it to Hive-on-Tez, Spark SQL, and Presto on a 21-node cluster. The results show that Impala has much higher query throughput in a multi-user environment and that it’s faster for three different types of single-user queries, too.

Based on experience from having Apache Spark available as a tech preview for HDP, Hortonworks has put together a two-phase initiative to improve Spark. Phase 1 consists of improved integration with Apache Hive, support for the ORCFile format, and improvements to security and operations (namely integration into Apache Ambari). Phase 2 focuses on improving scale and reliability (mostly around YARN integration), improving debuggability, adding wire encryption and authorization, and integrating with the YARN Application Timeline Server.

The Hortonworks blog has a guest post from Concurrent CEO Gary Nakamura on the state of Cascading on Apache Tez. In a recent milestone, Cascading 3.0 WIP added support for Apache Tez as part of a new pluggable query planner. Future work includes improving scalability and performance and adding support for other Cascading-powered libraries such as Scalding and Cascalog.

This post walks through enabling SSL encryption (including a client key) between Hue and Hive. It has an overview of the network communication between the two services in an encrypted setup, a guide for generating keys with keytool and openssl, and example configuration files.

Hortonworks has put together a few tutorials for Apache Kafka and Apache Storm. The first tutorial uses Kafka as a transport for real-time trucking events, the second shows how to consume data in real-time using Storm, and the third is the old standby, WordCount, in real-time with Storm.


The Qubole blog has a recap of some recent announcements and news related to Hadoop. There’s some overlap with the content of this newsletter, but there are several new articles as well.

Apache Storm recently graduated from the Apache Incubator. A post on the Hortonworks blog has a bit more about Storm, how it’s being used, and background on incubator graduation.

Hadoop startup Continuuity has rebranded itself as Cask. At the same time, they’re open-sourcing/rebranding several products. First, their flagship product, Continuuity Reactor, is now open-sourced as the Cask Data Application Platform. Second, they announced a preview release of a real-time stream processing framework called Tigon that was built in conjunction with AT&T Labs. Third, there is a new name for their cluster management software (formerly Continuuity Loom), Coopr.

Videos of the presentations at the recent Strange Loop conference are now available on YouTube. The talks cover a number of topics ranging from programming languages to distributed systems. From the Hadoop ecosystem, there are talks on Samza and Cassandra.

SequenceIQ, makers of the Hadoop-as-a-Service platform Cloudbreak, announced an investment from Euroventures. The financial details of the deal were not disclosed.


HBase 0.99.0 was released. This is a developer preview, which is not intended for production use. It contains a number of enhancements (over 1,000 tickets were resolved) that will eventually become the basis of the 1.0 release. Highlights include removal of Hadoop 1.x support, support for stripe compaction, and the addition of a Dockerfile to run HBase from source.

Cloudera Enterprise 5.1.3 was released this week. It contains fixes/improvements across Hadoop, HBase, HDFS, Hue, Hive, Impala, Oozie, YARN, Cloudera Manager, and Cloudera Navigator.

Cloudera has announced a new version of their ODBC drivers for Apache Hive and Impala. The release contains bug fixes, among them better support for DECIMAL data types.

Hortonworks has updated the Spark technical preview to include Spark 1.1.0. Notable fixes include better integration with Hive 0.13 and support for ORCFile.

Inviso is a new open-source tool from Netflix for Hadoop job search and visualization. Job search is powered by ElasticSearch, which indexes job configurations. The visualization portion of the application includes plots of task attempts for a job, which are loaded from job history files. See the post for more details, including screenshots of the interface.

LinkedIn has open-sourced ml-ease, a large-scale machine learning library that includes backends for Hadoop and Spark. The software, which is available under an Apache License, supports Alternating Direction Method of Multipliers (ADMM) logistic regression.

Version 1.0.19 of Luigi, the workflow management system, was released. This release includes centralized resource limits, S3 API improvements, test fixes, and more.


Curated by Mortar Data



Women in Analytics: Big Data Hadoop and Other Databases Pros/Cons (San Francisco) - Thursday, October 2

Securing Enterprise Data with Hadoop: What Are Your Options? (Santa Clara) - Thursday, October 2

Big Data and Data-Driven Business Security Considerations (Fremont) - Thursday, October 2


Reporting against Hadoop Data Sources Using Jaspersoft (Tempe) - Wednesday, October 1


How Apache Spark Fits into the Big Data Landscape (Westminster) - Thursday, October 2


Big Data Developer Day: A Leap Forward for SQL on Hadoop (Hopkins) - Wednesday, October 1


Big Data Everywhere (Chicago) - Wednesday, October 1


Talend: DI & Hadoop Integration Albert Mayer (Cincinnati) - Friday, October 3


Big Data Developer Kickoff (Calgary, Alberta) - Friday, October 3


HUG Italy: Primo Incontro a Milano (Milan) - Tuesday, September 30


Workshop: How to Think in MapReduce (Cluj-Napoca) - Tuesday, September 30


Architecture Night (Singapore) - Thursday, October 2