Data Eng Weekly

Hadoop Weekly Issue #90

05 October 2014

It’s a relatively quiet week with only two releases (the calm before the Strata + Hadoop World storm?). In the technical and news areas, two themes are playing out this week. First, there is a lot of great content on stream processing frameworks, namely Storm and Spark Streaming. Second, there are several articles about integrating YARN with other systems and frameworks (OpenStack, Mesos, AWS). There are also pieces on Spark MLlib, RStudio on Amazon EMR, and the cost-based optimizer for Hive, so there is something for everyone.


Getting started with a new distributed system typically requires looking through tutorials, documentation, and even source code. This presentation aims to gather all of that information (and more) into a single training deck for Apache Storm. It covers five key areas: an introduction, Storm’s core concepts, operational considerations, Storm app examples, and Wirbelsturm for local development.

This presentation gives an introduction to Apache Optiq (incubating) and describes how the Optiq cost-based optimizer is being added to Apache Hive 0.14. There are some examples of optimizing the query plan for star schema, left-deep tree, and bushy tree queries. It also explores the importance of having statistics about the data, and there are some impressive benchmarks on TPC-DS queries at the end.

This post walks through five different types of logs that are important for understanding and debugging a Hadoop cluster. Given that YARN is relatively new, this is a good introduction to the new types of logs introduced in recent versions of Hadoop.

Spark’s MLlib contains a decision tree implementation that can be used in data classification problems. Even if you don’t know what a decision tree is, the article contains an introduction before it dives into the technical details. The post has an example in Python (and links to examples for Java and Scala), describes the optimizations in the implementation, and has an overview of scalability (both dataset size and number of features). There were also some impressive speed gains in Spark 1.1 vs. Spark 1.0.
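For readers new to decision trees, the core step is choosing a split that makes the resulting groups as pure as possible. The following is a minimal pure-Python sketch of that idea for a single numeric feature, using Gini impurity. It is not MLlib's implementation or API, and the function names `gini` and `best_split` are hypothetical names chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(values, labels):
    """Find the 'value <= threshold' split with the lowest weighted impurity.

    Returns (threshold, weighted_impurity). The threshold is None when no
    split improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        # Only consider boundaries between distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    # Two clean groups: the best threshold falls between 2.0 and 3.0.
    print(best_split([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```

A full tree repeats this search across every feature and recurses on each side of the chosen split. Doing that efficiently on distributed data is where the optimizations described in the post come in.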

DataStax Enterprise 4.5 integrates Apache Cassandra with Apache Spark using the Spark Cassandra Connector. This post includes a walkthrough of using Spark’s MLlib with data stored in Cassandra.

The SequenceIQ blog has an example of implementing a correlation function for Spark. While the implementation duplicates some functionality found in MLlib, the example shows how to write testable Spark code (and has example tests). The code is available in its entirety on GitHub.
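One common way to make this kind of code testable is to keep the math in a plain function that needs no cluster, and only then wire it into a Spark job. The sketch below shows that pattern with a Pearson correlation in pure Python. It is not the SequenceIQ code, and the function name `pearson` is a hypothetical name chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


def test_pearson():
    # Perfectly linear relationships give +1 or -1 (up to rounding).
    assert abs(pearson([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-12
    assert abs(pearson([1, 2, 3], [3, 2, 1]) + 1.0) < 1e-12


if __name__ == "__main__":
    test_pearson()
    print("ok")
```

Because the function takes ordinary sequences, the tests run in milliseconds without starting a Spark context.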

Many folks get started with Hadoop in the cloud and end up storing data in object stores like S3 as a result. This post from the Altiscale blog discusses some of the drawbacks of storing data in an object store vs. a true file system.

Datameer has written about how they’ve reengineered the backend to Datameer 5 to be framework agnostic. Previously, the system was tightly coupled with MapReduce, but it can now also use Tez and small job/local execution engines. The post also describes why they use Apache Tez over Spark (although they do say that Spark will eventually be integrated).

While Spark has had integration with Kafka for several releases, this post goes much further than the Spark-bundled KafkaWordCount example. In fact, the post contains everything needed to get started with Kafka and Spark Streaming—including overviews of both systems that describe core concepts. The post culminates with a full example that reads Avro-encoded data from Kafka (in parallel across partitions), does some simple computing, and writes the data back to Kafka. There is also a summary of known issues, testing, and performance testing.

This post shows how to build an Amazon Elastic MapReduce (EMR) cluster that integrates RStudio. After bootstrapping a cluster, it walks through changing security settings to allow access to the RStudio web interface, describes how to use the rmr2 package to run a MapReduce job from R, and shows how to pull in some real-world (global weather measurement) data for analysis.

This tutorial explains how to install Apache Spark in the MapR sandbox (a VM running in VMware or VirtualBox). After that, it has some examples with the spark-shell to run simple queries against a text-based Spark RDD.


In recent years, a number of systems for managing clusters in a general-purpose way have emerged. Among them are YARN, Mesos, Kubernetes, and OpenShift. It seems likely that we won’t see one clear winner, but that these systems will learn to coexist. This post on the Hortonworks blog describes plans for integrating OpenShift and Kubernetes with YARN.

Meanwhile, Myriad, a framework for Mesos, is looking to integrate YARN and Mesos, but in the other direction. In short, Myriad is used for scaling YARN clusters in Mesos. This post has some more details on Myriad and its roadmap.

Cloudera announced the addition of Martin Cole (former Group Chief Executive of Technology at Accenture) and Steve Sordello (CFO, LinkedIn) to their board of directors. The new appointees will work on extending Cloudera’s vertical applications and serve as the Audit Committee Chair, respectively. While these appointments are well deserved, they also bring the gender composition of the board members of top Hadoop vendors (Cloudera, Hortonworks, and MapR) to 20:1.

A new book from O’Reilly, “Getting Started with Impala,” is now in early release. A post introducing the book has a Q&A with the book’s author, John Russell.

Cloudera announced this week that they’ve acquired DataPad, makers of collaborative BI/analytics software. In the press release, Cloudera says that DataPad’s co-founders will build data backends for business intelligence tools aimed at “simplifying use of Cloudera’s products.”

This post questions the conventional wisdom of running a real-time database separately from a Hadoop cluster. It discusses a few arguments for running NoSQL solutions on Hadoop (real-time analytics, scalable storage) and several DB-on-Hadoop solutions like MapR-DB, HBase, and Apache Accumulo.

O’Reilly has announced a new book by Jay Kreps on logging in distributed systems. The book is based on several blog posts, and covers a number of concepts at the heart of a big data platform.

SequenceIQ announced that they’ve joined the Hortonworks Technology Partner Program. SequenceIQ is developing Cloudbreak, a cloud-agnostic tool for provisioning and autoscaling HDP clusters.

Hortonworks and Oracle have announced that the Oracle Data Integrator (ODI) is certified with HDP 2.1.

Datanami has an overview of the Forrester Wave report on NoSQL databases. The report looked at key-value databases and document-oriented systems. Product offerings from MapR, DataStax, and Amazon Web Services all scored high in the report.

The DBMS2 blog has two posts this week, the first on Streaming for Hadoop. It discusses both stream processing frameworks (Spark Streaming, Storm) and data transfer systems (Flume, Kafka) in the wild. There are some interesting observations, such as that Kafka is being used by internet companies more than enterprises (citing lack of security as a concern). The post also tries to articulate the politics of streaming software tools with respect to vendors.

This post rehashes the argument of whether Spark or Tez is the successor to MapReduce. While many companies seem to be throwing their weight behind Spark, Hortonworks sees a place for both Spark and Tez.


Ferry is a tool for provisioning distributed systems (with a focus on several in the Hadoop ecosystem). It began as a tool for running a local setup in Docker containers, but has recently announced support for OpenStack and Amazon Web Services. With this addition, it’s incredibly easy to build a Hadoop cluster (with whichever components you want) inside of an Amazon VPC.

Red Hat Storage Server 3 was announced. The new version adds a plug-in for the Hadoop FileSystem API and integration with Apache Ambari.


Curated by Mortar Data



Making Hadoop Enterprise Ready, by Brett Rudenstein of WANDisco (Santa Monica) - Monday, October 6

Self-Service Data Exploration Using Apache Drill, by David Kewley of MapR (El Segundo) - Thursday, October 9

#SDBigData Monthly Meetup (San Diego) - Wednesday, October 8

Washington State

Deep Dive into Spark, Tachyon, and Mesos Internals (Bellevue) - Wednesday, October 8


PDI on Hadoop (Addison) - Monday, October 6

AT&T Foundry Tour and Meetup with AT&T Employees and Big Data in the Big D (Plano) - Thursday, October 9

Kiyu Gabriel: Cassandra and DataStax (Houston) - Wednesday, October 8

Introduction to Hadoop Course, Part 2 (Austin) - Saturday, October 11


AZSSUG Oct Meeting: Big Data Presenters Josh Sivey/Orion Gebremedhin (Tempe) - Wednesday, October 8


Celebrate Data Science in the Cloud (Denver) - Thursday, October 9


R on Hadoop (Saint Petersburg) - Wednesday, October 8

New Jersey

Apache Ambari and Slider: Deployment & Resource Management (Hamilton Township) - Tuesday, October 7

New York

This Ain't Your Father's Search Engine (New York) - Thursday, October 9


Hadoop User Group (Paris) - Monday, October 6


NoSQL in a Hadoop World (Manchester) - Tuesday, October 7

October Hadoop Meetup (London) - Tuesday, October 7


Rapture I/O + Apache Spark (Prague) - Tuesday, October 7


Introducing Apache Flink (+) Hadoop Operations Powered By ... Hadoop (Stockholm) - Wednesday, October 8


Hadoop Security and Apache Sentry (Hyderabad) - Thursday, October 9

Hadoop Workshop (Hyderabad) - Thursday, October 9