Data Eng Weekly

Hadoop Weekly Issue #127

28 June 2015

Rather than a single theme, this week's issue has a wide range of content. On the technical side, there are articles covering R & Hadoop, Docker & Hadoop, Spark, HBase, Presto, and Cascading. In news, there are two new Hadoop books in pre-release, coverage of two online-courses, and an interesting interview with Cloudera's new VP of Engineering. For releases, there's a new version of Apache Flink, a new version of Luigi, and a new open-source workflow framework called schedoscope.


This post discusses four open-source solutions for using Hadoop with R, each of which has its own strengths and weaknesses. The options are R on a workstation or shared server to connect to Hadoop (which works with rhdfs, rhbase, RHive, and more), Revolution R Open (which works with similar tools but adds the Intel Math Kernel Libraries), and RMR2 for executing R inside of MapReduce programs.

BlueData provides a software platform for deploying Hadoop using virtualization. A recent version of the platform adds support for deployment via Docker containers in addition to hypervisors. This post compares the trade-offs of the two options in terms of performance, reliability, and security.

Spark 1.4 has greatly enhanced the builtin UI for visualizing job details. This post gives a tour of these new features, which include a timeline view of spark events (across jobs, within a job, and within a stage of a job) and the execution DAG (which shows RDD transformations and how they map to operations). There are a lot of useful features in here, such as the ability to visualize the breakdown of time spent in a stage across compute/shuffle/deserialization/serialization/etc.

Logstash 1.5 added an integration with Apache Kafka, which is the subject of this post. The article shows how to use Logstash to read and write data from Kafka, describes some of the important configuration settings of the integration, and discusses the scaling characteristics of the integration.

The Altiscale blog has an update on the efforts to integrate YARN and docker. Rather than continuing to develop the DockerContainerExecutor, the current plan is to extend the existing LinuxContainerExecutor to also support docker containers. Otherwise, the two were going to share a lot of very similar code (e.g. for creating cgroups within which to run tasks).

This post introduces Pankh, which is a demonstrative application for building a real-time stream processing system with Kafka, Spark Streaming, and HBase. The post describes the main components of the canonical stream processing architecture and describes the component implementations used by Pankh.

In the latest post in guide to MapReduce frameworks, this post describes how to implement a left-join with the Cascading Java APIs. The code verbosity fits somewhere between Pig/Hive (quite short) and raw MapReduce (quite long). The post describes the details of the implementation and the full code (including a unit test) is available on github.

A common pattern in HBase schema design is to prefix keys with a salt in order to equally distribute load (avoid hot regions) when key prefixes are changing slowly. This post describes how to build a custom InputFormat to run MapReduce jobs over a logical key range for a salted table. The implementation overrides the getSplits() method, which is described in detail.

The Teradata blog has a post describing why they've chosen to adopt Presto and their near-term plans for contributing. On the former, the post gives background on Hadapt (whose architecture didn't fit with low-latency queries), the IQ execution engine they were developing for low-latency, why IQ didn't quite fit with Tez, and several of the advantageous features of Presto.


MapR and Google Cloud Platform announced that they're teaming up to provide a $500 credit for students who register for the free MapR On-Demand Training classes. According to my calculations, this should provide enough credit to try Hadoop on multi-node clusters for several days.

The Altiscale blog aims to clear up misconceptions about Spark and Hadoop. The key point is that Spark is only replacing a small part of the Hadoop ecosytem—MapReduce—and it is often deployed insider of Hadoop via YARN and uses HDFS for storage.

The Platform has an interview with Daniel Sturman, formerly of IBM and Google, who was just hired as Vice President of Engineering at Cloudera. In the interview, Daniel discusses his history at IBM, some of the projects he worked on at Google, how various Google technology maps to open-source software, and some broader industry vision. It's interesting to hear the perspective of someone who has been working with technology that's (sometimes) years ahead of the open-source community.

The In-Memory Computing Summit is Monday and Tuesday this week in San Francisco. There are several speakers and presentations from the broader Hadoop community.

The call for abstracts for Spark Summit Europe, which takes place in Amsterdam this October, ends on July 7th. Presentations are between 15 and 30 minutes in length.

The "Scalable Machine Learning" course on edX, which is sponsored by Databricks and taught by UCLA professor AMeet Talwalkar, starts this week. The free course focusses on building principles for real-world machine learning pipelines and applying these using Apache Spark.

Two new Hadoop-related books are in pre-release and shipping soon. "Hadoop Security" describes implementing security for all parts of the Hadoop stack. "Big Data for Chimps" is a guide for data processing on Hadoop that studies several real-world problems.


Apache Ranger (incubating) 0.5.0 was released about two weeks ago, and the Hortonworks blog has the details on the new version. In addition to fixes and enhancements, the release adds centralized admin/auth/auditing for Solr/Kafka/YARN, a key management store, metadata protection for Hive, support for Solr indexing of audit data stored in HDFS, and more.

Apache Flink 0.9.0 contains a number of new features and improvements. These include exactly-once fault-tolerance for streaming programs, a table API, the Gelly Graph Processing API, the Flink Machine Learning Library, Flink on YARN with Tez, a new RPC system based on Akka, and a static code analysis for the Flink Optimizer. The release announcement has details on these features and other improvements and fixes.

Version 1.3.0 of Luigi, the open-source workflow framework, was released. This new version includes improvements to the centralized scheduler, bug fixes, initial support for Google Cloud Storage and Google BigQuery, and more.

CDH 5.3.5 was released with a fix for rolling upgrades.

Version 2.1 of Elasticsearch for Apache Hadoop adds support for Spark, Spark SQL, and Storm. The Spark integration supports translation of Spark SQL to Elasticsearch's DSL and the Storm integration provides read/write to Elasticsearch. The new version is certified for CDH 5.x, Databricks Spark, HDP 2.x, and MapR 4.x.

Schedoscope is a new open-source project providing "a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days." Datasets (including dependencies) are defined using a scala DSL, which can embed a SQL query to build the dataset. The tool includes a test framework to verify logic and a command line utility to load and reload data.


Curated by Datadog ( )



The Spark Kernel (San Francisco) - Tuesday, June 30

July 2015 Meetup (San Francisco) - Wednesday, July 1


Real-Time Processing on Hadoop for IoT: Overview and Demo (Tempe) - Wednesday, July 1


Overview of Apache Flink: Next-Gen Big Data Analytics Framework (Chicago) - Tuesday, June 30


Apache Spark: What Is All the Hype About? (Saint Petersburg) - Tuesday, June 30

New Jersey

ELK Stack and Spark (Princeton) - Tuesday, June 30

New York

Real-Time Big Data Analytics with Apache Solr and Spark (New York) - Wednesday, July 1


Facial Recognition + GraphX for Graph Analytics (Cambridge) - Tuesday, June 30


Why Switch from Hadoop to Spark? (Buenos Aires) - Monday, June 29


Share and Analyze Genomic Data at Scale with Spark (London) - Wednesday, July 1


IBM | Spark Signature Moment Singapore Event (Singapore) - Thursday, July 2