Data Eng Weekly

Hadoop Weekly Issue #114

29 March 2015

It's no longer a surprise when Spark is a big topic in an issue of Hadoop Weekly, but there are four great posts this week covering optimizing Spark programs, new features in Spark 1.3, and a case study. Other topics covered include Docker in YARN, Kerberos-enabled Hadoop, and Kafka. Also be sure to check out the releases, including a new golang implementation of Avro from LinkedIn.


This tutorial describes how to build a Kerberos-enabled Hadoop cluster inside of a VM (the steps are valuable outside of a VM, too). The author provides a script for setting up Kerberos before running the quickstart wizard that comes with Cloudera Manager. The script, which includes thorough comments, makes Kerberos much less intimidating.

This post provides a brief introduction to the DockerContainerExecutor that was introduced in YARN as part of Apache Hadoop 2.6. It describes one of the main motivations for running inside of Docker containers: managing system-level dependencies.

The following slides and video are from a presentation given at the recent Strata San Jose conference on optimizing Spark programs. Topics covered include understanding shuffle in Spark (and common problems), understanding which code runs on the client vs. the workers, and tips for organizing code for reusability and testability.

As noted in the Apache Spark 1.3 release, Spark SQL is no longer alpha. This post explains that leaving alpha comes with a guarantee of binary compatibility across Spark 1.x. It also describes some plans for improving Spark SQL (better integration with Hive), the new data sources API, improvements to Parquet support (automatic partition discovery and schema migration), and support for JDBC sources.

The Cloudera blog has a post from a software engineer on how their team built a Spark Streaming-based analytics dashboard to monitor traffic related to Super Bowl ads. The system also uses Flume, HBase, Solr, Morphlines, and Banana (a port of Kibana to Solr) as well as Algebird's implementation of HyperLogLog. The post is a good end-to-end description of how the system was built and how it works (with screenshots).

For those looking to scale machine learning implementations, the Databricks blog has a post on Spark 1.3's implementation of Latent Dirichlet Allocation (LDA). The post describes LDA, common use cases, and how it's implemented on top of GraphX (the graph API for Spark).

This post describes how to enable support for impersonation from Hue in HBase so that users can only view or modify data that their HBase permissions allow. It also describes how to configure the HBase Thrift Server for Kerberos authentication. There are screenshots of the Hue-HBase application, and several troubleshooting steps for common configuration issues.

As a developer, it's easy to get used to the peculiarities of a system you're working with. It's good to take a step back and understand these issues (or even decide if they really are issues!). In this case, a blog post gathers feedback on "what is confusing about Kafka?" In addition to collecting the feedback, there are responses/links for several of the issues.

The Hortonworks blog has the third part in a series on anomaly detection in healthcare data. In this post, they use SociaLite, an open-source graph analysis framework, to compute a variant of PageRank. The post gives an overview of SociaLite (which integrates with Python) and describes the implementation used to find anomalies. All code is available on GitHub.
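For readers who haven't implemented PageRank before, here is a minimal sketch of the plain power-iteration algorithm in pure Python. It is not the variant from the post and it does not use SociaLite; the function name, the damping factor of 0.85, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph maps each node to the list of nodes it links to.
    Returns a dict of node -> score; scores sum to 1.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                dangling += damping * rank[node] / n
        for node in nodes:
            new_rank[node] += dangling
        rank = new_rank
    return rank


# Hypothetical toy graph: three nodes all link to "hub".
scores = pagerank({"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": []})
top_node = max(scores, key=scores.get)
```

In an anomaly-detection setting, nodes whose scores differ sharply from those of their peers would be the ones to inspect; how the post defines that comparison is described in the post itself.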

Most folks working with batch systems start out with a simple workflow system that spawns one job after another via cron. From there, they often move to a job that runs based on the availability of input data. As a post on the Cask blog explains, it's difficult to implement a data-driven workflow efficiently. Most systems poll for the availability of input, which can be slow. The Cask Data Application Platform (CDAP) uses notifications to trigger jobs. The post describes the architecture in greater detail.
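The difference between polling and notification-driven triggering can be sketched in a few lines of Python. This is a toy in-process publish/subscribe bus, not CDAP's API; the class name `NotificationBus`, the dataset name, and the partition label are all hypothetical.

```python
from collections import defaultdict


class NotificationBus:
    """Toy in-process bus: jobs subscribe to a dataset and are invoked
    as soon as a new partition is announced, with no polling loop."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, dataset, callback):
        # Register a job to run whenever `dataset` receives new data.
        self._subscribers[dataset].append(callback)

    def publish(self, dataset, partition):
        # The writer announces new data; each subscribed job runs immediately.
        for callback in self._subscribers[dataset]:
            callback(partition)


runs = []  # records which jobs were triggered, for illustration
bus = NotificationBus()
bus.subscribe("clickstream", lambda partition: runs.append(f"aggregate {partition}"))

# The producer finishes writing a partition and notifies the bus.
bus.publish("clickstream", "2015-03-29")
```

The job starts when the producer publishes, rather than up to one polling interval later, and nothing is spent checking for input that hasn't arrived yet.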


This post seeks to help understand the role of stream processing in the big data ecosystem. The author interviews several folks in industry, including Hadoop creator Doug Cutting, trying to answer the question "will streaming completely replace batch?" Reactions are mixed, but everyone seems to agree that stream processing tools for big data are getting better.

The agenda for HBaseCon, which takes place May 7th in San Francisco, has been posted. The conference has four tracks—Operations, Development and Internals, Ecosystem, and Use Cases.

O'Reilly has a new video training, "Introduction to Apache Kafka" by Gwen Shapira. The training is just under three hours and is aimed at developers and administrators.


Cloudera announced a maintenance release of Apache Accumulo for CDH 5 to fix the POODLE vulnerability.

Version 1.1.2 of Luigi, the workflow management tool, was recently released. The new version includes improved support for Spark.

The SDK for Google's Cloud Dataflow (similar to many DSLs like Scalding and Spark) is open source. The main "runner" implementation uses the Google Cloud Platform, but there's also an implementation for Apache Spark. This week, the Apache Flink project announced a runner, which allows any pipeline written for Cloud Dataflow to run on a Flink cluster.

MicroStrategy announced that Apache Drill is certified with the MicroStrategy Analytics Enterprise Platform. The MapR blog has a brief introduction of how to configure the integration.

EMC has announced the Federation Business Data Lake, which combines several pieces of software with hardware. The software includes Pivotal HD (with mention of the Open Data Platform) and hardware includes EMC Isilon.

Cloudera Director 1.1.1 was released this week. Cloudera Director is a tool for provisioning and managing Hadoop clusters in AWS. This release includes several bug fixes and documentation updates.

Cask has announced version 2.8.0 of the Cask Data Application Platform (CDAP). The new version adds namespaces, fork/join for the workflow system, a new metrics layer, and more.

Sematext, makers of the SPM Performance Monitoring system, have announced that SPM now supports monitoring, alerting, and anomaly detection for Apache HBase 0.98. The tool monitors a number of metrics including cache, replication, the WAL, and much more (290 metrics in total).

LinkedIn has open-sourced a golang library for Apache Avro. The library, called Goavro, supports decoding and encoding of data according to version 1.7.7 of the Avro specification. More details (including a few limitations) are described on the GitHub site.
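To give a flavor of what encoding "according to the Avro specification" involves, here is a small Python sketch of how the specification's binary encoding represents a `long`: zig-zag coding followed by a variable-length integer. This is not Goavro's API (Goavro is a Go library); the function names are made up for illustration.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as an Avro binary long."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit long")
    # Zig-zag: interleave negatives and positives so that values of
    # small magnitude become small unsigned numbers (0, -1, 1, -2, ...).
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Varint: 7 bits per byte, low-order group first; the high bit
    # of each byte signals that another byte follows.
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Decode an Avro binary long from the start of `data`."""
    z = 0
    shift = 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo the zig-zag mapping.
    return (z >> 1) ^ -(z & 1)
```

Small magnitudes, positive or negative, fit in a single byte, which is one reason Avro data files tend to be compact.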

Version 0.9.6 of RDMA for Apache Hadoop was released this week. The package is a derivative of Apache Hadoop that allows a cluster to use remote direct memory access (RDMA) interconnects to improve performance. It supports Lustre as well as a hybrid file system in which data is stored both in memory and on disk.


Curated by Datadog



Building an Enterprise Company in a Consumer World, by Mike Olson of Cloudera (Palo Alto) - Wednesday, April 1

Getting Started with Spark & Cassandra, by Jon Haddad of Datastax (Culver City) - Thursday, April 2


Analyzing Real-World Data with Apache Drill and Hadoop (Tempe) - Wednesday, April 1


A Taste of Scala (Saint Paul) - Thursday, April 2


Hadoop + Spark (Northbrook) - Wednesday, April 1

Hadoop Data Hub, New Approaches to Data Management and Discovery (Chicago) - Thursday, April 2


Hadoop POC: Lessons Learned at American Family Insurance (Madison) - Tuesday, March 31


Machine Learning with Big Data Using Apache Spark (Okemos) - Tuesday, March 31


Apache Phoenix for HBase & Hadoop (Philadelphia) - Tuesday, March 31


Hands-on: Scalable Big Graph Data Processing in Spark (Vancouver) - Tuesday, March 31


Deep Dive into Apache Cassandra, with an Intro to Apache Spark Integration (Manchester) - Wednesday, April 1


The State of Flink and the Road Ahead (Berlin) - Tuesday, March 31


Bluemix Hadoop (Tokyo) - Tuesday, March 31
