Data Eng Weekly

Hadoop Weekly Issue #84

24 August 2014

This week’s edition has a lot of great technical content from prominent Hadoop vendors Hortonworks and Cloudera as well as newcomer SequenceIQ. There are also a couple of interesting articles based on real-world experience covering an A/B testing platform and Apache Zookeeper. Those types of articles tend to be quite good but more difficult to find—as always, if you have suggestions for the newsletter please send them my way!


Hortonworks has posted a video series on the most recent release of their distribution, HDP 2.1. The videos, which are recordings of several webinars, cover a large number of components including YARN, HDFS, Hive, and Ambari.

A guest post on the Hortonworks blog describes how SAS is working to bring their High-Performance Analytics (HPA) and LASR Analytics Server to YARN. The systems were originally built to run as MPI applications in which SSH was used to launch processes. With YARN, HPA uses the framework for process management, and there are improvements like enforcing CPU and memory limitations.

The Hortonworks blog has a post on an in-progress feature called container delegation. Before diving into container delegation, the post gives an intro to YARN’s resource and workload management. The new feature will be used, among other things, to provide additional per-query resources to a long-running application.

The SequenceIQ blog has a post on the YARN FairScheduler. The post has an introduction to the FairScheduler, the scheduling challenges, and some of its configuration options. Using an example test and an R-based analysis tool (which is open-sourced), the post finds that the FairScheduler is good at maintaining fairness.
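
As a rough illustration of what "fairness" means in this context, the following sketch divides a fixed capacity among queues using max-min fair sharing. It is a simplified model and not the FairScheduler's actual implementation (weights, minimum shares, and preemption are left out); the function name, queue names, and numbers are hypothetical.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` across queues so that no queue gets more than it
    demands and leftover capacity is shared equally among the rest."""
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    unsatisfied = {name for name, demand in demands.items() if demand > 0}

    while unsatisfied and remaining > 1e-9:
        # Offer every unsatisfied queue an equal slice of what is left.
        share = remaining / len(unsatisfied)
        satisfied_now = set()
        for name in unsatisfied:
            need = demands[name] - allocation[name]
            grant = min(share, need)
            allocation[name] += grant
            remaining -= grant
            if allocation[name] >= demands[name] - 1e-9:
                satisfied_now.add(name)
        if not satisfied_now:
            # Every queue took a full slice, so the capacity is used up.
            break
        unsatisfied -= satisfied_now

    return allocation


if __name__ == "__main__":
    # Hypothetical queues and demands, in arbitrary resource units.
    print(max_min_fair_share(100, {"etl": 20, "adhoc": 50, "reports": 80}))
```

In this example the small queue receives its full demand, and the two larger queues split the remainder equally.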

The Hortonworks blog has had a number of security-related posts in the past week. This post summarizes the coverage, which includes posts on Apache Argus and Apache Knox. It also discusses posts from some partner vendors: Protegrity, Voltage Security, and Dataguise. Finally, it touches on some new Hadoop features: Transparent Data Encryption for HDFS, and a Key Provider API with an accompanying Key Management Server.

Apache Spark ships with the spark-submit script for submitting a job to a Spark cluster. Sometimes, it’s useful or necessary to programmatically submit a job. This post describes how to write a Scala program to do so, and how to invoke the resulting binary jar.

This post serves as a good introduction to partitioning of Hive tables. It outlines the motivation and benefits of partitioning and includes several tips and best practices.

The Cloudera blog has a post with several tips and examples for writing powerful Hive queries. It includes example queries with the LAG and LEAD analytic functions as well as using LATERAL VIEW and a UDTF to execute nested SQL queries. It also suggests some ways of organizing data, including the notion of a “supernova schema,” which is somewhat akin to a star schema materialized as a single table.

DZone has published a cheat sheet for Apache Hadoop. It includes things like HDFS architecture, HDFS command line examples, an overview of YARN, and an introduction to MapReduce. It also covers Pig and Hive as well as providing links to several ecosystem projects.

Camille Fournier, Zookeeper PMC and Rent the Runway CTO, spoke on using Zookeeper in the wild. Her talk covers a number of systems that use Zookeeper as well as a number that do not. One of her conclusions is that, while Zookeeper has a number of use-cases, it’s not always the best tool for the job.

The Pinterest engineering blog has a post on their A/B analytics platform. The post covers the implementation, which uses Kafka, Storm, MapReduce, HBase, and more. There’s an overview of the MapReduce workflow, the serving of metrics via HBase, and real-time processing via Storm. There’s also a discussion of statistical significance and group validation via chi-square.
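
For readers unfamiliar with the chi-square check mentioned above, here is a minimal sketch of a chi-square test of independence on a 2x2 table of conversions per experiment group. It is a generic textbook version (no continuity correction) and is not taken from Pinterest's platform; the function name and the counts are hypothetical.

```python
import math


def chi_square_2x2(conv_a, total_a, conv_b, total_b):
    """Return (statistic, p_value) for a 2x2 test of independence
    between experiment group and conversion outcome."""
    observed = [
        [conv_a, total_a - conv_a],  # group A: converted, not converted
        [conv_b, total_b - conv_b],  # group B: converted, not converted
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With one degree of freedom, the chi-square tail probability
    # can be written using the complementary error function.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical counts: 100/1000 conversions in A, 150/1000 in B.
    stat, p = chi_square_2x2(100, 1000, 150, 1000)
    print(f"chi-square = {stat:.3f}, p = {p:.5f}")
```

A small p-value suggests the difference in conversion rates between the groups is unlikely to be due to chance alone. The same statistic can also be used to check that group sizes match the intended traffic split.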


A new book on Apache Flume is in early release and available as an eBook from O’Reilly. The book is aimed at developers deploying and customizing Flume.

Allied Market Research recently released a report on the Hadoop-as-a-Service (HaaS) market. It expects that market to grow rapidly to $16.1B by 2020. The report notes that HaaS doubled from 2012 to 2013, and it expects that HaaS will become more and more competitive with on-premises deployments.

TPCx-HS is a new benchmark specification aimed at measuring the Hadoop Runtime, Hadoop Filesystem API implementations, and MapReduce layers. It is claimed to be the first “Industry Standard Big Data Benchmark,” and there are already plans for additional benchmarks. The ODBMS blog has an interview with Francois Raab, the author of the TPC-C Benchmark, and Yanpei Chen of the Performance Engineering Team at Cloudera. In the interview, they discuss some plans for big data benchmarks in more detail.

Using Apache BigTop, CDH5 has been tested in conjunction with GlusterFS 3.3 (specifically its glusterfs-hadoop FileSystem). There are some more details on the implementation in a guest post on the Cloudera blog.

The MapR blog has a transcript and video of a recent presentation by their CEO John Schroeder where he spends 5 minutes talking about several applications of Hadoop. He talks about the Aadhaar project’s biometric database, health care, advertising, music personalization, and MinuteSort.


Version 0.16.0 of the Kite SDK was released. This release adds support for Apache Spark, adds a new command-line ETL tool, fixes generation of Parquet Hive tables on Hive 0.13+, and adds a new parent pom for Kite SDK apps written for CDH5.

The folks at SequenceIQ have released a new Docker image for Apache Hadoop 2.5.0. Like previous versions, there are pseudo-distributed and fully distributed variants of the image. The image uses Apache Ambari to provision a cluster.

Microsoft made some announcements about their Azure cloud services this week. Among them, they announced the general availability of Apache HBase for HDInsight. The service had been in preview since June.

Spindle is a new analytics platform recently open-sourced by Adobe Research. It combines Apache Spark for processing, Apache Parquet for a data storage format, and a Spray-based HTTP server.

Mortar, the Hadoop/Pig as a Service system, has announced support for running jobs in local mode to improve development iteration.


Curated by Mortar Data



eHarmony's Hadoop Program (Irvine) - Thursday, August 28

Cybersecurity & Big Data Analytics with Hadoop (Mountain View) - Thursday, August 28

HBase Meetup @ Sift Science (San Francisco) - Thursday, August 28


MongoDB and Hadoop: Driving Business Insights (Austin) - Monday, August 25


Enabling Advanced Analytics & From Sandbox to Production PA (Kansas City) - Monday, August 25


Batch Data Processing at Spotify with Luigi (Madison) - Tuesday, August 26


Data Governance in Big Data - Cloudera/Gazzang (Dublin) - Tuesday, August 26

North Carolina

Tresata on Omnichannel Marketing Analytics in Hadoop (Charlotte) - Wednesday, August 27

RTP - Big Data Developer Day (Durham) - Thursday, August 28


Apache Spark Lessons Learned (McLean) - Tuesday, August 26

New Jersey

Storm: Real-Time Big Data Stream Processing at WebMD (Hamilton Township) - Tuesday, August 26


Hadooping @ Prague (Prague) - Monday, August 25


Database as a Service (CouchDB, MongoDB, Cassandra, DB2, Hadoop) in the Cloud (Zurich) - Tuesday, August 26


PaaS and Big Data Tools (Melbourne) - Wednesday, August 27

HDInsight: MapReduce and Beyond (Melbourne) - Thursday, August 28


3rd Spark London Meetup (London) - Thursday, August 28


Apache Spark: In Memory Map-Reduce (Hyderabad) - Saturday, August 30


Spark Meetup (Hangzhou) - Sunday, August 31