Data Eng Weekly

Hadoop Weekly Issue #105

25 January 2015

It was a busy week of announcements and releases—Google and Hortonworks announced a new integration for HDP on Google Compute Engine, Google and Cloudera announced a joint project to bring a Spark backend to the Google Dataflow SDK, Netflix announced a new open-source project, and Apache Flink 0.8.0 was released. In addition, there are articles on machine learning from PayPal and Databricks as well as several other high-quality posts on Kafka, HBase, and more.

Technical

PayPal has a post on how they’re training Restricted Boltzmann Machines to build Deep Belief Networks. While many folks are using GPUs to speed up these types of computations, PayPal was looking for a way to make use of existing Hadoop infrastructure. They’ve implemented an adaptation of IterativeReduce running on YARN (Hadoop 2.4.1). The post has a thorough overview of how they use the YARN APIs to build their system, and they show good results from an evaluation of the implementation.

Apache Flink is a large-scale, in-memory processing and data streaming framework that’s compatible with Hadoop. This presentation gives an overview of the API (which resembles Spark and Scalding and includes streaming and graph APIs), compatibility with the Hadoop ecosystem (Mappers and Reducers can be used unmodified), the included visualization tool, the runtime (and how it compares to Spark), and the project roadmap (improvements to fault tolerance, streaming fault tolerance, a backend for Hive, and more).

Making the leap from running a Hadoop job in-memory to running it on a cluster (even a pseudo-distributed one) can be frustrating as you battle configuration, setup, etc. This post suggests using the Kiji Bento Box, which sets up a local Hadoop cluster and configures all the proper environment variables for interacting with it.

The ingest tips blog has some guidance for using Kafka to ship large messages. The post has several suggestions for avoiding large messages, as well as advice for how to configure Kafka to handle large messages if the other suggestions aren’t feasible.
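
If large messages can’t be avoided, the tuning generally involves aligning the broker, replica, and consumer fetch limits. A hedged sketch of the relevant 0.8-era settings (the 10 MB value is an arbitrary example, not a recommendation):

```properties
# broker (server.properties): largest message the broker will accept
message.max.bytes=10485760

# broker: replica fetchers must be able to copy the largest message,
# so keep this >= message.max.bytes or replication can stall
replica.fetch.max.bytes=10485760

# consumer: must be >= the largest message or consumption will fail
fetch.message.max.bytes=10485760
```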

The Databricks blog has a post about the implementation of Random Forests and Gradient-Boosted Trees in Spark 1.2’s MLlib. The post gives a high-level overview of how decision trees work and how MLlib distributes the computation. It then provides some code snippets to provide an introduction to the API and shows several scalability results based on evaluating a dataset in AWS EC2.
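
MLlib’s actual API requires a Spark cluster, but the ensemble mechanics the post describes (train each tree on a bootstrap sample of the data, then combine trees by majority vote) can be sketched in plain Python. All names here are illustrative, not MLlib’s API, and the “trees” are single-split stumps:

```python
import random

def train_stump(sample):
    """Pick the threshold that best separates the labeled sample."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum(1 for x, y in sample if (x > threshold) != y)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def train_forest(data, num_trees=25, seed=42):
    """Each 'tree' (a stump here) is trained on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(num_trees):
        sample = [rng.choice(data) for _ in data]
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    """Majority vote across the ensemble."""
    votes = sum(1 for t in forest if x > t)
    return votes * 2 > len(forest)

# toy data: the label is True exactly when the feature exceeds 5
data = [(x, x > 5) for x in range(11)]
forest = train_forest(data)
```

The bootstrap sampling is what de-correlates the trees; MLlib additionally subsamples features per split, which is omitted here for brevity.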

MapR has a new Whiteboard Walkthrough (both a video and a transcript) about HBase key design. OpenTSDB’s schema is used as an example, and the presentation discusses things like sequential vs. random keys and the importance of knowing the data access patterns.
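
OpenTSDB’s row key packs a metric ID, an hour-aligned base timestamp, and tag IDs into a byte string so that all points for one metric in one hour sort together. A rough sketch of the idea (the real schema uses 3-byte UIDs; 4-byte fields are used here to keep the packing simple):

```python
import struct

def row_key(metric_uid, timestamp, tag_uids):
    """Build an OpenTSDB-style HBase row key: metric, base hour, tags.

    Leading with the metric keeps one metric's rows contiguous; aligning
    the timestamp to the hour puts all of an hour's data points in one
    row (column qualifiers then hold the per-second offsets).
    """
    base_ts = timestamp - (timestamp % 3600)  # hour-align
    key = struct.pack(">I", metric_uid) + struct.pack(">I", base_ts)
    for name_uid, value_uid in sorted(tag_uids.items()):
        key += struct.pack(">I", name_uid) + struct.pack(">I", value_uid)
    return key

k1 = row_key(1, 1422230400, {7: 9})  # metric 1, tag 7=9
k2 = row_key(1, 1422230461, {7: 9})  # 61 seconds later, same row
```

Keys like this are sequential within a metric but spread across metrics, which is exactly the sequential-vs.-random trade-off the walkthrough discusses.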

The Hue blog has a post on making Hue highly available by running multiple instances of the Hue application behind a load balancer. The tutorial walks through the requirements (an HA database backend and nginx/haproxy installed) and describes how to enable the load balancer (which runs via supervisord).
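
For the nginx variant, a minimal upstream along the lines the tutorial describes might look like this (hostnames and ports are placeholders; the Hue post has the full config, including static-file handling):

```nginx
upstream hue {
    # sticky sessions keep a user's requests on one Hue instance
    ip_hash;
    server hue1.example.com:8888;
    server hue2.example.com:8888;
}

server {
    listen 80;
    location / {
        proxy_pass http://hue;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```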

Sematext offers a monitoring solution for Spark as part of their Performance Monitoring (SPM) product. This post by a customer describes how to integrate the monitoring with Spark and gives an example of a production issue they solved with the help of SPM. Given that Spark is still relatively young, it’s good to see more solutions for monitoring and debugging helping folks become more productive.

Hadoop’s cost advantage is based on using commodity hardware, including commodity hard drives. Not all hard drives are the same, though, and failure can be expensive and time consuming. Cloud-backup provider Backblaze has posted a new analysis of hard drive failure rates based on their experience with many different kinds of disks. They look at drives from HGST, Seagate, Toshiba, and Western Digital.
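
Backblaze reports failures as an annualized failure rate: failures divided by cumulative drive-years of service. The arithmetic is simple (the fleet numbers below are made up, not Backblaze’s):

```python
def annualized_failure_rate(failures, drive_days):
    """AFR (%) = failures per cumulative drive-year of service, times 100."""
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# hypothetical fleet: 1,000 drives observed for 90 days, 5 failures
afr = annualized_failure_rate(5, 1000 * 90)
print(round(afr, 2))  # → 2.03
```

Normalizing by drive-days rather than drive count matters because drive models enter and leave the fleet at different times.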

The Hortonworks blog has a post about the past, present, and future of HBase High Availability. The majority of the post looks at the recently added Timeline-Consistent Region Replicas, which provide a read-only version of the data in case of a region failure. Combined with best practices, this feature allows for 99.99% availability, although clients must decide if they need strict consistency (i.e. can only query the primary region) or if they can accept stale data. Looking ahead, the HBase team is working on write-availability during failures and cross-datacenter consistency.

A lot of presentations and posts on event-processing are either very high-level and theoretical or about the low-level details of a particular technology. This talk centers around a few technologies (Kafka and Avro), but it strikes a good balance of theory and practice. There are several important details and ideas related to building an event-based processing system.

News

The Register has a look at Hadoop-as-a-Service vendor Qubole. The post describes Qubole’s platform (and its differentiators), gives some stats about how it’s being used (processing around 86PB/month), and describes a bit about its customers/demand (planning to be on Azure marketplace soon, many folks are using log data stored on S3).

In May, Hortonworks acquired Hadoop security company XA Secure and shortly thereafter Cloudera acquired Gazzang. Gigaom Research has a post looking at what Hortonworks and Cloudera have done with their acquisitions—which parts are free, open-source, or remain proprietary. It also looks at the Apache projects related to Hadoop security—Sentry and Ranger, which have some overlapping goals.

RCRWireless News has an interview with Xplenty CTO Saggi Neumann discussing several predictions for Hadoop in 2015. Among the predictions: Spark will take off, there will be increased competition among Hadoop vendors, Hadoop will transition to the cloud, and companies trying to deploy Hadoop will continue to see a shortage of qualified candidates.

This post has advice for companies trying to build a team for a Hadoop deployment: which roles and “tiers” of employees to hire. A lot of folks approach Hadoop without a full understanding of all the roles that need to be filled, and this can lead to under-estimating the amount of resources (and work) needed to be successful.

Spark Summit East is in New York on March 18th and 19th. The agenda, which covers three tracks (Developer, Applications, and Data Science), has been posted.

“Advanced Analytics with Apache Spark” is an upcoming book by several members of the Cloudera Data Science team. The book is currently in early release from O’Reilly Media. The Cloudera blog has an interview with the authors about the goals of the book, the intended audience, and more.

Google and Hortonworks announced this week that Hortonworks HDP 2.2 is now available on the Google Cloud Platform. The integration uses Google’s bdutil to build a cluster that’s provisioned via Apache Ambari.

Pachyderm is a new startup from the Y Combinator Winter 2015 class which is building an alternative implementation of Hadoop. Their distributed file system and MapReduce framework is open-source, and makes heavy use of HTTP. The company and software are still in early stages but are worth keeping an eye on.

Revolution Analytics, makers of the RHadoop packages, announced that they’re being acquired by Microsoft. The announcement recognizes Microsoft’s recent embrace of open-source, which includes Linux on Azure and support for Hadoop via Azure HDInsight.

Releases

Pivotal has released version 1.4 of GemFire XD, its distributed in-memory database. The new version includes support for a JSON data type, persistence of data in Hadoop, and more.

Google’s Cloud Dataflow is a system for distributed processing that combines batch and stream processing. While Google offers an implementation for its own backend stack, the SDK is open-source and amenable to additional backend implementations. Google and Cloudera have collaborated on a new backend for the Dataflow SDK that executes via Spark. The project is newly incubating in Cloudera Labs.

Netflix has started a new open-source project called Surus, which will provide a number of UDFs for Pig and Hive. The first of these UDFs (they plan to add more over the coming year) is for scoring predictive models in Pig using the Predictive Modeling Markup Language. The post has an example of building a model in R and evaluating the model on billions of rows using Pig.

In other release news, version 1.3 of a menu bar helper for Mac OS X for viewing jobs in the JobTracker/Resource Manager (including notifications of started/completed/failed jobs) was released this week with support for CDH5/YARN.

Apache Flink released version 0.8.0 this week. The new release includes a new Scala API, adds new windowing semantics to Flink Streaming, and includes many performance and usability improvements.


Curated by Mortar Data


Events


California

Operating in a Multi–Execution Engine Hadoop Environment (Santa Monica) - Tuesday, January 27

Apache Flink: Fast and Reliable Large-Scale Data Processing (Palo Alto) - Wednesday, January 28

Next-Generation Access Control for Hadoop, HBase and other NoSQL Databases (Fremont) - Thursday, January 29

Spark + Cassandra (Santa Clara) - Thursday, January 29


Colorado

Advanced Data Storage Technologies: AVRO and Parquet (Broomfield) - Wednesday, January 28


Oklahoma

January MySQL Meetup: Big Data (Oklahoma City) - Wednesday, January 28


Minnesota

Analytics with MapR and Hadoop (Saint Louis Park) - Thursday, January 29


Florida

Design Patterns for Storm and Kafka (Saint Petersburg) - Wednesday, January 28

North Carolina

Creating a Next-Generation Big Data Architecture (Charlotte) - Wednesday, January 28

District of Columbia

“Sparkling Visualizations”: Data Viz Solutions + Spark (Washington) - Wednesday, January 28


Maryland

Introduction to Big Data Techniques for Cybersecurity (Rockville) - Tuesday, January 27


Israel

Datameer 5.0: Hadoop Like Never Before (Tel Aviv-Yafo) - Monday, January 26


Denmark

Hadoop: What Is It... Why Does It Matter? (Aarhus) - Tuesday, January 27


United Arab Emirates

Real-Time Insights from Big Data (Dubai) - Tuesday, January 27


Poland

An Introduction to Scala and Hadoop (Zielona Góra) - Wednesday, January 28


Germany

Flink Community Updates & New Features: SQL-Style Queries, Akka, Hadoop Compatibility (Berlin) - Wednesday, January 28


France

Haven, Flink, Hadoop Use Case (Paris) - Thursday, January 29


Spain

Cassandra & Java + Cassandra & Spark for the Internet of Things (Barcelona) - Thursday, January 29


India

Recommendation 2.0 (Bangalore) - Saturday, January 31