Data Eng Weekly

Hadoop Weekly Issue #36

22 September 2013

This week's newsletter contains a ton of new technical articles, covering everything from operations and deployment to data science on Hadoop to several projects from the ecosystem. It's one of the densest issues in the past several weeks -- I think the post-summer rush to Hadoop 2.x GA has started in ernest.


Ansible is a configuration management system that uses SSH (rather than installing remote agents) to configure a node-- a mechanism that's great for getting up and going quickly. The Hadoop Ansible Playbook configures Hadoop, HBase, Ganglia, Fluentd, ElasticSearch, and Kibana3. It's been tested on Ubuntu with Ansible 1.3, and it installs CDH 4.4.

LinkedIn's real-time processing framework, Samza, was accepted into the Apache Incubator. The LinkedIn engineering blog caps the moment by explaining the history of Samza, how it was built to complement Apache Kafka, and how it integrates with Hadoop YARN. There are also a number of useful links to documentation, code, and mailing lists.

Hortonworks and Dell published a reference configuration for the Hortonworks Data Platform (HDP) on Dell hardware (specifically the PowerEdge R720XD for Data Nodes). They have recommendations for testing, small clusters (6 nodes), and production clusters (14+nodes). The configuration seems to be rather high-end -- datanodes have a 2U chassis, 12-cores at 2.9Ghz, 24 SAS drives, and dual-10GbE. I haven't heard of anyone using dual 10GbE (single is becoming more common, though) or SAS drives (SATA is typical) due to cost constraints.

The Apache blog has a post on the Apache Sqoop architecture, focusing on the role of Connectors and Drivers in connecting to and talking with a database system. There article covers the architecture (and how it allows for database-specific optimizations) and includes a number of diagrams.

Donald Miner, author of MapReduce Design Patterns, recently spoke at the NYC Data Science Meetup about "Hadoop for Data Science." The Mortar blog has a recap, video, and slides for the talk. The talk covers Pig, Mahout, NLP tools, Collaborative filtering, R and Hadoop, and much more.

The Stockholm Hadoop User Group recently hosted Michael Hausenblas from MapR to talk about Apache Drill and David Whiting from Spotify to talk about Hadoop processing tools (Cascading, Crunch, Pig, and Hive). Michael's talk focuses on the motivation for Apache Drill, the architecture, extensibility, and API. David's talk covers the main frameworks, their pluses and minuses, and provides some example code.

The Hortonworks blog has an overview of HDFS snapshots. The post covers the features of snapshots (atomicity, flexibility, scalability), how to enable snapshots, how to take snapshots, and how to recover lost data from a snapshot. Snapshot data is kept in a special .snapshot directory, so recovery is just a matter of copying out of that special directory. HDP 1.3 and HDP 2.0 Beta include support for HDFS snapshots.

Episode #16 of the All Things Hadoop podcast features a chat with Jay Kreps of the LinkedIn SNA (Search, Network, and Analytics) team. Jay is tech lead on the team that has built out a bunch of LinkedIn's data architecture, including Voldemort, Azkaban, Apache Kafka, and Apache (incubating) Samza. There's a recap of the conversation at the link below -- most of the podcast is devoted to Kafka and Samza. In particular, I found the comparison that Jay does between Samza and Storm to be interesting and informative.

While not specifically about Hadoop, this post by Jay Kreps about distributed system reliability is a good read for anyone working on parts of the Hadoop ecosystem (or other distributed systems). He talks about some of the misconceptions in reliability (in particular assumptions about independence when dealing with failure), as well as how important operations are to reliability. He also talks about the rise of *-as-a-service products in distributed systems (we've recently seen a number related to MapReduce), and why these are exciting -- particularly if you're just getting started.

Chris Stucchio's post about when not to use Hadoop has made the rounds this week. The post has a number of interesting points, particularly around what kinds of data are suitable for Hadoop (spoiler: most aren't). Chris says that the cutoff for using Hadoop is around 5TB -- which is probably reasonable but not a steadfast rule (e.g. if you need multi tenancy or you're doing compute-heavy workloads). He also gives a shout-out to Scalding in case you really have to use Hadoop.

Apache Oozie supports EL (Expression Language) functions, which are somewhat akin to Hive or Pig UDFs. This post describes when to use EL functions (and when they're not appropriate), how to write them, and how to deploy them. The entire process is somewhat heavyweight (e.g. you have to restart the Oozie server to deploy the changes), but it allows you to implement logic that you couldn't otherwise in an Oozie definition.

Steve Loughran, co-creator of Hoya (a framework for running HBase on YARN), has written an article about coding and testing of Hoya. Hoya is getting support for Apache Accumulo, which requires some changes to make the project more generic. The post describes the necessary changes in-depth (and making comparisons between YARN and SmartFrog), some general information on YARN testing, and some notes on remote testing.

The Hortonworks blog has a post on Apache Tez, the second in a series. Tez uses a directed acyclic graph to describe a data processing workflow. In the Tez API, vertices describe user logic and edges describe data flow. A number of properties influence the actual execution of the DAG (e.g. how data is moved, scheduled, and data sources). These are outline in the post.


The New York Times has coverage on Apache Hadoop's trudge to a 2.x GA (planned for October). The article gives a high-level overview of the important changes in Hadoop 2.x, the history of the 2.x effort, and the history of Hadoop's early days. There's also some coverage of the Hadoop industry, from Hortonworks to MapR to Cloudera to IBM.

Qubole won the DataWeek award for innovation in Hadoop technologies. Qubole's Hadoop-as-a-Service platform runs on AWS using its own technology rather than Amazon's Elastic MapReduce. It has a GUI that supports Hive, Pig, Oozie, Sqoop, and more. The article notes that Qubole's customers processed over 2.5 PB of data in August.

WANdisco is trying to establish their brand/distribution as the best high-availability solution for HDFS via their "Non-Stop NameNode." They've announced that they've won an OEM agreement with Miaozhen Systems, a leading advertising company in China.

Spotify has grown their Hadoop cluster from 30 nodes to 690 nodes over the past 5 years. At 690 nodes, Spotify's cluster is consider Europe's largest, and they announced that they're turning to Hortonworks to help support it. Starting in October (presumably once Apache Hadoop 2.x is GA), they'll be starting a migration from Cloudera's distribution to Hortonworks'.

Infoworld's Bossie (Best of Open Source Software) awards included a section for big data tools, and a number of projects from the Hadoop ecosystem were recognized. The recognized projects include: Apache Hadoop, Apache Sqoop, Talend, Apache Giraph, Apache Hama, Cloudera Impala, Serengeti, and Apache Drill. The number of Apache Software Foundation projects is pretty striking.


Replephant is a new clojure library for analysis of JobTracker history. This post, which introduces the software, goes through some examples (e.g. finding distinct users, number of jobs per user), motivation, and an explanation of how it works. The post also shows how to use Incanter, an R clone written in Clojure, to visualize datasets generated by Replephant.

CDH 4.3.2 was released. It resolves HADOOP-9150, a potentially performance issue resulting from unnecessary DNS lookups.

Cloudera Manager 4.7.2 was released to fix an issue with parcels when there's a symlink in the parcel directory path.

Pentaho released version 5.0 of their Business Analytics Platform last week. The focus of this release seems to be that of "data blending" -- aka joining of data across disparate datasets (HDFS, NoSQL, etc) to build dashboards and integrate with BI tools. From what I can tell, they've built a SQL front-end to Pentaho that can translate SQL into on-demand ETL via database system-like views of the data.


Curated by Mortar Data ( )

Monday, September 23

Processing 1.4 Trillion Events in Hadoop: A real-world use case with comScore (New York, NY)

Tuesday, September 24

Meetup at Flurry in San Francisco (San Francisco)

Apache Lucene: Then and Now By Doug Cutting (Washington, DC)

Real time Query on Hadoop - Cloudera Impala (Hamilton Township, NJ)

Wednesday, September 25

Ambari User Group Meetup (Palo Alto, CA)

Hadoop Use Cases (Munich, DE)

Big Data Analytics for Video – Real time responses & ad hoc analytics. (San Francisco, CA)

Continuous Deployment with Cassandra: Treating C* As First-Class Code (San Francisco, CA)

Summingbird, Scala and Storm (Sam Ritchie, Twitter) (Cambridge, MA)

Nuiku + Whitepages (Seattle, WA)

Distributed Gradient Boosting Machine for Big Data (Mountain View, CA)

Thursday September 26

How Google Does Big Data (Atlanta, GA)

Hadoop at Spotify and Sanoma Media (Amsterdam, NL)

Solr + Hadoop = Big Data Search (New York, NY)

HBase Meetup at Arista Networks in San Francisco (San Francisco, CA)

Why put SQL on Hadoop?? (New York, NY)

Friday September 27

YARN with Apache Hadoop 2.x Beta & Applications on YARN (Mountain View, CA)

Saturday, September 28

Big Data Science Meetup Event (Fremont, CA)