Data Eng Weekly

Hadoop Weekly Issue #44

17 November 2013

The deadline for submitting an abstract for Hadoop Summit Europe is this Friday, so get your talks ready. This week brought a lot of releases and announcements from both community and commercial projects, and Udacity launched a new free class on Hadoop and MapReduce. In addition to all the new stuff, there were several high-quality technical articles.


Cloud backup service Backblaze has 75 petabytes of raw storage spread across more than 25,000 hard drives. Their blog shares analysis of hard drive failure and lifetime across their infrastructure (which uses consumer-grade hard drives). Since HDFS clusters are typically formed out of similar components, there's a lot of interesting information for Hadoop operators. One particularly interesting observation is the bathtub curve for failure rates -- failure rates decrease over the first 18 months, plateau for a while, then rise again around the three-year mark.

Nick Dimiduk, co-author of "HBase in Action," has written a two-part blog post on the interface between Hive and HBase. In part one, he covers the state of affairs in the Hive-HBase integration, including schema mapping, classpath configuration, and the known bugs and unsupported features. He concludes that the integration is immature, but that there's a lot of low-hanging fruit that could rapidly improve it. Part two is a walkthrough of loading data into Hive, populating an HBase table from Hive, and inspecting the HBase data both in Hive and from the HBase shell.

The authors of BinaryPig, a framework for doing malware analysis with Apache Pig, have written a post about their software on the Cloudera blog. The post describes how they serialize malware samples to sequence files stored in HDFS, use the distributed cache to distribute analytical scripts, utilize existing malware analysis tools such as various forms of hashing, and load results into ElasticSearch for manual exploration and automated extraction.

I often find that working through a concrete example really helps with understanding a new concept. A new post on Dr. Dobb's by Michael Hausenblas of MapR is a great introduction to, and example of, the Lambda Architecture (the LA is a generalization of common patterns for combining batch and real-time computation to develop web-scale data-driven applications). The walkthrough covers designing a system for the fictional "UberSocialNet" (USN). The tutorial uses Hadoop, Hive, HBase, and Python -- and the code for the project is available on GitHub.
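To make the batch-plus-real-time idea concrete, here is a minimal sketch in plain Python. It is not the code from the Dr. Dobb's tutorial: the event shape and the names `build_batch_view`, `SpeedLayer`, and `query` are hypothetical, and in-memory counters stand in for the Hadoop/Hive batch layer and the HBase serving layer.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all archived events."""
    return Counter(event["user"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["user"]] += 1


def query(user, batch_view, realtime_view):
    """Serving layer: merge the precomputed batch view with the real-time delta."""
    return batch_view[user] + realtime_view[user]


# Hypothetical events: three already archived, one that arrived after the batch run.
archived = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
batch_view = build_batch_view(archived)

speed = SpeedLayer()
speed.ingest({"user": "alice"})

alice_total = query("alice", batch_view, speed.realtime_view)
```

When the next batch run completes, the real-time view would be reset, since the archived data then covers those events.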

Jeff Magnusson, Manager of Data Platform Architecture at Netflix, gave a talk at QCon SF on the Netflix Big Data Platform as a Service. Netflix's data infrastructure uses Amazon Web Services. The slides cover some of the advantages of this architecture (e.g. ad hoc clusters to supplement long-running ones) and the tools that Netflix has built to support their system. In particular, it covers Genie - the open-source Hadoop PaaS, Franklin - the metadata API abstraction, Forklift - a system for moving data between platforms (e.g. DBMS and S3), and more.

InfoQ has an interview with Eva Andreasson of Cloudera. The interview covers Cloudera, the popularity of Hadoop, and several components of the Hadoop ecosystem - MapReduce, HDFS, HBase, and more. The conversation doesn't assume any Hadoop background, so it's a good primer for someone getting started. The InfoQ site has both a video and a transcript of the interview.

The data chef blog has a post describing how to model linear regression, which can be hard to compute in parallel, as a minimization problem. With this formulation, the problem can be solved with gradient descent. The post covers how to implement this algorithm in Pig using Pig macros, UDFs, and a driver program written in Python. The post has detailed explanations of each portion of the implementation and includes illustrations to drive home the details.
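As a rough illustration of the formulation, the following single-machine sketch fits a one-variable linear model by batch gradient descent on mean squared error. It is not the post's Pig implementation; the function name, learning rate, and iteration count are assumptions chosen for the example.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Each gradient is a sum of independent per-record terms, which is
        # the part that can be computed in parallel and then aggregated.
        grad_w = 2.0 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = 2.0 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_linear_regression(xs, ys)
```

In a distributed setting, each iteration would map over the records to compute the per-record terms and reduce them into the two sums, with a driver program applying the update and deciding when to stop.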

The MapR blog has a post about choosing the right algorithms. The post is a follow-up to a talk given by MapR Chief Application Architect Ted Dunning, and it expands on some of the points made in the slides. Specifically, a practical algorithm needs to be deployable, robust, transparent, skillset and mindset matched, and proportionate. The post elaborates on each of these features, providing guidance for building practical, scalable machine learning algorithms.


The deadline for submitting abstracts for Hadoop Summit Europe 2014 is this Friday, November 22nd. The conference takes place in Amsterdam on April 2-3, 2014.

Twill is a new Apache incubator project that provides an abstraction over Hadoop YARN. Twill is the successor to the Weave project that was started by Continuuity. Twill/Weave is the foundation of some production software at Continuuity, so it is a proven framework.

Pivotal announced Pivotal One, which builds upon Pivotal CF (Pivotal CF is an enterprise version of Cloud Foundry, the open-source cloud computing PaaS). Most notably for readers of this newsletter, Pivotal CF now supports Pivotal's Hadoop distribution, Pivotal HD (including an integration with their SQL-on-Hadoop system, HAWQ).

Online-learning website Udacity has launched a new "Data Science and Big Data Track." The first course offered in this track is an "Introduction to Hadoop and MapReduce," which is available for free in self-paced mode. Udacity partnered with Cloudera to write the curriculum and content for the course, and the course is instructed by two members of Cloudera's Educational Services team.

Oracle has announced its new Big Data Appliance X4-2, which includes Cloudera Enterprise and provides capacity of up to 864 TB. The appliance combines Oracle systems with Impala and Cloudera Search, providing a wide range of data processing options. Oracle also announced its support for the Apache Sentry project, whose aim is to provide fine-grained authorization of data stored in Hadoop.


Scala-cassandra is a new project that wraps the Java CQL driver and presents a Scala interface.

Spring for Apache Hadoop 1.0.2 GA was released. This version supports both Apache Hadoop 1.2.1 and 2.2.0 as well as Cloudera CDH 4.3.1, Hortonworks HDP 1.3, and Pivotal HD 1.1. The second milestone release of the 2.x branch was also announced.

Version 2.1.2 of snakebite, the Python library for HDFS, was released. This version adds support for automatically loading configuration from hdfs-site.xml.

Oryx is a new open-source project for real-time machine learning. It has implementations of alternating least squares for collaborative filtering, random decision forests for classification, and k-means++ for clustering. Its distributed computation layer is implemented atop MapReduce. It also provides a serving layer, which exposes a REST API for serving up data and performing updates to the models in real-time.

WANdisco announced version 1.5 of Non-Stop Hadoop for Hortonworks. The software improves support for synchronizing data between Hadoop clusters in multiple datacenters. Version 1.5 adds support for Apache Ambari, HDP 2.0, and more.

Hoya, the project for running HBase on YARN, released version 0.6.2 this week. The release is built against Apache Hadoop 2.2.0 and HBase 0.96.0. New features include support for multiple masters and role history. Role history allows Hoya to reuse the same underlying server for each region after a cluster shutdown/startup, which improves cluster spin-up time.

Syncsort announced an integration with Amazon Web Services' Elastic MapReduce (EMR). Syncsort is offering its ETL tool, Ironcluster, as an EMR add-on in the Amazon Marketplace. Ironcluster on EMR is free for up to 10 nodes, and Syncsort has online documentation providing examples and templates.


Curated by Mortar Data

Monday, November 18

Hadoop Machine Learning Project (Austin, TX)

Tuesday, November 19

November Hive Contributors Meeting (Palo Alto, CA)

ACUG November Meeting - Hadoop In The Cloud (Austin, TX)

Programming Hadoop: MapReduce using Python and an Intro to Pig (Madison, WI)

Big Data is a Big Deal - Running Hadoop In The Cloud (Mountain View, CA)

St. Louis Hadoop Users Meetup (St. Louis, MO)

Big Data Warehouse - Hive, Hive2, Impala+Parquet Tables

Wednesday, November 20

Big Data Festival KC (Leawood, KS)

Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale, CA)

Hadoop 101 (Boulder, CO)

November Meetup (Pittsburgh, PA)

November SF Hadoop Users Meetup (San Francisco, CA)

New Data Structures for BIG Data + Map Reduce Algorithms (San Ramon, CA)

Hadoop Machine Learning Project (Austin, TX)

Hadoop 2 is here! (Toronto, Ontario)

Realtime data analytics at Datadog (New York, NY)

Thursday, November 21

Hadoop, MapReduce OSP-Con (Moscow, Russia)

Hadoop Ecosystem & Use Cases (München, Germany)

Introduction to Spark (Cambridge, MA)

Saturday, November 23

Big Data Science Meetup Event (Fremont, CA)