Data Eng Weekly

Hadoop Weekly Issue #10

24 March 2013

Hadoop Summit Europe was this week (I mistakenly stated last week that the summit had already occurred, apologies!). Both related and unrelated to the summit, there were a number of releases, interesting technical posts and presentations, and new product and project announcements.

The Apache Hadoop (and related projects) community seems to be working really hard on all sorts of new and existing projects. In fact, this week's newsletter contains four newly open-sourced projects. It's really exciting to see all the effort that's going towards open-source.


Kafka 0.8 has several new features, including intra-cluster replication. As a result, deployment and configuration is slightly different from older versions. This tutorial covers building the source from trunk (and using Scala 2.9 if you prefer), starting the brokers and required services on a single instance, and capturing data through the system via console input and output.

Monash Research provides some context to the recent SQL-on-HDFS craze based upon the cardinal rules of DBMS development. The first rule being that "Developing a good DBMS requires 5-7 years and tens of millions of dollars." There is also some interesting discussion of data formats for HDFS.

An interview with Jay Parikh, VP of infrastructure engineering at Facebook, touches on Facebook's offline data processing infrastructure. Topics include Facebook's plans for going realtime, for contributing improvements to Hive, and for improving the speed of graph-based analysis.

Hortonworks published a two-part post this week about analyzing tweets using Apache Hive. The first part highlighted a tool for detecting Hive schemas from json documents and the second one demonstrates how to load the data into Hive and do simple analyses using the command line.

Hortonworks' "Stinger Initiative" is a plan/roadmap to make Apache Hive 100x faster. The initiative is a many-pronged approach, and one of the prongs is to improve Hive's query planner and executer. In this article, Hortonworks showcases two big improvements that have been committed to an unreleased version of Hive for some non-trivial queries. The first query, runs a join over many tables from a snow-flake schema in which the dimension tables fit in RAM and shows a 35x improvement. The second query is a join over two large fact tables, and they see a 45x speedup. This is one of the first steps towards reaching 100x speedup, but it's already quite impressive.

This article has a lot of details about setting up a Hadoop cluster on EC2, and it's the first full example I've seen of setting Ambari and HDP on AWS. Many of the instructions should be transferable to a non-AWS deploy, too.

The Call for Proposals is open for StrataConf / Hadoop World which takes place in New York in October. The CFP ends May 16th.

Funnel analysis is the process of analyzing user behavior during a multi-step flow during which some users fall out along the way. The datasets for a funnel can quickly become large, and MapReduce is a good fit for performing the analysis. The data folks at Etsy talk give a good overview of funnel analysis, how they uses MapReduce for it, and some tips and tricks they've learned along the way.

Hortonworks sandbox is a pre-configured VM with the Hortonworks Data Platform. It doesn't include baked-in support for R and R's MapReduce bridge, but this tutorial shows you how to configure the two in a few short steps.

Allen Wittenauer shares more details about the evolution of the Hadoop deployment at LinkedIn. His presentation covers their hardware, their move to the capacity scheduler, BCFG2 for config management, and much more.

announcements and releases

Hortonworks released a new version of their Sandbox, and this release contains updated tutorials and learning materials as well as Apache Ambari (for config management).

Microsoft has announced public availability of a preview of their Apache Hadoop service, Windows Azure HDInsight. During the preview period, Microsoft is offering a 50% discount.

Drawn-to-scale offers an SQL-on-HBase solution called Spire. Spire is modeled after Google's F1, which means it is a full SQL solution -- not just optimized analytics queries. This week, they announced that they've implemented a mongodb query-processor for spire, which is a drop-in replacement for mongodb. And with spire, you can gain the ability to make SQL queries over your mongo data and run MapReduce jobs over the underlying data stored in HBase.

WebHDFS, which provides a REST API to HDFS, is available in many Hadoop distributions. We've seen a number of implementations pop up for non-jvm languages, and the latest is a version for Perl and is available on CPAN. It supports both insecure and secure deploys.

Pydoop 0.9.0 was released with support for CDH 4.1.2 and Apache Hadoop 1.1.2. It also supports the new parcel format for Cloudera Manager.

Cloudera has introduced a new suite of machine learning tools called Cloudera ML. The tools are focusing on k-means clustering to start with, and include MapReduce jobs to evaluate the quality of the generated clusters on the full data set. Future works includes HCatalog/Hive integration, support for pivot tables, and ensemble classifiers.

Chronos is AirBnB's workflow and scheduling framework. They call it a replacement for cron, but it is much more than that -- it runs on Mesos, can execute arbitrary shell/bash, and also includes a UI for CRUD operations on jobs in the system. All operations are exposed via a REST api, and they have examples with curl from the command line.

Datastax has released version 3.0 of OpsCenter Enterprise, its management software for Cassandra. This version has a number of new features, such as visual cluster creation and management and DSE security and encryption. It also contains a number of bug fixes.

Voldemort 1.3 was released. Improvements in this release include a new storage layer, better read-only performance, and monitoring & admin APIs enhancements. New features include avro and kerberos support.

HDP 2.0 alpha 2 was announced. HDP 2.0 is built from the Apache Hadoop 2.0 branch, which means it is based upon Apache YARN. The alpha 2 release has two major new features -- the introduction of Apache Tez to the stack and improvements to Apache Hive (see link to Hortonworks blog post about some of these improvements above). Tez is not enabled by default, but there are instructions to enable it as part of the documentation guide.

White Elephant is a system for analyzing Job Tracker history logs to analyze cluster utilization. It can be used for schedule planning (to find times the cluster is under utilized), capacity planning, and billing. It includes scripts to upload logs, MapReduce jobs to analyze and aggregate the logs, and a server to view the aggregate statistics.

AMP (from UC Berkeley) has started "AMP Camp Big Data Mini Course" which is an online tutorial/course for setting up and running through some examples with the Hadoop stack. The course focuses on Spark, the distributed computation engine that supports data caching and is written in scala (Spark also has an SQL interface called Shark).

industry news

Concurrent, the makers of Cascading and Lingual, have raised 4 million in Series A funding and appointed a new CEO.

MapR, the maker of MapR Distribution for Hadoop (M3, M5 and M7) and the original proponent of Apache Drill, announced it that it has raised $30 million in additional funding. This brings MapR's total funding to $59 million.