Data Eng Weekly

Hadoop Weekly Issue #31

19 August 2013

Welcome to a special Monday issue of Hadoop Weekly. There's a ton of content in this issue -- details on Microsoft's new YARN framework, several new episodes of the All Things Hadoop podcast, and new releases from Hortonworks and Cloudera. This is one of the biggest episodes of the summer, so enjoy all of the excellent content!


I often joke that I only use Hadoop MapReduce to lots and lots of counting, so it's fun to see a blog post with a great title like "Using Hadoop to Explore Chaos." This post covers calculating the lyapunov exponent, which can be a a good measure of how chaotic a system is. To do this with MapReduce, each map task is given a portion of the n-dimensional space for which it calculates the lyapunov exponent across all parameters in its portion. The post covers how to do this with Pig (including links to source code) and also shows how to build a script for visualizing the results.

The Hortonworks blog features a post on running the Hortonworks Sandbox on OpenStack. The post covers the steps of converting the virtualbox sandbox image to one compatible with OpenStack, booting the sandbox, and accessing the sandbox inside of OpenStack.

Microsoft announced REEF (short for Retainable Evaluator Execution Framework), a new framework built on YARN. Details are a little sparse (REEF hasn't yet been released or open-sourced), but REEF appears to be aimed at iterative machine learning jobs, which have different requirements than traditional MapReduce applications. There are some more details on the components of REEF in the article below.

The All Things Hadoop podcast was busy this week with 4(!!) new posts (Episode #11 was actually last week, but I missed it). The new episodes feature interviews with K Young of Mortar (Episode #11) talking about Mortar's Apache Pig-based platform, Mark Grover and Roman Shaposhnik (Episode #12) talking about Apache BigTop (the integration framework for the hadoop stack), Camille Fournier (Episode #13) talking about Apache Zookeeper and applying open source community practice to non-open source development, and Alan Gates (Episode #14) discussing the Stinger initiative to make Hive 100x faster. As always, the podcasts are of high-quality and there's a recap of each on the links below.

The MSDN Blog is running a series focussed on Hadoop for .NET developers. The first 5 posts in the series are up, covering Hadoop, its architecture, configuring a .NET development environment, setting up an Azure Cluster, and obtaining sample datasets. The rest of the post will focus on HDFS and MapReduce, and a new series will cover higher-level frameworks like Hive and Pig.

Hadoop management systems like Apache Ambari and Cloudera Manager typically co-exist with a configuration management system that setup machine components and base software, such as java, ntp, lzo, etc. In this post, the author covers using Ansible to configure the base packages and setting up Cloudera Manager and Cloudera Search.

Avery Ching, of the data infrastructure graph processing team at Facebook, has posted a thorough analysis of Apache Giraph usage at Facebook. If your not familiar, Giraph is a processing framework that uses the bulk-synchronous parallel computing model, which makes it for graph computation problems. It runs inside of MapReduce or YARN. The post highlights a lot of interesting work that's been done to scale Giraph, from decreasing the memory footprint to implementing multithreading to sharding aggregation. The article shares some interesting performance numbers for 1 trillion node graphs.

Hortonworks has a tutorial on configuring HDFS's NFS support within the Hortonworks sandbox and then accessing the data over NFS from a Mac laptop. There are a few manual steps, but the end results is quite fantastic. I recall seeing and being amazed by MapR FS's NFS support for the first time a few years ago, and it's great to see NFS support finally coming to Apache HDFS. There are still some rough edges, but I'm sure that the management software will smooth them over pretty soon.

Hue 2.5 added a HBase browser, which supports table creation, data updates/inserts, and browsing. In the first tutorial in a three-part series, the Hue blog has an overview of creating a table, inserting data via MapReduce, and using the Hue HBase API to insert data into an existing table.

InfoWorld has a post from Platfora CEO Ben Werther on what problems the Platfora platform aims to solve better than a traditional data warehouse / data mart solution. The article doesn't really get into the details about Platfora's architecture until the second page, but the first page has a good overview of the flexibility that Platfora is offering as compared to a traditional system.

Hadoop is known for handling all kinds of unstructured data, but it can be a challenge to work with different types of data formats. XML is a very common machine interface format, and it is often a format for various datasets that make their way into Hadoop. In this post, the author shows how to prepare XML data for use in MapReduce, and how to use the xpath_string hive udf to parse xml data within Hive.

Qubole is a Hadoop-as-a-service platform that runs in Amazon EC2. It seems to focus predominantly on Hive -- the founders are ex-Facebook, where Hive was originally build and who is known for being an extremely heavy Hive user. But Qubole also offers ETL integration with a number of systems (e.g. S3, MySQL, Postgres, RedShift), and it has a few other differentiators. In this post, Christian Prokopp covers Qubole's offerings in depth. There are some really interesting things being done in the Hadoop-as-a-service area, and the competition should be really good for end users -- predominantly those just starting up.

One of the authors of a paper entitled "Hadoop’s Adolescence: An analysis of Hadoop usage in scientific workloads" that's appearing in VLDB 2013 has posted a summary of the paper. Using data collected from three hadoop clusters over several months, the authors looked at workloads (e.g. single job or chained jobs, which frameworks), tuning (i.e. if a job was rerun with performance-tuned params), and resource usage and sharing. The post highlights some interesting findings, such as the majority of jobs are a single MapReduce and that a lot of jobs are really short and run over small datasets.

Riot Games undertook a migration from a MySQL+Excel data warehouse to a Hadoop-based data warehouse featuring Platfora, Oozie, Hive and more. The article claims that the migration only took 3 months, which is really impressive. Included in the article are four tips for anyone deploying Hadoop, which all really hit home.

The Cloudera blog has a good overview of so-called 'short-circuit local reads' which improve performance of accessing data in HDFS when processes are collocated on the same host as a DataNode. The post is an easy to understand walkthrough of the evolution of the short-circuit read design, whose goal is to improve performance by avoiding the network overhead of TCP and the DataTransferProtocol.


The Cloudera blog has an interview with Tom White, who (among other things) founded the Apache Whirr project. For those that don't know, Whirr is a set of tools for spinning up, configuring, and running clustered systems (e.g. Hadoop or Zookeeper) in the cloud. It supports multiple different clouds providers, such as Amazon EC2 and Rackspace. In the post, Tom talks about how he got started with Hadoop in EC2, the early days of Whirr, and balancing quality of contributions to Whirr.

Hortonworks announced that it's reached a reseller agreement with CSC, the consulting firm. CSC is a relative new-comer to the Hadoop community, but it was also in the news last week for scooping up the Infochimps big data platform. Under the agreement, CSC can sell subscription support for Hortonworks HDP. It should be interesting to see how this helps Hortonworks grow and try to catch up to Cloudera.


Hortonworks has announced Hortonworks Data Platform v1.3 for Windows, which runs on Windows Server 2008 R2 and Windows Server 2012. The distribution includes new versions of several components, including: Hive 0.11, Flume 1.3.1, HBase, Mahout 0.7.0, and Zookeeper 3.4.5 and brings parity between the Windows and Linux distributions. The Linux version of HDP 1.3 was released in May, so it's taken a while to for the Windows HDP to catch up, but it's big news nonetheless.

Huawei announced the availability of the hindex project, which implements secondary indexing in HBase. The code, which is compatible with Apache HBase 0.94.8, is available on github. Features include multiple indexes on a table, bulk loading to indexed tables, and custom load balancer to co-locate index table regions with table regions. The implementation uses HBase co processors, and Huawei plans to integrate it into the Apache HBase project.

Rubydoop, which provides native access to Hadoop APIs via JRuby, was updated to version 1.1.0. This version includes support for YARN, performance improvements, and improved gem compatibility.

Cloudera Manager 4.6.3 was released to address a critical bug fix affecting performance. It's recommended for all users of Cloudera Manager 4.6.x.

Cloudera Search Beta 0.9.3 was released. It includes support for Kerberos authentication, a new version of morphlines/CDK, and Lily HBase indexing support. Search beta requires CDH 4.3 and Cloudera Manager 4.6.


Curated by Mortar Data ( )

Tuesday, August 20 HBase Meetup at Flurry in San Francisco San Francisco, CA

Wednesday, August 21 Pig vs. MapReduce: When, Why, and How (with Donald Miner) New York, NY

Thursday, August 22nd Hadoop for Data science ft. Donald Miner New York, NY