05 January 2014
Welcome to the first issue of Hadoop Weekly in 2014. This issue features a few year-in-review and 2014-preview articles, plus a couple of posts tying that theme to Hadoop 2/YARN. There are also a number of interesting technical articles covering running Hadoop in Linux containers, HBase performance tuning, R and Hadoop, and more. On the release front, there are two interesting new projects to try out: SIMR, for running Spark on a MapReduce v1 cluster, and PigPen, for writing MapReduce jobs in Clojure (which are translated to Pig).
HUE, the front-end for several components in the Hadoop ecosystem, has added a Spark application. The app takes advantage of the Spark Server’s REST API to submit jobs and monitor status. The HUE blog has an introduction describing the functionality and required setup/configuration. There’s also a short video demoing the new application.
The Altiscale blog has a 2013 Hadoop-in-review post, which highlights key achievements of Hadoop 2.x and the exciting things in the queue (such as Apache Tez and YARN HA). In particular, I agree with the sentiment of backward-compatibility being an important and under-reported feature in the 2.2 release.
The O’Reilly blog has a post on the major new features in HDFS and YARN in Hadoop 2. First, there’s a tour of the changes for job execution: moving from the JobTracker and TaskTrackers to the ResourceManager, NodeManagers, and ApplicationMasters. Second, there’s a tour of the new HDFS features: NameNode HA, Snapshots, and Federation.
InfoQ has a guest blog post by Jon Natkins of WibiData on the Kiji framework. Kiji is a system for building so-called ‘entity-centric’ systems, e.g. recommendation engines, using HBase and MapReduce for data storage and batch computation. The article talks about the various components of the Kiji framework that help remove boilerplate and enable rapid development of such a system.
Tuning memory parameters for a Hadoop cluster can be a bit of an art. It typically starts with a somewhat arbitrary allocation and is adjusted as things fail or underperform. The MapR blog has an overview of tuning JVM heaps for MapReduce (there is some MapR-specific stuff but not too much). It includes several good rules to guide you in configuring memory usage.
Linux containers are a resource allocation/isolation mechanism available on modern Linux distributions. YARN can take advantage of Linux containers as does Cloudera Manager with CDH 4. In this post, Linux containers are used for a slightly different task—running a pseudo-distributed cluster on a Linux laptop. The post includes a brief overview of Linux containers, how to install/configure them on Ubuntu, and instructions for solving some of the issues that crop up with networking (since Hadoop is so sensitive to DNS and friends).
The Cloudera blog has a post by Chief Strategy Officer Mike Olson on MapReduce and Spark. The post motivates the need for a new framework to succeed MapReduce and explains why Cloudera is betting on Spark. The bet is interesting for a few reasons: Spark is written in Scala, a relatively new language that runs on the JVM; Cloudera seems to be backing Spark over Apache Tez, which solves the same kinds of problems (though the post doesn’t mention Tez by name); and a company, Databricks, has been founded to support Spark. The whole article is a good perspective from an industry leader on the future of distributed computation.
HBase committer and PMC member Lars Hofhansl has written about HBase performance profiling. The post covers four recently discovered and fixed issues, including a bug that generated extra disk seeks (fixing it yielded a 2-3x speedup). There are also some interesting observations about locking that apply to other programs running on the JVM.
There have been a number of integrations between Hadoop and R announced in the past year or so. Among them is the Oracle R Advanced Analytics for Hadoop (ORAAH). This post includes a walkthrough of computing covariance with ORAAH on a 6-node Hadoop cluster. It also includes an example of doing the same computation on the open-source rmr package, which takes about 4x as long as ORAAH.
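To make the computation in that walkthrough concrete, here is a minimal Python sketch of how sample covariance can be computed in a map/reduce style: each partition is reduced to a few partial sums, and the sums are combined at the end. This is not ORAAH's or rmr's implementation; the function names and the sample data are hypothetical and chosen only for illustration. The single-pass formula used here can also lose precision on large or poorly scaled data.

```python
from functools import reduce


def map_partition(pairs):
    """Map step: summarize one partition as (n, sum_x, sum_y, sum_xy)."""
    n = sum_x = sum_y = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
    return (n, sum_x, sum_y, sum_xy)


def combine(a, b):
    """Reduce step: partial sums from two partitions simply add up."""
    return tuple(u + v for u, v in zip(a, b))


def covariance(partitions):
    """Sample covariance of (x, y) pairs spread across partitions."""
    n, sum_x, sum_y, sum_xy = reduce(
        combine, (map_partition(p) for p in partitions)
    )
    if n < 2:
        raise ValueError("need at least two observations")
    # cov = (sum(xy) - sum(x) * sum(y) / n) / (n - 1)
    return (sum_xy - sum_x * sum_y / n) / (n - 1)


# Hypothetical data, split across two "nodes".
partitions = [[(1, 2), (2, 4)], [(3, 6)]]
result = covariance(partitions)
```

Because the partial sums are associative, the same `combine` function can serve as both a combiner and a reducer in a real job.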
If you or someone you know is looking to learn Hadoop, this is a collection of resources to get up to speed. It includes a number of documents pertaining to big data challenges, the basics of YARN/Hadoop 2 (the latest and greatest version of Hadoop), tutorials to get your feet wet, relevant blogs to follow, hardware best-practices, and much more.
TheServerSide.com has a post on the changes in, and lofty goals of, YARN/MapReduce 2.0. There’s a good analogy in the article comparing Hadoop to a layer cake (and how those layers changed between versions 1 and 2). The article also touches on the goal of making YARN into the data center operating system for distributed systems.
Slashdot has an article from LucidWorks CTO Grant Ingersoll on the virtuous relationship between Hadoop and search. Namely, it covers the birth of Hadoop from within the Nutch open-source search engine as well as the building momentum of search over data in HDFS via Apache Lucene and Solr.
A few weeks ago (it just appeared on my radar), Splice Machine and MapR announced a partnership to bring Splice Machine to MapR’s distribution. Splice Machine’s key product is a transactional SQL-on-Hadoop database that marries Apache Derby with HBase and Hadoop.
SiliconAngle has an interview with MapR CEO John Schroeder in which he outlines his predictions for big data in 2014. The take-aways include: SQL-on-Hadoop will keep growing (though it has probably been over-promised), Hadoop will become more operational, and search will continue to gain steam for unstructured query.
PigPen is a new Clojure framework for Hadoop MapReduce. PigPen takes an interesting approach vs. other frameworks—it translates Clojure to Pig Latin, which is executed by the Apache Pig framework. The introductory blog post walks through the features and motivation for PigPen, including composition/code reuse, unit tests, closures, and more. The post also has details on some of the optimizations and why it’s built on Pig. The documentation on the Netflix github page is also pretty fantastic—it has a tutorial and introductions for Clojure and Pig users.
Spark in MapReduce (SIMR) is a new project to run Spark on a MapReduce v1 cluster. It was developed by Databricks and the Berkeley AMPLab to enable proof-of-concept work and evaluation of Spark for folks not yet running Hadoop 2.x (YARN). The system works by launching a number of map tasks that run Spark daemons (a leader is elected by generating a unique file in HDFS). Getting up and running only takes a few commands, and there are artifacts for HDP 1.x, CDH3, and CDH4 on the SIMR github page.
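The file-based leader election mentioned above can be illustrated with a minimal Python sketch. This is not SIMR's code: it assumes a local directory as a stand-in for HDFS, and the names `try_become_leader`, `leader.lock`, and the task IDs are hypothetical. The idea is only that file creation is atomic, so exactly one candidate succeeds in creating the file and becomes the leader.

```python
import os
import tempfile


def try_become_leader(lock_path, candidate_id):
    """Return True if this candidate created the lock file first."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: only one caller succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(candidate_id)  # record who won so others can find the leader
    return True


def current_leader(lock_path):
    """Read the ID of the candidate that won the election."""
    with open(lock_path) as f:
        return f.read()


def demo():
    """Run three hypothetical map tasks against one shared directory."""
    with tempfile.TemporaryDirectory() as shared_dir:
        lock_path = os.path.join(shared_dir, "leader.lock")
        candidates = ["task-0", "task-1", "task-2"]
        winners = [c for c in candidates if try_become_leader(lock_path, c)]
        return winners, current_leader(lock_path)


winners, leader = demo()
```

In this sequential demo the first task always wins. With truly concurrent tasks, followers would also need to wait until the leader has finished writing its ID before reading the file.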
MapR has announced availability of Hive 0.12 (which includes Stinger Phase 2) for its distribution. The build also includes security enhancements and some back-ported fixes from Hive trunk (code is available on github). Interestingly, unlike other Hadoop distributions that tie all Hadoop projects tightly together into a giant release, MapR is providing the update to Hive without requiring updates of other parts of their distribution.
Apache HBase 0.94.15 was released. It resolves 30 issues, including a number of performance fixes. Users of 0.92.x and 0.94.x can do a rolling upgrade.
Curated by Mortar Data (http://www.mortardata.com)
Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez (Washington, D.C.)
Moving a Big Elephant! (Hyderabad, India)
Impala: A Modern, Open-Source SQL Engine for Hadoop (Toronto, ON)
Beyond Hadoop, especially Spark (Bangalore, India)
Houston Hadoop Meetup Series (Houston, TX)