Data Eng Weekly

Hadoop Weekly Issue #55

02 February 2014

There were a lot of big announcements this week, including the release of MapR 3.1 and Altiscale’s public launch. There were also a number of articles about the promise of Hadoop 2.0/YARN; hopefully folks will start sharing their production success stories soon, too. Next Big Sound has also shared a detailed look at their big data architecture, one of several interesting technical reads. Enjoy!


Eric Czech, Chief Architect of Next Big Sound, has written about the evolution of their big data infrastructure. The post is full of interesting information, from their network topology and colo provider to how they store and version data in HDFS, analyze it with Pig, and serve data stored in HBase using Finagle.

BeyeNetwork has a post on an often overlooked differentiator between SQL-on-Hadoop systems and proprietary database systems. In proprietary systems, the query engine, storage format, and file system are typically tightly coupled. Conversely, many of the SQL-on-Hadoop systems use the Hive Metastore to find data on HDFS and to discover storage formats. Since the storage layer and storage formats are decoupled from the query engines, it’s easy to switch between, e.g., Impala, Hive, and Presto.
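
As a rough illustration of that decoupling, the sketch below models a shared catalog in plain Python. The names (TableInfo, Metastore, plan_scan), the table, and the HDFS path are all hypothetical, and this is not the Hive Metastore API; the point is only that each engine resolves a table name through the same catalog to learn where the data lives and how it is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """What a catalog records about a table (hypothetical, simplified)."""
    location: str        # directory on HDFS holding the table's files
    storage_format: str  # e.g. "rcfile" or "text"


class Metastore:
    """Toy stand-in for a shared catalog; not the real Hive Metastore API."""

    def __init__(self):
        self._tables = {}

    def register(self, name, location, storage_format):
        self._tables[name] = TableInfo(location, storage_format)

    def lookup(self, name):
        return self._tables[name]


def plan_scan(engine, metastore, table):
    """Describe how a given engine would find and read a table."""
    info = metastore.lookup(table)
    return f"{engine}: scan {info.location} as {info.storage_format}"


catalog = Metastore()
# Hypothetical table and path, registered once.
catalog.register("clicks", "hdfs:///warehouse/clicks", "rcfile")

# Three different engines, one catalog entry, same files on disk.
plans = [plan_scan(e, catalog, "clicks") for e in ("hive", "impala", "presto")]
for plan in plans:
    print(plan)
```

Because the location and format live in the catalog rather than inside any one engine, swapping the engine changes nothing about the stored data.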

The Cloudera blog has a guest post by Christian Javet on configuring a four-node virtualized Hadoop cluster using VirtualBox running CentOS 6.5 and Cloudera Manager. The post walks through the various networking and SSH settings required to configure the base image and how to install and access Cloudera Manager, Hue, and more.

The Hortonworks blog has a post about their partnership with Sqrrl, the company offering commercial support of Apache Accumulo. The post talks about some of the applications that have been built atop Sqrrl Enterprise and HDP, the reference architecture for HDP and Sqrrl Enterprise, and some of the features of Sqrrl Enterprise not found in Apache Accumulo.

MapR supports five different SQL-on-Hadoop technologies as part of their distribution. Given the large number of possible solutions, MapR has posted a detailed comparison to help users choose the best technology for their problem. The page covers Hive, Drill, Impala, Presto, and Shark across categories like SQL completeness and UDF support.


Concurrent CEO Gary Nakamura has written an article for Wired with the provocative title “Why Hadoop Only Solves a Third of Growing Pains for Big Data.” Gary explains how most companies get stuck in the “dark valley of Hadoop”: they have lots of data in Hadoop, but finding answers requires time-intensive implementation. He then goes on to describe ways to get out of the dark valley, walking through a number of ecosystem libraries and tools (and their shortcomings).

The Call for Papers for HBaseCon 2014 ends in under two weeks. The Cloudera blog has a post with some tips for writing a good proposal, based upon advice from the HBaseCon Programming Committee. Some of the tips are HBase-specific, but there are good ideas for any conference proposal that you might be writing.

Datanami has a post with five reasons to use Hadoop and five not to. The pro-Hadoop reasons are ones that most of us have heard before, such as “Your Data Sets are really big.” Likewise, the con-Hadoop reasons are also ones that have been thrown around for years, such as “You Need Answer in a Hurry.” What’s interesting and promising, though, is that while many of the con-Hadoop reasons have been true for a while, 2014 (or perhaps soon after) will see major improvements or resolutions in these areas.

Concurrent, developers of Cascading, have announced that they’re joining the Rackspace Partner Network. Based upon the press release, it sounds like the goal is to make it easier to deploy Cascading applications within the Rackspace cloud.

Altiscale announced their public launch this week. The company’s platform, previously in private beta, is a Hadoop-as-a-Service offering. Unlike other HaaS vendors (many of which run atop cloud systems like AWS), Altiscale runs and operates “auto-elastic” Hadoop clusters in their own data center. Customers are charged based upon the HDFS storage used and MapReduce resources consumed. Altiscale was founded by, and employs, a number of ex-Yahoo employees who have experience with large-scale Hadoop deployments.

Syncsort has announced that their DMX-h ETL Edition software is now certified to run on HDP 2.0 with YARN. The software, which focuses on high-performance ETL from any source (including mainframes), is also available for evaluation in the Hortonworks Sandbox. A guest post on the Hortonworks blog has more details about why Syncsort is excited about YARN and HDP 2.0.

There seems to be quite a bit of momentum growing behind Hadoop 2.0 and YARN. An article in SDTimes explores the features and improvements that are attracting new organizations to Hadoop. At the top of the list are the highly-available NameNode, NameNode federation, and the new possibilities created by YARN.


Cloudera has released Cloudera Manager 4.8.1. In addition to resolving several issues, the release adds the ability to distribute Apache Spark (incubating) via parcels on a CDH 4 cluster.

MapR has released version 3.1 of their distribution. The new version includes an improved installation process, improved security (including role-based access control and wire-level security), performance enhancements to MapReduce, and much more.

Berkeley’s AMPLab has announced a new project called SparkR. SparkR provides a backend for the R statistics software to execute on a Spark cluster. The introductory blog post gives an example of computing a logistic regression in parallel via Spark from data stored in a file in HDFS.
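
To make the idea of a parallel logistic regression concrete, here is a minimal pure-Python sketch of the general pattern: each partition computes a partial gradient (the map step), and the partials are summed (the reduce step) before the weights are updated. It uses neither SparkR nor Spark. The partitions, toy data, learning rate, and step count are made-up assumptions, and the code in the SparkR post may well differ.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(partition, weights):
    """Map step: gradient of the log loss over one partition's rows."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for i, x in enumerate(features):
            grad[i] += (pred - label) * x
    return grad


def add_vectors(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [x + y for x, y in zip(a, b)]


def train(partitions, n_features, steps=500, learning_rate=0.1):
    """Plain gradient descent; step count and rate are arbitrary choices."""
    weights = [0.0] * n_features
    for _ in range(steps):
        partials = [partial_gradient(p, weights) for p in partitions]
        gradient = reduce(add_vectors, partials)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


def predict(weights, features):
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


# Toy data as (features, label); the first feature is a constant bias term.
partitions = [
    [([1.0, 0.0], 0), ([1.0, 1.0], 0)],
    [([1.0, 3.0], 1), ([1.0, 4.0], 1)],
]
weights = train(partitions, n_features=2)
print(predict(weights, [1.0, 0.0]), predict(weights, [1.0, 4.0]))
```

On a real cluster the list comprehension over partitions would be replaced by the framework's distributed map, but the per-partition function and the summing step keep the same shape.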

Foursquare announced the open-sourcing of parts of their platform for loading data from MongoDB databases to Hadoop. The software, written in Scala, includes a tool for dumping data from MongoDB to BSON (binary JSON, Mongo’s native format) files as well as Hadoop input formats for reading the data within MapReduce jobs. The blog post announcing the open-source project also contains an overview of the process that they used to back up data stored in MongoDB to HDFS.
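
For readers unfamiliar with BSON dumps, the sketch below shows why such files are straightforward to read record by record: each BSON document starts with a little-endian int32 giving its total size in bytes and ends with a zero byte, and a dump file is simply documents written back to back. This is a minimal Python illustration with a hypothetical function name (iter_bson_documents); it is not Foursquare's Scala code, and it only splits the stream into raw documents without decoding their fields.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield each raw BSON document (as bytes) from a binary stream."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        # The size counts the 4-byte prefix itself plus the trailing zero byte.
        (size,) = struct.unpack("<i", header)
        if size < 5:
            raise ValueError(f"invalid document size: {size}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1] != 0:
            raise ValueError("truncated or malformed document")
        yield header + body


# Two hand-built documents: the empty document {} and {"a": 1} (int32 field).
empty_doc = b"\x05\x00\x00\x00" + b"\x00"
int_doc = (
    b"\x0c\x00\x00\x00"    # total size: 12 bytes
    + b"\x10" + b"a\x00"   # element type 0x10 (int32), field name "a"
    + b"\x01\x00\x00\x00"  # value 1, little-endian
    + b"\x00"              # document terminator
)
docs = list(iter_bson_documents(io.BytesIO(empty_doc + int_doc)))
print([len(d) for d in docs])
```

A reader for MapReduce would need the same length-prefix logic to find record boundaries, plus some way to handle a document that straddles an input split.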

There is a new version of the Kiji “Dashi” BentoBox. Version 1.4.1 includes new versions of all eleven components in the Kiji stack.


Curated by Mortar Data


Washington State

Winter 2014 Seminar Series: Big Data Infrastructure (Tacoma, WA) - Wednesday, February 5


Demo & Feedback on Hue 3.5 and latest features (San Francisco, CA) - Wednesday, February 5

Meet the Big Data Experts from Cloudera Impala and BIRT (San Francisco, CA) - Thursday, February 6

Hadoop Developer Training - Instructor-Led (Calabasas, CA) - Saturday, February 8

Microsoft Big Data Hackathon, co-hosted by Hortonworks and Elastacloud (Mountain View, CA) - Saturday, February 8 and Sunday, February 9


Life on Hadoop: Lessons from Building Large-Scale Advertising Pipelines (Champaign, IL) - Monday, February 3


Savanna: Elastic Hadoop on OpenStack (Rocky Hill, CT) - Tuesday, February 4

District of Columbia

Let's get hands on with SQL for Hadoop & HBase - Developer Day (Washington, DC) - Tuesday, February 4


Big Data Analyst with Hadoop (Tel Aviv-Yafo, Israel) - Monday, February 3


Africa Big Data Cohort (Johannesburg, South Africa) - Tuesday, February 4


Exploring Flume’s capabilities over Hadoop and Forecasting in R (Delhi, India) - Wednesday, February 5

Chandigarh Hadoop Meetup February (Chandigarh, India) - Saturday, February 8


Toronto Hadoop User Group Monthly Hackathon (Toronto, Canada) - Thursday, February 6


Lyon Hadoop User Group First Meetup (Lyon, France) - Thursday, February 6