Data Eng Weekly

Hadoop Weekly Issue #45

24 November 2013

This is one of the smallest editions of Hadoop Weekly in a while, but it's full of great articles. One of my favorites is an overview of how to explain Hadoop to a non-geek. I've also included a link to the article that sparked a bit of controversy on the Twittersphere around benchmarking of SQL on Hadoop. It also looks like MapR is getting ready for an IPO, given that they've appointed a CFO with IPO experience. Enjoy!


A post by Gruter CTO and Chief Architect Hyeong-jun Kim sparked a lot of conversation and controversy this week. After providing one of the best summaries of the various SQL-on-Hadoop products, the post covers the key performance considerations for SQL-on-Hadoop engines -- query planning, file format, and data scan speed. Regarding the last of these, the author observes that query performance is limited by the aggregate throughput of the disks in HDFS. Given this constraint, it's really difficult to get more than 1.5-3x the throughput of Hive. The final point, which has stirred up quite a storm, is that performance comparisons tend to cherry-pick short queries, which suffer under Hive because of MapReduce overhead.
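To put that scan-speed bound in perspective, here's a back-of-envelope sketch. All the cluster numbers below are made up for illustration, not from the post; the point is simply that no engine can finish a full-scan query faster than the disks can deliver the bytes.

```python
# Lower bound on full-scan query time: dataset size divided by the
# cluster's aggregate disk throughput.
def scan_seconds(dataset_gb, nodes, disks_per_node, mb_per_s_per_disk):
    """Seconds to read `dataset_gb` at the cluster's aggregate disk throughput."""
    aggregate_mb_per_s = nodes * disks_per_node * mb_per_s_per_disk
    return dataset_gb * 1024 / aggregate_mb_per_s

# Hypothetical cluster: 20 nodes x 12 disks x ~100 MB/s per disk, 1 TB scan.
print(round(scan_seconds(1024, 20, 12, 100), 1))  # → 43.7 (seconds)
```

Columnar formats and compression effectively shrink `dataset_gb`, which is one reason file format appears alongside raw scan speed in the post's list of considerations.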

Hadoop management software tends to use passwordless SSH as the default mechanism for bootstrapping the agents that run on each node in the cluster. This strategy has security implications, so there's often a second option to configure the agents yourself. For Cloudera Manager, this is referred to as "Install Path B", and the following post covers using Puppet (a configuration management system) to set up all the management daemons needed for Cloudera Manager.

If you've ever worked with Apache Oozie for workflow management on Hadoop, you're familiar with its verbose XML job definitions and configuration. Cloudera has recognized this issue and is working on a number of improvements to make job definitions more concise. This post summarizes the various ways to keep a workflow definition shorter.
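For a sense of the verbosity, here's a roughly minimal single-action workflow. The action name, mapper class, and schema version below are placeholders of my own, not taken from the post:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="mr-step"/>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.DemoMapper</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Even this skeleton runs to ~20 lines before any real job configuration, which is the verbosity Cloudera's improvements aim to cut down.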

Databricks, the company founded to commercialize Apache Spark (incubating), has written a guest post on the Cloudera blog. The post covers the types of computation at which Spark excels (usually when the dataset fits in memory), including interactive analysis, iterative algorithms, and real-time stream processing. Given that Cloudera has just announced support for Spark, I suspect we'll start to see folks using Spark to solve new problems or improve existing applications.

The Cloudera blog also features a post on backup and disaster recovery for HBase. There are a number of possible strategies to do this, from snapshots to DistCP of files in HDFS. The post covers six such strategies, including details of how most are implemented. There's also a handy table comparing the various approaches.

Finding the right log in Hadoop has always been more of an art than a science -- with data for map/reduce tasks, tasktrackers, datanodes, and more scattered across many different nodes and directories. Given the architecture change in YARN (namely the removal of long-lived tasktrackers), a new strategy was needed to manage and investigate log files. The Hortonworks blog has a detailed post about the status of logging in Hadoop 1.x, the new log aggregation system in Hadoop 2.x, and the usage & administration of the new system.
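If you want to try the new system, log aggregation is switched on with a single yarn-site.xml property (a minimal sketch; see the post for the retention and remote-directory settings that go with it):

```xml
<!-- yarn-site.xml: have NodeManagers copy container logs to HDFS on completion -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```

Once enabled, the aggregated logs for a finished application can be fetched in one place with `yarn logs -applicationId <application id>` rather than by hunting across NodeManager hosts.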


MapR announced that they’ve hired Dan Alter as their CFO. The announcement notes that he has led previous companies through IPOs, foreshadowing MapR’s future. Alongside the appointment, MapR revealed that they have over 500 customers.

If you're working on Hadoop (or any distributed system), it can be really challenging to explain to a non-techie what it is that you do. InformationWeek has pulled out the key points from a video by Mike Gualtieri of Forrester explaining Hadoop to non-geeks. It contains some great tips for explaining a complicated topic.

Supercomputer company Cray Inc. has announced a new hardware/software Hadoop solution aimed at the scientific computing industry. The Cray Framework for Hadoop package is built for running Apache Hadoop on the Cray XC30 supercomputer. Technical details are sparse in the press release, but there is mention of support for the Lustre file system.


Tajo, the Apache incubator project, released version 0.2.0-incubating. Tajo calls itself a 'big data warehouse system on hadoop' -- i.e. an SQL-on-Hadoop engine. As far as I can tell, it has similar features/goals to many of the other SQL-on-Hadoop engines, but it has some interesting differentiators, such as a focus on ETL, task retry support for long-running queries, and an IDL layer that is closer to PostgreSQL than to the MySQL dialect many other projects follow. This release has a number of improvements, such as cost-based join optimization and table subqueries.

Cascading Lingual, the SQL interface on Cascading, has reached version 1.0. Lingual supports ANSI SQL-99, has a catalog of database tables, and supports loading data from various systems as part of a single SQL query. I haven't used Lingual, but my impression is that it's positioning itself as a more complete/compatible Hive competitor rather than one of the low-latency SQL-on-Hadoop systems (but I suspect they'll get there, too).

Alongside the release of Lingual, Concurrent announced the release of Cascading 2.5. This new version adds support for Hadoop 2, improved performance on complex joins, and improved compatibility with a number of Hadoop vendors.


Curated by Mortar Data

Monday, November 25

Hadoop Overview and Big Data Trends (Dubai, UAE)

Tuesday, November 26

Building Efficient Solutions with Spark and Cassandra | Node.js for Cassandra (Madrid, Spain)

Wednesday, November 27

Writing Hadoop jobs in Scala using Scalding (Barcelona, Spain)

Thursday, November 28

Solution Scrum (Toronto, Ontario)