Data Eng Weekly

Hadoop Weekly Issue #26

14 July 2013

Videos from last month's Hadoop Summit have been posted on YouTube, so there's plenty of content to keep you busy for weeks to come. In addition, this week's newsletter features articles covering Cloudera Morphlines, HBase, Blur, and more, as well as the 2.0-beta release of Apache Cassandra, which packs some powerful new features.


Cloudera Morphlines is a system for lightweight ETL: a configuration file defines a chain of transformations that are applied to records of data. Morphlines was recently integrated into the Cloudera Developer Kit, which includes a runtime that compiles morphlines to Java classes (providing performance benefits). This post includes an overview of the processing model, data model, and common use cases, as well as a tutorial that uses morphlines to index syslog data in Apache Solr.
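To make the idea of config-driven record transformations concrete, here is a minimal Python sketch. It is not the Morphlines API or its configuration syntax: the command names (parse_syslog, lowercase, drop_field), the CONFIG layout, and the sample syslog line are all made up for illustration. It only shows the general pattern of a configuration that names a chain of commands, each of which takes a record and passes it on.

```python
import re

# Hypothetical syslog layout: <priority>timestamp host message
SYSLOG_PATTERN = re.compile(
    r"<(?P<priority>\d+)>"
    r"(?P<timestamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) "
    r"(?P<host>\S+) "
    r"(?P<message>.*)"
)

def parse_syslog(record):
    """Split the raw line in record['message'] into named fields."""
    match = SYSLOG_PATTERN.match(record["message"])
    if match is None:
        return None  # a command may drop a record by returning None
    return {**record, **match.groupdict()}

def lowercase(record, field):
    """Lowercase the value of one field."""
    return {**record, field: record[field].lower()}

def drop_field(record, field):
    """Remove one field from the record."""
    return {k: v for k, v in record.items() if k != field}

# Registry of available commands (names are illustrative only).
COMMANDS = {
    "parse_syslog": parse_syslog,
    "lowercase": lowercase,
    "drop_field": drop_field,
}

# The "configuration file": an ordered chain of commands with parameters.
CONFIG = [
    {"command": "parse_syslog"},
    {"command": "lowercase", "field": "host"},
    {"command": "drop_field", "field": "priority"},
]

def run_pipeline(config, record):
    """Apply each configured command in order; stop if a record is dropped."""
    for step in config:
        params = {k: v for k, v in step.items() if k != "command"}
        record = COMMANDS[step["command"]](record, **params)
        if record is None:
            return None
    return record

line = "<34>Jul 14 10:15:00 WebHost01 sshd: session opened"
doc = run_pipeline(CONFIG, {"message": line})
```

In this sketch the resulting doc is a plain dictionary that could be handed to an indexer; changing the pipeline means editing CONFIG rather than the code.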

A video of the June Apache Kafka Meetup was posted on YouTube. There were talks on the 0.8 beta as well as use case studies from folks at RichRelevance, Netflix, and Square.

Cloudera Manager provides cluster management for components in the CDH stack -- providing service configuration, lifecycle management, and support for various hard-to-set-up features (such as NameNode HA). This post discusses the inner workings of Cloudera Manager, covering everything from the data model / vocabulary to the agent-server architecture to monitoring and metric collection.

Videos and slides of presentations at Hadoop Summit in June were posted online. With two days of talks across seven tracks, there is a ton of high-quality content to consume -- from the inner workings of YARN to using Hadoop for business intelligence.

The Apache Incubator project Blur provides a mechanism to index and search data stored in HDFS using Apache Lucene, Apache ZooKeeper, and Apache Thrift. Blur predates recently announced systems from Cloudera and MapR, and also has a different architecture (the others are built using Apache Solr).

This tutorial covers building a recommendation engine on Hadoop with Microsoft Azure's HDInsight, C# .NET, and Mahout. The tutorial uses an XML-encoded dataset from StackExchange and Mahout's co-occurrence similarity algorithm to predict questions that a user might answer based upon her history.
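For readers new to co-occurrence-based recommendations, the core idea can be sketched in a few lines of Python. This is not Mahout's implementation (which runs as jobs on Hadoop) and not the tutorial's code; the user names, question ids, and scoring rule below are invented for illustration. Two questions count as similar when the same user answered both, and a user is recommended the unanswered questions that most often co-occur with the ones she has answered.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical answer history: user -> set of question ids answered.
history = {
    "ann": {"q1", "q2", "q3"},
    "bob": {"q1", "q2"},
    "cat": {"q2", "q3", "q4"},
    "dan": {"q1"},
}

def cooccurrence(history):
    """Count, for every pair of questions, how many users answered both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(user, history, counts, top_n=2):
    """Rank unanswered questions by summed co-occurrence with answered ones."""
    seen = history[user]
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken by question id for determinism.
    return sorted(scores, key=lambda q: (-scores[q], q))[:top_n]

counts = cooccurrence(history)
for_dan = recommend("dan", history, counts)
for_bob = recommend("bob", history, counts)
```

A real system would weight or normalize these counts and would need a distributed computation at StackExchange scale, which is the part Mahout takes care of.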

In the third part of the series on the Apache HBase REST API, Jesse Anderson covers the multi-get API. The post has some sample code in Python to parse both XML and JSON responses, as well as examples of using curl to fetch data.
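As a rough illustration of what parsing a JSON response involves, here is a small Python sketch. It is not the code from the post. It assumes the JSON layout the HBase REST server uses for rows (a "Row" list whose entries carry a base64-encoded "key" and a "Cell" list with base64-encoded "column" and "$" value fields), and it works on a locally built sample rather than a live server; the table contents and the helper names are made up.

```python
import base64
import json

def b64(text):
    """Base64-encode a string the way row keys and values appear in the JSON."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# Hypothetical response body for a two-row multi-get.
sample_response = json.dumps({
    "Row": [
        {"key": b64("row1"),
         "Cell": [{"column": b64("cf:name"), "timestamp": 1373760000000, "$": b64("alice")}]},
        {"key": b64("row2"),
         "Cell": [{"column": b64("cf:name"), "timestamp": 1373760000001, "$": b64("bob")}]},
    ]
})

def decode_rows(body):
    """Turn a REST-style JSON body into {row_key: {column: value}}."""
    result = {}
    for row in json.loads(body).get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            cells[column] = base64.b64decode(cell["$"]).decode("utf-8")
        result[key] = cells
    return result

rows = decode_rows(sample_response)
```

Against a real server the body would come from an HTTP request sent with an "Accept: application/json" header; the decoding step stays the same. Values are decoded as UTF-8 strings here, which is an assumption that only holds for text data.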


Somehow I missed this excellent post about the genesis of YARN in last week's issue. Wired has an interview with Arun C. Murthy about the 3am "phone call that changed big data" (i.e. when Arun first realized that MapReduce wasn't enough), and it has some choice quotes like, "Who the hell writes systems software in Java?" It's a good read about the evolution of Hadoop.

Syncsort surveyed 300 big data conference attendees in March and April. They found that 60% of respondents were using or experimenting with Hadoop, and that Cloudera's distribution had the highest adoption (followed by Apache Hadoop and Hortonworks HDP).

ZDNet has an interview with Hadoop founder Doug Cutting about the past, present, and future of Hadoop. Doug says that the killer query features -- SQL and search -- are already built into Hadoop, and that the future is going to focus on tools that increase adoption. One example is Apache Accumulo-style cell-level permissions on data in HBase.

Think Distributed is a new podcast covering distributed systems. The first episode, featuring panelists from Basho, Stanford, and UC Berkeley, covers distributed consensus algorithms, in particular the Raft algorithm. I haven't listened yet, but there have been a lot of good things said about it in the Twitterverse.

Merv Adrian from Gartner posted part one of his recap from Hadoop Summit. He makes a few insightful points -- Hadoop is growing, YARN is taking much longer than everyone expected, and vendor offering fragmentation is causing a lot of confusion.

MapR's distribution has been available as an option in Amazon Elastic MapReduce for a while now, but this week MapR announced that the latest version of their distribution, M7, is now available on AWS, too. Pricing for M3, M5, and M7 is tiered and is charged in addition to the regular EMR charges. MapR says that M7 can run on all instance types, and they've seen over 100k ops/sec running benchmarks on SSD instances.

Continuuity, which was co-founded by Todd Papaioannou (formerly VP at Yahoo) and Jonathan Gray (formerly HBase engineer at Facebook), announced that Papaioannou has stepped down as CEO and that Gray has taken over. Papaioannou remains an advisor.


The Buri Bento Box was released -- it contains updates to KijiSchema, KijiMR, KijiREST, and KijiExpress. Most notably, KijiREST now supports bulk POST operations.

The Cloudera ODBC 2.0 Connector for MicroStrategy was released. This version uses HiveServer2 and supports CDH4.2+ and Impala 1.0+.

Cloudera Manager 4.6.1 was released. This version fixes two high-priority bugs -- the first in the 4.5.x->4.6.x upgrade and the second with setting up NameNode Federation.

Cloudera Developer Kit 0.4.1 was released. It contains a bunch of updates to the recently released Morphlines library, including updates to documentation and examples.

Cassandra 2.0.0-beta was released -- it's a preview release, but it has a number of notable features, including eager retries, improved compactions, triggers, and compare-and-set.
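Compare-and-set is the least self-explanatory item on that list, so here is a toy Python sketch of the semantics. It says nothing about how Cassandra implements the feature across replicas; the CasStore class and its method names are hypothetical, and a single in-process lock stands in for the coordination a distributed database needs. The point is only that a write succeeds when the current value matches what the caller expected, and is rejected otherwise.

```python
import threading

class CasStore:
    """Toy in-memory key-value store with a compare-and-set operation."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        """Return the current value, or None if the key is absent."""
        with self._lock:
            return self._data.get(key)

    def compare_and_set(self, key, expected, new):
        """Write `new` only if the current value equals `expected`.

        `expected=None` means "only if the key does not exist yet".
        Returns True when the write was applied, False otherwise.
        """
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

store = CasStore()
first_claim = store.compare_and_set("username:jdoe", None, "user-1")
second_claim = store.compare_and_set("username:jdoe", None, "user-2")
handover = store.compare_and_set("username:jdoe", "user-1", "user-3")
```

The first claim wins, the second is rejected because the key already exists, and the handover succeeds because the caller named the current owner correctly. This is the classic "reserve a unique username" use case that conditional writes are meant to make safe.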