Data Eng Weekly

Hadoop Weekly Issue #82

10 August 2014

We’re in the midst of a summer lull, so this week’s issue is shorter than usual. The lack of quantity is made up for in great quality, though. Technical posts cover YARN, HBase, Accumulo, and building an EMR-like local dev environment. There is also news on Actian, Adatao, Splice Machine, and the HP-Hortonworks strategic partnership. Hopefully there’s something for everyone!


The Hortonworks blog has a post on the ongoing work to improve the fault-tolerance of YARN’s ResourceManager (RM). This post describes phase two of the RM restart resiliency work, which aims to keep existing YARN applications running during and after an RM restart. The post covers the architecture of the solution, including which cluster state information is stored where.

Hortonworks has another post in their series on curated Hadoop Summit content. This time, it focusses on Hadoop security. They highlight four sessions covering recent improvements in Hadoop security, security for the Apache Knox Gateway’s REST APIs, using Hadoop for threat detection, and the future of Hadoop security.

The Apache blog has a post with updated performance evaluations of various HBase BlockCache configurations. They find that the on-heap LruBlockCache performs best and the next best configuration is the CombinedBlockCache:OffHeap (a hybrid of an L1 LruBlockCache and an L2 BucketCache, which stores data off-heap). The post has details on the experimental setup and a deeper analysis of the results.

An obstacle to adopting AWS’ Elastic MapReduce (EMR) can be building a local dev environment that matches EMR. While Amazon’s distribution isn’t open-source, this post describes how to set up an approximate local environment on a Mac. It shows you how to make configuration changes for S3 URIs, set up the AWS access keys, and add LZO compression support to Hadoop.

This is a good introduction to Apache Accumulo, the distributed key-value store built on HDFS. It describes the architecture at a high level, contrasts it with Apache HBase, and covers the data model (including column visibility), several use cases, and more.
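
To make column visibility concrete: each Accumulo entry carries a label expression (labels combined with & and |, grouped with parentheses), and a reader sees the entry only if their set of authorizations satisfies it. The following is a minimal Python sketch of that check, not Accumulo's own implementation and not code from the linked post; the function name is_visible and the label names are hypothetical.

```python
import re

# A token is either a label (simplified here to letters, digits, underscore)
# or one of the operators & | ( ).
TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def is_visible(expression, authorizations):
    """Return True if the reader's authorizations satisfy the expression."""
    if expression == "":
        return True  # an entry with an empty visibility is readable by anyone
    tokens = TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError("unexpected character in expression")
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError("expected a label, got " + tok)
        return tok in authorizations  # a bare label: does the reader hold it?

    def parse_expr():
        nonlocal pos
        value = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                # Mixing & and | at one level must be made explicit.
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token " + tokens[pos])
    return result
```

For example, is_visible("admin&(audit|ops)", {"admin", "ops"}) is True, while the same expression with only {"admin"} is False.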

This post looks at using Hadoop and new libraries for iterative computation, such as k-means clustering. It describes Iterative MapReduce, the Twister Programming Model, the Collective Model (the Harp project), and more. It also includes experimental results comparing various frameworks on PageRank, k-means, and broadcast.
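
As an illustration of why k-means is an iterative workload, here is a minimal single-machine Python sketch (requires Python 3.8+ for math.dist) that expresses each pass as a map-like assignment step followed by a reduce-like averaging step. It is not taken from the post and does not use Twister or Harp; the function names and the stopping tolerance are assumptions made for this example.

```python
import math


def assign(points, centroids):
    """Map-like step: pair each point with the index of its nearest centroid."""
    return [
        (min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i])), p)
        for p in points
    ]


def recompute(assignments, centroids):
    """Reduce-like step: average the points assigned to each centroid."""
    groups = {}
    for index, point in assignments:
        groups.setdefault(index, []).append(point)
    updated = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            updated.append(old)  # leave an empty cluster's centroid unchanged
        else:
            updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return updated


def kmeans(points, centroids, max_iters=20, tol=1e-9):
    """Repeat assign + recompute until the centroids stop moving."""
    for _ in range(max_iters):
        updated = recompute(assign(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

Every pass needs the centroids produced by the previous one. With plain MapReduce, each pass is typically a separate job that re-reads its input, which is the overhead iterative frameworks aim to reduce.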


Videos of the 2014 Accumulo Summit, which took place in June, have been posted online. There are presentations from folks at Sqrrl, Cloudera, Hortonworks, and more.

The Hortonworks blog has a post from the HP team on the recent strategic partnership between Hortonworks and HP. It has some specifics on the partnership, including that Apache Ambari will be integrated with HP Operations Manager i (OMi).

Adatao announced a $13M Series A round of funding. The company makes pInsights, a predictive analytics and business intelligence solution built on Apache Spark. They also make pAnalytics, a system aimed at data scientists.

Splice Machine, a startup building a relational database on HBase, announced that its Series B round was increased by $3M to $18M in total. The latest money comes from Correlation Ventures.

Outspoken Hadoop skeptic and prolific DBMS researcher/creator Michael Stonebraker has written a post with the provocative title “Hadoop at a Crossroads?” He argues that with the death of MapReduce (focussing in particular on next-generation SQL-on-Hadoop systems), Hadoop (and its vendors) is on a collision course with data warehouse systems. The post also questions the future of HDFS, which he predicts might fall victim to specialized storage layers.

Actian recently announced the Actian Vector Hadoop Edition, which is a SQL-on-Hadoop system. This post has more details on the integration, including how Actian uses HDFS (it has a proprietary file format) and YARN.

Datanami has a post on Sinequa, makers of enterprise search software. The most recent version of their software adds support for analyzing data stored in HDFS and a handler for Apache Mahout to perform analysis using its algorithms.

GigaOm has an article exploring some of the recent momentum of Apache HBase. While Cassandra and MongoDB have seen a lot of press coverage and adoption, HBase is gaining steam. Specifically, it has good integration with the Hadoop ecosystem and a number of companies are starting to build applications on top of it (e.g. Continuuity’s Reactor and Splice Machine’s relational database).


Apache Drill 0.4.0 was released this week. Drill is general-purpose analytics software that strives to build a more general framework than existing systems (i.e. SQL-on-Hadoop) by supporting a wide variety of storage systems/formats and queries. The 0.4.0 release is a massive step forward with 100,000 lines of new code from a wide variety of contributors. The Apache Blog has the highlights of the new release.

ZooKeeper 3.5.0-alpha was released this week. The release resolves over 500 Jira tickets, which include a large number of bug fixes and improvements. Among the improvements are the ability to dynamically reconfigure the ZooKeeper ensemble, improvements to recovery, better support for JDK 7 and OpenJDK, and more.


Curated by Mortar Data



August SF Hadoop Users Meetup (San Francisco) - Wednesday, August 13

Apache HBase: Understanding Where to Use It and How to Use It, with Subash DSouza (Los Angeles) - Wednesday, August 13

Apache Solr (Irvine) - Thursday, August 14


Introduction to Spark Course: Spark Streaming (6 of 7) (Austin) - Wednesday, August 13


Using Apache Drill (Chicago) - Wednesday, August 13


Distributed Data Storage: Comparing Cassandra, HBase, ElasticSearch and GridGain (Conshohocken) - Wednesday, August 13

New York

Neo4j Intro Workshop (New York) - Tuesday, August 12


A Leap Forward for SQL on Hadoop (Manchester) - Tuesday, August 12


Workshop: SQL on Hadoop (Moscow) - Friday, August 15