Data Eng Weekly

Hadoop Weekly Issue #21

09 June 2013

Cloudera announced Cloudera Search this week, bringing Solr/Lucene search to their distribution. In addition, Apache Hadoop had two releases -- for the 0.23 and 2.0 branches, and Cloudera announced a number of updates across their offerings. There are also a huge number of interesting technical articles. The quantity (and quality) of articles this week was even better than normal, and I had to make some tough decisions to avoid turning this week's newsletter into a novella (and it's still fairly massive as is!).


An overview of compiling Impala, the SQL-on-HDFS solution for Hadoop from Cloudera. Since the initial public release, Cloudera has regularly updated github with the source alongside each release. Given that Impala does things like code generation at runtime, there are a lot of pieces that need to be built.

It's pretty overwhelming to try to keep on top of all of the SQL-on-Hadoop solutions, and how they differ from one another. The DBMS2 blog has a post comparing the implementation of Hadapt, Greenplum's HAWQ/sql-H, and Microsoft's Polybase (which I haven't covered before). It also mentions Impala in the comments. The main areas of difference seem to be how data is stored in HDFS (i.e. open or proprietary formats) as well as how it's processed -- fully imported to the RDBMS or otherwise.

Owen O'Malley from Hortonworks presented on the ORC file format -- the successor to Hive's RCFile, which is included in Hive 0.11. ORCFile has a number of state of the art features such as bit-packing for integers, run-length encoding, column projection on read, and many more. ORCFile shares a lot of similarity to the Parquet file format, and it will probably gain a lot of momentum since it's built into Hive.

Data in HDFS is typically stored with a replication factor of 3. Folks storing large amounts of data have been trying to figure out ways to save costs by using erasure codes rather than a third copy. Facebook has written about doing this, and a new paper from folks at USC and Facebook details their new approach with improvements over Reed-Solomon encoding.

Some folks taking part in Hack Reactor have been using Pivotal's Analytics Workbench (the 1000-node cluster that Pivotal has made available to non-profit and other organizations) to set the world-record for the N-queens. They're also working on a new distributed computation framework called Smidge that runs as JavaScript in the browser.

The folks at amplab have open-sourced a benchmarking suite for large scale query engines. Their first results capture performance of Redshift, Impala, Shark, and Hive 0.10 while running in Amazon EC2. They plan to regularly update the results (adding in Hive 0.11 could be interesting), and the website contains instructions for running the benchmarks yourself.

Just days before the release of Cloudera Manager 4.6 (more below), the Cloudera blog featured a post on role group support in Cloudera Manager (I assume the same instructions apply to version 4.6). Role groups can be used for heterogenous clusters or for maintaining different configurations across clusters.

Perhaps old news, but the first time I came across it -- this is a tutorial for running Spark and Shark on Elastic MapReduce. Looks pretty easy -- and might be a good way to evaluate them if you have some data already in S3.

Facebook spoke about Presto, their internal low-latency SQL system that complements Hive. The say that it is over 10x faster than Hive (and can answer many questions in <1s) and is also much more CPU efficient. They have plans to open-source it this fall.

A presentation with some details about how the NSA is using Accumulu (the BigTable system they open-sourced in 2011) and MapReduce to do analysis on petabyte-sized graphs. They have some background, and overview of using Accumulu+MapReduce for graph processing, and some benchmarks.

Instagram started using Cassandra just over 6 months ago, and they've grown from a 3-node to a 12-node cluster. They run in Amazon on hi1.4xlarge instances (which have SSDs), and at peak they do 20,000 writes and 15,000 reads per second. The performance, cost-savings, and administrative benefits lead them to migrate to Cassandra from Redis.

Hive doesn't support row-level updates (data must be rewritten), but with some tricks you can support the appearance of updates. In short, store a timestamp with each row and group by the id and order by the timestamp to find the latest version of the record. This blog post details the strategy.


HadoopSummit is less than a month away, and Hortonworks has a list of reasons that you should go. In addition to talks, lightning talks, and meetups, the Wednesday night party at the Tech Museum should be a lot of fun (if last year's party is any indication).

Cloudera made a big announcement this week -- Cloudera Search. Like Hive opened Hadoop to the masses of folks proficient in SQL, Cloudera Search is meant to open Hadoop to anyone that can do a web search. Cloudera's implementation uses SolrCloud, which is a highly-available distributed indexing system. It can also do near-real time indexing of data as it flows into HDFS via Flume. A few weeks ago, MapR announced a similar product by partnering with LucidWorks -- but so far public details of that offering are a bit hard to come by.


Cloudera Manager 4.6 released. While only a minor version update, the free version sees a bunch of new features that were previous only available to enterprise customers -- proactive health checks, the ability to manage multiple clusters, log management/search, and kerberos support. It also has quite a few new features like support for Sqoop, WebHCat, and JobTracker HA as well as the recently announced Cloudera Search Beta.

Hadoop 0.23.8 released. It contains several bug fixes for problems found in 0.23.7 and some backports from trunk.

Hadoop 2.0.5-alpha released. This release focused on fixing compatibility bugs with downstream projects discovered with testing in Apache BigTop.

Impala 1.0.1 released. This version fixes compatibility with Cloudera Manager 4.2.6, and it also adds a few features (e.g. the ability to direct output to a file rather than the shell) and bug fixes (compatibility with Parquet, RCFile, and HBase).

Cloudera Developer Kit 0.3.0 released. This version adds support for Crunch, partitioned datasets, and a bunch of new examples.


Curated by Mortar Data ( < )>

Monday, June 10
The Stinger Initiative: 100x Hive Performance Improvements (New York, NY)

Tuesday, June 11
Data Ride with Doug Cutting (London, UK)

Tuesday, June 11
Distributed HBase and Large Object Storage Use Cases (Chicago, IL)

Tuesday, June 11
Hadoop Lab Training (Los Angeles, CA)

Tuesday, June 11
Twitter's Storm (Boca Raton, FL)

Tuesday, June 11
Apache Kafka at Spotify (+) Pentaho Data Integration and Hadoop (Stockholm, Sweden)

Wednesday, June 12
Hadoop use cases for financial services and insurance (Chicago, IL)

Wednesday, June 12
June 2013 San Francisco Hadoop Meetup (San Francisco, CA)

Wednesday, June 12
Big Data Technologies in Healthcare (Los Angeles, CA)

Wednesday, June 12
Hadoop / Big Data Review (Columbus, OH)

Wednesday, June 12
Insights to Big Data and Quality (Woodland Hills, CA)

Wednesday, June 12
Monthly Storm Meetup (Cambridge, MA)

Thursday, June 13
Big Data Science Practice + Algo Implementation #GLM (Mountain View, CA)

Thursday, June 13
A guide to Python frameworks for Hadoop (New York, NY) - hosted at FoursquareHQ - come say hi!

Friday, June 14
Big Analytics 2013 (Atlanta, GA)

Saturday, June 15
Functional Programming for Real-time Big Data Streaming