Data Eng Weekly


Hadoop Weekly Issue #20

02 June 2013

Two of the most prominent Hadoop distributions, Cloudera's CDH and Hortonworks' HDP, both saw releases this week. There are a few interesting new projects and some details on recent releases (Hive and SyncSort), as well as the usual slew of interesting technical articles about various components in the ecosystem (ZooKeeper, Cassandra, HBase). We're also celebrating the 20th issue of Hadoop Weekly with our 600th subscriber. Thanks for spreading the word!

Technical

ZooKeeper provides a set of powerful primitives for distributed consensus and locking, but there are a lot of edge cases and gotchas to consider when using it. The Apache incubator project Curator is a framework that addresses most of the edge cases and also implements several common recipes. This blog post discusses some of the edge cases that Curator addresses, which should motivate you to use it rather than the ZooKeeper API directly.

http://blog.cloudera.com/blog/2013/05/zookeeper-made-simpler/

With Apache Hive 0.11 released last week, Hortonworks delivered on the first phase of their three-phase "Stinger initiative" to make Hive 100x faster. This post explores why Hortonworks is betting big on Hive (and YARN) rather than a separate SQL engine (there are lots of examples of this -- Cloudera's Impala, EMC's Pivotal HD, and more). The article has a few interesting insights into the history and future of Hive.

http://gigaom.com/2013/05/29/why-hortonworks-is-riding-a-faster-hive-to-the-bitter-end/

Eric Baldeschwieler aka Eric14 recently presented on Hadoop at JPL -- he covers the history of Hadoop, a number of common and interesting use-cases, and the future of Hadoop.

http://www.slideshare.net/jeric14/201305-hadoop-jplv3

Sqoop is a system for import/export between Hadoop (HDFS/Hive/HBase) and relational database systems. Kathleen Ting and Jaroslav Cecho are writing a cookbook for Sqoop, and it's available for pre-order (estimated to ship this month).

http://shop.oreilly.com/product/0636920029519.do

The DBMS2 blog has some more details on SyncSort's DMX-h ETL and DMX-h Sort editions announced last week. Interestingly, the ETL solution is not focussed on getting data into/out of Hadoop, but rather on using Hadoop as an ETL engine (which seems to have overlap with MapReduce). Lots of interesting details here.

http://www.dbms2.com/2013/05/29/syncsort-extends-hadoop-mapreduce/

The folks at DataStax have summarized new features in recent releases of Cassandra 1.2.x. There are some interesting lessons here -- such as the increased performance with the -XX:+UseTLAB JVM option and moving to LZ4 compression from Snappy.

http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox/

Intel announced their Hadoop distribution a few months ago, and it has some really interesting features like processor-optimized encryption, the Intel Manager for Hadoop, and Active Tuner for automated job-specific tuning. As the article mentions, the Intel distribution also opens up Hadoop to enterprises that might not trust a non-enterprise software vendor.

http://www.datacenterdynamics.com/focus/archive/2013/05/intel-%E2%80%93-new-gorilla-hadoop-distributions

Nick Dimiduk, one of the authors of HBase in Action, recently gave a talk about the HBase architecture at the Seattle Technical Forum's Big Data Deep Dive. He posted the slides and the transcript, which provide a good overview of the HBase architecture and design goals.

http://www.n10k.com/blog/hbase-for-architects/

Weave is a new framework from Continuuity that aims to make writing and running YARN applications as "simple as running threads." The framework has a bunch of interesting pieces -- from application lifecycle management with ZooKeeper to log/metric aggregation with Kafka to a simple and straightforward API.

https://github.com/continuuity/weave

Mortar has open-sourced their Pig Loader and UDF template project. Given all the flavors of Pig Loaders and UDFs, this project should help anyone writing a new one get down to the business of implementing the logic for her use case.

http://blog.mortardata.com/post/51643568304/java-loaders-and-udfs-for-apache-pig-ditching-the

Releases

Cloudera's CDH 4.3 was released. It includes new versions of HBase, HCatalog, Oozie, Pig, and fixes/feature improvements to HDFS, Flume, Hue, and Sqoop. One notable feature that Cloudera has highlighted is the ability to balance data among disks on the same datanode.

https://blog.cloudera.com/blog/2013/05/cdh-4-3-is-released/

Hortonworks Data Platform (HDP) 1.3 was released with new versions of every component in the Hadoop stack. Major new features include NFS access to HDFS, HBase Master high availability, an improved Ambari, and a new version of Hive with the first speedups from the Stinger initiative.

http://hortonworks.com/about-us/news/new-hortonworks-data-platform-release-showcases-power-of-open-source-innovation/
http://hortonworks.com/products/hdp/hdp-1-3/

Cassandra 1.1.12 was released with a few bug fixes.

https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-1.1.12

Events

Curated by Mortar Data (http://www.mortardata.com)

Tuesday, June 4
Intro to Hive and HCatalog - Calling all Beginners (New York, NY)
http://www.meetup.com/Hadoop-NYC/events/119135752/

Tuesday, June 4
June Hive User Group Meetup in SF (San Francisco, CA)
http://www.meetup.com/Hive-User-Group-Meeting/events/118637862/

Tuesday, June 4
Bridging Audiences across Devices at Scale (San Mateo, CA)
http://www.meetup.com/Data-Mining/events/117230492/

Wednesday, June 5
Getting Value from Your Data; Cognitive Technology Event Warning System (Arlington, VA)
http://www.meetup.com/Hadoop-DC/events/117489812/

Wednesday, June 5
June Hadoop Meetup: Dremel, Hive & Pig (London, UK)
http://www.meetup.com/hadoop-users-group-uk/events/120740832/

Wednesday, June 5
DataPhilly June 2013 - Hadoop: BigSheets & Pig (Philadelphia, PA)
http://www.meetup.com/DataPhilly/events/120425212/

Wednesday, June 5
Support Vector Machines in MapReduce (San Francisco, CA)
http://www.meetup.com/sfmachinelearning/events/116497192/

Wednesday, June 5
Tech Talk:- Hadoop Architecture (Bangalore, India)
http://www.meetup.com/BigDataJam/events/121580762/