05 May 2013
There were two big and exciting releases this week from Hadoop vendors -- Cloudera with Impala and MapR with M7. In addition, this week marks the 500th subscriber to Hadoop Weekly! Thanks everyone for subscribing, and please send anything my way that you think might make a good addition to this newsletter.
In the third and final part of his "Introduction to Hadoop" series, Tom White covers higher-level frameworks, anatomy of a Hadoop cluster, and data application pipelines. In terms of frameworks, he covers Pig, Hive, and Crunch (there's a nice example of computing top-K with Crunch).
http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/
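For readers who have not seen the top-K pattern before, here is a minimal sketch of the idea in plain Python. It does not use the Crunch API and is not taken from the article; the function name `top_k` and the sample data are made up for illustration.

```python
import heapq
from collections import Counter


def top_k(items, k):
    """Return the k most frequent items as (item, count) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(items)  # the "count" step, analogous to a reduce by key
    # Pick the k entries with the highest count without sorting everything.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    words = ["pig", "hive", "pig", "crunch", "hive", "pig"]
    print(top_k(words, 2))
```

In a real pipeline the counting would run in parallel across the cluster, and only the small per-key counts would need to be merged before picking the top entries.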
HBase uses Hadoop's metrics framework to expose metrics via JMX, HTTP, etc. This post covers the migration from the old metrics framework ("metrics1") to the new framework ("metrics2"). Like most things that make use of a Hadoop API, there are extra complications because the hadoop-1 and hadoop-2 branches have diverged significantly.
https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics
Cloudera Impala (low-latency SQL on HDFS) hit GA this week (more below), and some of the first benchmarks are starting to come out. Unsurprisingly, this presentation shows that Snappy-compressed Parquet files performed the best on the 11-node test cluster, with average performance 12x that of Hive.
http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-ga
An interesting presentation about some real-world experience using Hadoop and Vertica for offline analytics at AdTech. They have some useful information about using Avro with Flume and Pig and the interaction between Hadoop and Vertica.
http://prezi.com/6r0rsx6hluu1/mapreduce-in-action-large-scale-reporting-based-on-hadoop-and-vertica/
Andy Feng and Robert Evans recently presented on using Storm (a realtime distributed computation system) at Yahoo. In particular, their presentation covers colocating Storm and Hadoop on the same cluster using YARN.
http://www.slideshare.net/ydn/april-2013-hug
A case study of a Hive query that started out taking over 2 hours but was optimized to take under 10 minutes. They used a bunch of tricks to speed it up, including enabling map-side joins and optimizing the number of reducers.
http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/
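As a rough illustration of why a map-side join helps, here is a minimal Python sketch of the general technique: the small table is loaded into an in-memory lookup, and the large table is streamed past it, so no shuffle to reducers is needed. This is not Hive's implementation and not the query from the post; the function name, field names, and sample rows are hypothetical.

```python
def map_side_join(large_rows, small_rows, key):
    """Join two lists of dicts on `key`, holding only small_rows in memory."""
    # Build the in-memory hash table from the small side once.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the large side; each row is joined locally, as a mapper would.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


if __name__ == "__main__":
    users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
    clicks = [
        {"user_id": 1, "url": "/a"},
        {"user_id": 3, "url": "/b"},
        {"user_id": 1, "url": "/c"},
    ]
    for joined in map_side_join(clicks, users, "user_id"):
        print(joined)
```

The trade-off is that the small side has to fit in each mapper's memory, which is why this only pays off when one input is much smaller than the other.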
Ambari is an Apache Incubator project for provisioning, configuring, and monitoring Hadoop (and related services) clusters. Hortonworks posted a walkthrough for setting up your first Ambari cluster. Tools like this (another option is Cloudera Manager) are a must for managing a Hadoop cluster.
http://hortonworks.com/kb/get-started-setting-up-ambari/
Cloudera announced Impala 1.0 GA this week. It supports many of the same commands and datasets as Hive, but with much lower latency. Cloudera is reporting performance gains anywhere from 6x to 68x vs. Hive, and also better-than-linear scaling for multi-tenant queries. Given that Cloudera announced Impala less than a year ago, the fact that Impala has already reached version 1.0 is quite impressive and exciting (even if not all the features are there) -- especially so since Impala is open-source.
MapR announced the MapR M7 Edition. This version touts five-nines (99.999%) availability and other features targeting the NoSQL market. Main highlights include "no manual administrative tasks such as table merges or splits," data snapshots, and data mirroring.
http://www.mapr.com/products/mapr-editions/m7-edition
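To put the availability claim in perspective, here is a small back-of-the-envelope calculation. It assumes the figure means five nines (99.999%) measured over a 365-day year; the function name is made up for illustration and nothing here comes from MapR's materials.

```python
def allowed_downtime_minutes(nines, days=365):
    """Minutes of downtime per period permitted by an N-nines availability target."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return unavailability * days * 24 * 60


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year.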
SAS, the business analytics software company, has partnered with Cloudera to integrate SAS software with Hadoop.
http://www.marketwire.com/press-release/cloudera-announces-strategic-alliance-with-sas-1783697.htm
MapR and LucidWorks announced a partnership to include LucidWorks Search with MapR's distributions. LucidWorks Search provides a number of advanced features for Lucene/Solr. It's a little unclear what the technical details of this integration look like, but it sounds like they are providing indexing of data stored in MapR's FileSystem.
Hadoop Summit is coming up in less than 2 months in San Jose. The Hadoop Summit organizers, Hortonworks, have unveiled the full schedule for the conference.
http://hortonworks.com/blog/hadoop-summit-schedule-is-now-available/
While many business analytics solutions focus on structured data accessible via SQL (over JDBC or ODBC), Precog is focusing on providing the same capabilities on unstructured data sitting in HDFS or elsewhere. They announced this week that their product is leaving beta.
http://gigaom.com/2013/05/01/precog-launches-with-a-plan-to-simplify-analytics-on-unstructured-data/
Cloudera Development Kit 0.2.0 was released with experimental support for the new Parquet columnar file format as well as integration with Hive/HCatalog for metadata management.
https://groups.google.com/a/cloudera.org/d/msg/cdh-user/Q9teffdaJzY/rmtMpXoBMA4J
Amazon Elastic MapReduce now has support for S3's server-side encryption. This includes S3DistCp, the EMR-optimized distcp implementation.
Curated by Mortar Data (http://www.mortardata.com)
Monday, May 6 Deep Dive into Cloudera Impala! (New York, NY) http://www.meetup.com/Hadoop-NYC/
Tuesday, May 7 NoSql & Hadoop with Couchbase server (Tel Aviv, Israel) http://www.meetup.com/Big-Data-Israel/events/115467802/
Wednesday, May 8 San Francisco Hadoop Meetup (San Francisco, CA) http://www.meetup.com/hadoopsf/events/114514922/
Thursday, May 9 Hadoop At Spotify (Warsaw, Poland) http://www.meetup.com/warsaw-hug/events/111492272/
Thursday, May 9 Analyzing Twitter: An End-to-End Data Pipeline (Baltimore, MD) http://www.meetup.com/Data-Science-MD/events/111081282/