Hadoop Weekly Issue #13

14 April 2013

This week's newsletter features fewer releases than normal (let me know if I missed something!) but has a lot of interesting technical articles. In addition, I'm pleased to announce the return of an events section. Thanks to the folks at Mortar Data (http://www.mortardata.com) for curating this list! They've found a number of great Hadoop-related events taking place all over the world this week.

Technical

Apache Pig provides support for expressive, SQL-like join operations. In this post, Matthew Rathbone shows how to implement a left-outer join in Pig and write a unit test to check for correctness. It's the third article in his series demonstrating the same problem across frameworks -- he previously covered MapReduce and Hive. The trifecta makes for an interesting comparison, so be sure to read all three if you missed the previous articles. (A small sketch of left-outer-join semantics follows the link below.)

http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html
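
To make the join semantics concrete, here's a minimal sketch in plain Scala (not the Pig code from the article; the relations and field names are made up for illustration). Every row from the left relation appears in the output, paired with None when the right side has no matching key:

    object LeftOuterJoinSketch {
      case class User(id: Int, name: String)          // hypothetical left relation
      case class Purchase(userId: Int, item: String)  // hypothetical right relation

      def main(args: Array[String]): Unit = {
        val users = Seq(User(1, "alice"), User(2, "bob"), User(3, "carol"))
        val purchases = Seq(Purchase(1, "book"), Purchase(1, "pen"), Purchase(3, "mug"))

        // Group the right side by the join key, then keep every left row,
        // emitting (left, None) when there is no match.
        val byUser = purchases.groupBy(_.userId)
        val joined: Seq[(User, Option[Purchase])] = users.flatMap { u =>
          byUser.get(u.id) match {
            case Some(ps) => ps.map(p => (u, Option(p)))
            case None     => Seq((u, Option.empty[Purchase]))
          }
        }

        joined.foreach(println)  // bob appears with None -- that's the "outer" part
      }
    }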

If you're reading this newsletter, you probably don't need convincing, but Ofer Mendelevitch from Hortonworks offers some compelling reasons to use Hadoop for data science. Each reason (such as "Data exploration with full datasets") is discussed in some depth, and several also cover the tools available to help a data scientist.

http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data-science/

Apache Ambari is a system for managing and configuring Hadoop and related projects such as Apache ZooKeeper. This tutorial covers configuring a 6-node test cluster on EC2 with HDFS, MapReduce, Nagios, Ganglia, HBase, ZooKeeper, Hive, and HCatalog.

http://hortonworks.com/kb/ambari-on-ec2/

Vagrant provides a command-line interface and tools to spin up and configure virtual machines with VirtualBox, VMware, or a cloud provider. This post explains how to use Vagrant to build and configure a virtualized Hadoop cluster. The recipe from the post lets one build a 6-node cluster with a single command, vagrant up.

http://blog.cloudera.com/blog/2013/04/how-to-use-vagrant-to-set-up-a-virtual-hadoop-cluster/

Krishnan Raman of Twitter presented on using Scalding (a Scala DSL for Cascading) and Algebird (Twitter's open-source abstract algebra framework) at BigData TechCon in Boston. In addition to the slides, the code and materials for the presentation have been posted to GitHub. (A short Scalding sketch follows the links below.)

https://github.com/krishnanraman/bigdata/blob/master/ProgrammingScaldingAlgebird.pdf?raw=true
https://github.com/krishnanraman/bigdata
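
To give a flavor of what a Scalding job looks like, here's roughly the canonical word-count example -- a minimal sketch assuming a 2013-era Scalding dependency on the classpath and the fields-based API, not code taken from the presentation materials:

    import com.twitter.scalding._

    // Count word occurrences in a text file: read lines, split them into
    // words, group by word, and write (word, count) pairs as TSV.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

Jobs like this are typically launched through Scalding's runner (com.twitter.scalding.Tool) in either local or Hadoop mode.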

Splout is a SQL data store that is tightly coupled with Hadoop and is suitable for serving real-time, web-scale traffic. This post is the third in a series (the first two covered Hive and Cascading), and it covers loading data into Splout from Pig.

http://www.datasalt.com/2013/04/pig-splout-sql-for-a-retail-coupon-generator-a-big-data-love-story/

HCatalog exposes Hive's metadata to other parts of the Hadoop stack (e.g. MapReduce and Pig) as well as via a REST API. This allows HCatalog to act as glue between many different components in the stack. This blog post has a great overview of HCatalog's features and benefits.

http://hortonworks.com/blog/hivehcatalog-data-geeks-big-data-glue/

Airbnb has followed up their recent post about Chronos, their workflow and scheduling software, with an overview of their big data stack. They're running Storm and Hadoop, in addition to Chronos, on a single Mesos cluster. They give some information about each and promise a follow-up post with more details.

http://nerds.airbnb.com/distributed-computing-at-airbnb

News

The Hadoop Summit selection committees have created the initial program for Hadoop Summit, taking place this June in San Jose. More sessions will be posted over the coming weeks.

http://hadoopsummit.org/san-jose/program/

The Call for Proposals for Strata + Hadoop World is open through May 16th. The conference takes place in New York in October and covers big data, data science, and pervasive computing. Proposals for 40-minute sessions as well as 3-hour tutorials are accepted on any of these topics.

http://strataconf.com/stratany2013/public/cfp/264

Releases

A few weeks ago, WibiData announced version 1.0.0 of KijiSchema, a data management system built atop Apache HBase and focused on real-time retrieval of diverse datasets. The 1.0.0 version marks a commitment to maintaining API compatibility going forward.

http://www.kiji.org/2013/04/02/announcing-kijischema-1-0-0/

Also announced a few weeks ago was the general availability of Platfora. Unlike other solutions that either provide a SQL interface or focus on BI tools, Platfora is trying to do both. A user asks a question via a web UI, and Platfora imports and caches data via MapReduce jobs in order to find the answer.

http://www.platfora.com/hadoop-ecosystem-blog/

Events

Curated by Mortar Data (http://www.mortardata.com)

Tuesday, April 16
Amazon Elastic Map Reduce - Hadoop Cloud Service (Hamilton Township, NJ)
http://www.meetup.com/nj-hadoop/

Tuesday, April 16
St. Louis Hadoop Users Group Meetup (Saint Louis, MO)
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/110159842/

Wednesday, April 17
Automating the Hadoop Stack with Chef (San Diego, CA)
http://www.meetup.com/sd-hug/events/112475312/

Thursday, April 18
A recommendation system and MapReduce (New York, NY)
http://www.meetup.com/NYC-Machine-Learning/

Thursday, April 18
Hadoop 2.0: What's coming? (Toronto)
http://www.meetup.com/TorontoHUG/events/112153292/

Thursday, April 18
Big Data in the AWS Cloud + More (Norwich, UK)
http://www.syncnorwich.com/events/110574642/?eventId=110574642&action=detail

Friday, April 19
Big Hadoop Jobs on AWS (Munich, Germany)
http://www.meetup.com/Hadoop-User-Group-Munich/events/102940592/

Friday, April 19
Hadoop At Spotify (Kraków, Poland)
http://www.meetup.com/datakrk/events/113175722/