Data Eng Weekly


Issue #2

27 January 2013

Welcome to the second edition of Hadoop Weekly. Even with the short week in the US, there's lots of Hadoop community news to share. So far, I'm really enjoying finding the best links for each week to share with this list, and I hope all 281 of you are too!

Also, please forgive the funky links. MailChimp requires that I build up a "click score" before they'll let me remove the click tracking from emails that I send out to this list.

Events

Feb 2nd: Hadoop meetup in Bangalore, India ("Informatica Business Solutions").

http://www.meetup.com/Bangalore-Hadoop-Meetups/events/98883532/

Opscode and Edmunds are partnering to give a webinar on managing Hadoop with Chef on January 31st at 10am PT.

http://www.opscode.com/blog/2013/01/23/upcoming-webinar-with-edmunds-com-managing-a-hadoop-cluster-with-opscode-chef/

March 6: Flume meetup in Palo Alto, CA.

http://www.meetup.com/Flume-User-Meetup/events/101073022/

Global Big Data Conference, Jan 28 in Santa Clara.

http://globalbigdataconference.com/registration.php

GigaOm Structure Data, March 20-21, 2013.

http://event.gigaom.com/structuredata/

News

A new peer-reviewed journal called "Big Data" is printing its first issue in February. Of note, the journal is open-access, and authors retain copyright under a Creative Commons license.

http://eddology.com/post/41253882293/big-data-journal

Maybe a bit old, but here's an infographic describing usage of Hadoop in the US federal government.

http://www.swishdata.com/index.php/blog/article/hadoop-use-cases-big-data-for-the-government

Gartner predicts that Hadoop will be in most advanced analytics products by 2015. I think that's a safe bet :). Read more in Computerworld's coverage of the report, plus a Gartner blog post arguing that big data is falling into the "trough of disillusionment".

http://www.computerworld.com/s/article/9236173/Hadoop_will_be_in_most_advanced_analytics_products_by_2015_Gartner_says http://blogs.gartner.com/svetlana-sicular/big-data-is-falling-into-the-trough-of-disillusionment/

A blog post advocating that traditional DBAs consider a move to Hadoop administration. It outlines the systems and skills one needs to learn in order to make the move.

http://www.pythian.com/blog/hadoop-faq-but-what-about-the-dbas/

As we see more and more fragmentation in the Hadoop ecosystem (new filesystems, distributions, etc.), Steve Loughran argues that tests should serve as specifications that implementations need to match.

http://steveloughran.blogspot.com/2013/01/spec-vs-test.html

The Apache Mahout project, a collection of machine learning algorithms implemented on Hadoop, has a new Twitter account.

https://twitter.com/ApacheMahout

Microsoft's big data analysis system, HDInsight, is powered by Hadoop. Learn more about the platform, which is based on Hortonworks' HDP and SQL Server, from Val Fontama of Microsoft.

http://www.infoq.com/news/2013/01/big-data-microsoft

Peeks under the hood

Presentation on the evolution of livedoor's event processing infrastructure, which powers both batch (offline) and stream processing. They moved from Scribe to Fluentd for log collection, and the presentation has lots of great diagrams detailing their system architecture and related systems (e.g. HttpFs/Hoop and WebHDFS).

http://www.slideshare.net/tagomoris/log-analysis-with-hadoop-in-livedoor-2013

It's very difficult to do even the simplest of tasks with Hadoop, and a recent Oozie tutorial for scheduling a recurring job is a perfect example of this. Vendors can help reduce the complexity, but is the ecosystem suffering?

http://nosql.mypopescu.com/post/41098465274/using-an-apache-oozie-tutorial-as-an-excuse-to-talk

An interview with Doug Cutting, the founder of Hadoop, on the future of the platform.

http://www.kognitio.com/data-analytics-news/business-intelligence-and-analytics/hadoop-founder-predicts-further-improvements/801525936

Hadoop Conference Japan took place this past week, and slides from several interesting talks have been posted. This one is an overview of the Huahin framework, which includes several parts: a Java API, a job manager, and (notably) a manager for Elastic MapReduce clusters (eManager).

http://www.slideshare.net/ryukobayashi/huahin-framework-for-hadoop-hadoop-conference-japan-2013-winter http://huahinframework.org/

You may recall from last week's newsletter that Cloudera and Skytap announced a partnership. Cloudera has published a blog post detailing how to launch a CDH4 cluster in the Skytap cloud.

http://blog.cloudera.com/blog/2013/01/how-to-deploy-a-cdh-cluster-in-the-skytap-cloud/

A tutorial for implementing a "people-you-may-know" MapReduce job, including example code.

https://www.dbtsai.com/blog/hadoop-mr-to-implement-people-you-might-know-friendship-recommendation/
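
To give a flavor of the approach, here's a minimal sketch in Python via Hadoop Streaming (my own sketch, not the tutorial's code; the input format is a guess): the mapper emits every pair of a user's friends as a recommendation candidate, and the reducer counts common friends while filtering out pairs who are already connected.

    #!/usr/bin/env python
    # mapper.py -- expects lines like "user<TAB>friend1,friend2,..."
    # (hypothetical input format)
    import sys
    from itertools import combinations

    for line in sys.stdin:
        user, _, rest = line.strip().partition("\t")
        friends = [f for f in rest.split(",") if f]
        for f in friends:
            # mark existing friendships so the reducer can skip them
            print("%s\t!" % "|".join(sorted((user, f))))
        for a, b in combinations(friends, 2):
            # a and b have 'user' as a common friend
            print("%s\t1" % "|".join(sorted((a, b))))

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by key (the candidate pair)
    import sys

    def flush(pair, n, already_friends):
        if pair and n and not already_friends:
            print("%s\t%d" % (pair, n))

    pair, n, already = None, 0, False
    for line in sys.stdin:
        key, _, val = line.strip().partition("\t")
        if key != pair:
            flush(pair, n, already)
            pair, n, already = key, 0, False
        if val == "!":
            already = True
        else:
            n += 1
    flush(pair, n, already)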

A tutorial on using MapReduce to analyze HTTP access logs.

http://www.informit.com/articles/article.aspx?p=2017061
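
For example (again a minimal sketch of my own, not the article's code), a Hadoop Streaming mapper that counts requests per HTTP status code, paired with Streaming's built-in aggregate reducer:

    #!/usr/bin/env python
    # mapper.py -- count requests per HTTP status code; assumes common
    # log format, where the status code is the second-to-last field
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 2 and parts[-2].isdigit():
            # the LongValueSum: prefix tells the built-in "aggregate"
            # reducer to sum the values per key
            print("LongValueSum:%s\t1" % parts[-2])

    # run with: hadoop jar hadoop-streaming.jar -input logs/ -output counts/ \
    #   -mapper mapper.py -file mapper.py -reducer aggregate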

At the Open Compute Summit, Facebook described the five standard machine types that they buy. Their Hadoop configuration has 48GB of RAM and 12x3TB SATA drives.

http://www.greenm3.com/gdcblog/2013/1/23/facebooks-5-standard-servers-web-db-hadoop-haystack-feed.html

A guide to the various settings for controlling user logs in Hadoop.

http://architects.dzone.com/articles/controlling-user-logging

Running Hadoop on virtualized hardware is often a contentious issue. This topic was recently discussed at the London VMware User Group.

http://stu.radnidge.com/post/41377430278

Oracle Solaris Zones is a virtualization system for Solaris. Oracle has published a how-to for configuring a Hadoop cluster with zones.

http://www.oracle.com/technetwork/articles/servers-storage-admin/howto-setup-hadoop-zones-1899993.html

DataFu is a set of UDFs (user-defined functions) for Pig. Built and open sourced by LinkedIn, it has a number of very useful functions for data discovery and analytics workflows. The Hortonworks blog has some examples of what you can do with DataFu.

http://hortonworks.com/blog/datafu/
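
If you haven't used DataFu, its Quantile UDF is a good example of what's in there: handed a sorted bag, it picks out quantile values. For intuition, a toy Python equivalent of the idea (not DataFu's actual implementation):

    # nearest-rank quantile over a sorted list; DataFu's exact
    # interpolation semantics may differ
    def quantile(sorted_vals, p):
        i = int(round(p * (len(sorted_vals) - 1)))
        return sorted_vals[i]

    print(quantile([1, 3, 7, 9, 20], 0.5))   # 7, the median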

Amazon Web Services (AWS) is a common prototyping environment for Hadoop, but some folks end up running production data stacks on Hadoop in AWS. Soren Macbeth shares his experience running EMR, CDH, and M3 in AWS.

https://speakerdeck.com/sorenmacbeth/surviving-hadoop-on-aws-in-production

Storm has been called "real-time MapReduce," and Trident is a high-level framework for Storm (similar to Pig for MapReduce). This tutorial details processing data with Storm/Trident and storing it in Splout SQL, which bills itself as "a web-latency sql spout for Hadoop".

http://www.datasalt.com/2013/01/an-example-lambda-architecture-using-trident-hadoop-and-splout-sql/

Revolution Analytics has posted the video and slides from their "Using R with Hadoop" webinar. The presentation covers RHadoop (which lets you interact with Hadoop from the R console) and querying Hive from R via JDBC, among other things.

http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/

Another great answer on Quora by Allen Wittenauer about choosing the right number of reducers for your MapReduce job. If you're not following Allen on Quora, you probably should be.

http://www.quora.com/Apache-Hadoop/What-are-good-ways-to-decide-number-of-reducers/answer/Allen-Wittenauer
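
The rule of thumb from the Hadoop docs is the usual starting point: roughly 0.95 or 1.75 times (nodes x mapred.tasktracker.reduce.tasks.maximum). A quick Python helper to do the arithmetic (the cluster numbers below are made up):

    # 0.95 lets all reduces finish in a single wave; 1.75 gives
    # faster nodes a second wave and balances load better
    def suggested_reducers(nodes, reduce_slots_per_node, factor=0.95):
        return max(1, int(factor * nodes * reduce_slots_per_node))

    print(suggested_reducers(20, 4))          # 76
    print(suggested_reducers(20, 4, 1.75))    # 140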

Sentric ran an interesting pilot program in which they collected data from wifi access points and then analyzed it with Hive and Impala to answer questions about people visiting retail stores (e.g. how many unique visitors). This is just part 1; I look forward to more technical details in the follow-up post.

http://www.sentric.ch/blog/case-study-retail-wifi-log-file-analysis-with-hadoop-and-impala-part-1

Evernote moved from a MySQL-based warehouse to a hybrid solution based on Hadoop and ParAccel (a columnar SQL storage engine). They hook JasperReports up to ParAccel for reporting. The post includes details on their machine configurations.

http://www.javaworld.com/javaworld/jw-01-2013/130124-evernote-s-big-data-investment-in-hadoop-and-paraccel.html

Apache Whirr is a great way to quickly launch a cluster in the cloud. This blog post details bringing up a CDH4 cluster with Ganglia for monitoring.

http://femgeekz.quora.com/Hadoop-Hangover-How-to-launch-CDH4-cluster-Ganglia-with-Apache-Whirr?srid=56EV&st=ns

Releases

DataStax is beefing up Cassandra for the enterprise in its latest release, DataStax Enterprise Edition 3.0. This version adds a number of encryption and authentication improvements, such as object-level permissions and client-to-node encryption.

http://www.theregister.co.uk/2013/01/21/datastaxcassandrahadoop_solr/

The Axemblr team was busy this week, releasing versions 0.2.0 and 0.2.1 of their cloud provisioner. New features include "support for custom repositories, pre-configured pools (jenkins and Cloudera CDH3 including Cloudera Manager), improved test suite and reduced code duplication."

https://github.com/axemblr/axemblr-provisionr/wiki/Release-0.2.1

Hortonworks has released its "Sandbox", a virtual machine image for running HDP, their Hadoop stack. Of note, it includes support for HCatalog and is available for both VMware and VirtualBox.

http://www.cio.com/article/727125/HortonworksReleasesaHadoopSandbox?taxonomyId=3000

Whoops is a Python library for WebHDFS. Version 0.1 was just released.

http://pythonwise.blogspot.com/2013/01/whoops-webhdfs-library-and-client.html
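
I haven't tried Whoops yet, so no claims about its API, but the WebHDFS REST interface it wraps is easy to poke at directly. For instance, listing a directory (namenode host/port and path are placeholders):

    # raw WebHDFS call -- the kind of thing a library like Whoops wraps
    import json
    import urllib2

    url = "http://namenode:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"
    listing = json.load(urllib2.urlopen(url))
    for status in listing["FileStatuses"]["FileStatus"]:
        print("%s\t%s" % (status["type"], status["pathSuffix"]))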

Factual has open sourced its data workflow management software, Drake (which is akin to 'make' for data). Drake supports HDFS and is written in Clojure.

http://blog.factual.com/introducing-drake-a-kind-of-make-for-data https://github.com/Factual/drake

Klout has open sourced "Brickhouse", their collection of Hive UDFs. The UDFs include better JSON support, cardinality estimation, bloom filters/counters, and a UDF for bulk puts to HBase. Brickhouse requires Hive 0.9 or later.

http://engineering.klout.com/2013/01/introducing-brickhouse-major-open-source-release-from-klout/ https://github.com/klout/brickhouse
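
If bloom filters are unfamiliar, the core idea fits in a few lines of Python. This toy version (nothing to do with Brickhouse's actual implementation) sets k hashed bit positions per item; membership tests can yield false positives but never false negatives:

    import hashlib

    class ToyBloomFilter(object):
        def __init__(self, num_bits=1024, num_hashes=3):
            self.m, self.k = num_bits, num_hashes
            self.bits = [False] * num_bits

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.md5("%d:%s" % (i, item)).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p] = True

        def __contains__(self, item):
            return all(self.bits[p] for p in self._positions(item))

    bf = ToyBloomFilter()
    bf.add("user123")
    print("user123" in bf)   # True
    print("user999" in bf)   # False (with high probability)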

LinkedIn has pushed a major new version of Azkaban, its scheduling and workflow management system, to GitHub. The new version uses MySQL as a data store (vs. the filesystem) and adds user authentication/authorization, among many other features.

https://groups.google.com/d/msg/azkaban-dev/Mc1_gpsQ7uw/a97Cm-iKnVMJ

Attunity has released their Managed File Transfer (MFT) for Hadoop. The enterprise software is for moving bits in and out of HDFS.

http://www.sqlmag.com/article/business-intelligence/attunity-hadoop-big-data-145135

Industry News

Joyent, the cloud provider, has introduced Hadoop support. Word is that they're using Chef for orchestration.

http://gigaom.com/2013/01/23/joyent-offers-up-its-take-on-hadoop-as-a-service/

Dataguise and MapR have announced a partnership.

http://finance.yahoo.com/news/dataguise-certified-mapr-distribution-apache-130000485.html

Cloudera folks give lots of tech talks about their stack. They've recently compiled a list of all the talks in Q1 2013. The list is global, so there's sure to be a talk nearby.

http://blog.cloudera.com/blog/2013/01/where-to-find-cloudera-tech-talks-in-early-2013/

MapR claims a lot of big customers (10 in the Fortune 100), including several large installs.

http://siliconangle.com/blog/2013/01/24/maprs-aggressive-partner-strategy-really-paid-off-in-big-data/

Software Developer's Journal is releasing a special issue featuring Hadoop on Jan 30th. Presales are available now.

http://sdjournal.org/apache-hadoop-ecosystem/?aaid=bartoszmiedeksza&abid=45f0d439#!prettyPhoto

It sounds like Hortonworks is betting on Hive, and they're planning a Hive-YARN integration in 2013. They have a lot of other interesting items on their 2013 roadmap, like Ambari, snapshots, and HBase.

http://hortonworks.com/blog/the-road-ahead-for-hortonworks-and-hadoop/