Data Eng Weekly

Issue #2

27 January 2013

Welcome to the second edition of Hadoop Weekly. Even with the short week in the US, there's lots of Hadoop community news to share. So far, I'm really enjoying finding the best links for each week to share with this list, and I hope all 281 of you are too!

Also, please forgive the funky links. MailChimp requires that I build up a "click score" before they'll let me remove the click tracking from emails that I send out to this list.


Feb 2nd Hadoop meetup in Bangalore, India: "Informatica Business Solutions"

Opscode and Edmunds are partnering to give a webinar on managing Hadoop with Chef on January 31st at 10am PT.

March 6 Flume meetup in Palo Alto, CA.

Global Big Data Conference, Jan 28 in Santa Clara

GigaOm Structure Data, March 20-21, 2013


A new peer-reviewed journal called "Big Data" is printing its first issue in February. Of note, the journal is open-access, and authors retain copyright under the Creative Commons License.

Maybe a bit old, but here's an infographic describing usage of Hadoop in the US Federal government.

Gartner is speculating that Hadoop will be really big by 2015. I think that's a safe bet :). Read more about their report, and also a blog post claiming that Hadoop is in the "trough of disillusionment".

A blog post advocating that traditional DBAs should consider a move to being a Hadoop Administrator. This outlines some of the systems and parts that one needs to learn in order to make the move.

As we see more and more fragmentation in the Hadoop ecosystem (new filesystems, distributions, etc.), Steve argues that tests should represent specifications that implementations need to match.

The Apache Mahout project, which is a collection of machine learning algorithms implemented for Hadoop, has a new Twitter account.

Microsoft's Big Data analysis system, HDInsight, is powered by Hadoop. Learn more about the platform, which is based upon Hortonworks' HDP and SQL Server, from Val Fontama of Microsoft.

Peeks under the hood

A presentation on the evolution of livedoor's event processing infrastructure, which powers both batch (offline) and stream processing. They moved from Scribe to Fluentd for log collection, and the presentation has lots of great diagrams detailing their system architecture and other systems (e.g. HttpFS/Hoop and WebHDFS).

It's very difficult to do even the simplest of tasks with Hadoop. The recent Oozie tutorial for scheduling a recurring task is a perfect example of this. Vendors can help reduce complexity, but is the ecosystem suffering?

An interview with Doug Cutting, the founder of Hadoop, on the future of the platform.

Hadoop Conference Japan happened this past week, and a few interesting slides from talks have been posted. This is an overview of the Huahin framework, which consists of a few different parts, including a Java API, a job manager, and (notably) a manager for Elastic MapReduce clusters (eManager).

You may recall from last week's newsletter that Cloudera and Skytap announced a partnership. Cloudera has published a blog post detailing how to launch a CDH4 cluster in the Skytap cloud.

A tutorial for implementing a "people-you-may-know" MapReduce job, including example code.

A tutorial on using MapReduce on HTTP access logs.

At the Open Compute Summit, Facebook described the 5 machine types that they buy. Their Hadoop machine type has 48GB of RAM and 12x3TB SATA drives.

A guide to the various settings for controlling user logs on Hadoop.
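
For readers who haven't seen the idea before, here is a minimal sketch of how a "people-you-may-know" job can be structured. It is not the code from the linked tutorial: it simulates the map, shuffle, and reduce steps in plain Python, and the function names and the sample graph are hypothetical. The assumption is the usual mutual-friends approach, where two people who are not yet connected are ranked by how many friends they share.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(user, friends):
    """Emit candidate pairs for one user's friend list."""
    # Existing friendships are tagged with 0 so the reducer can discard them.
    for friend in friends:
        yield tuple(sorted((user, friend))), 0
    # Any two friends of `user` have `user` as a mutual friend.
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1


def reduce_phase(values):
    """Return the mutual-friend count, or None if the pair is already connected."""
    values = list(values)
    if 0 in values:
        return None
    return sum(values)


def people_you_may_know(graph):
    # Simulate the shuffle step by grouping mapper output by key.
    grouped = defaultdict(list)
    for user, friends in graph.items():
        for pair, value in map_phase(user, friends):
            grouped[pair].append(value)

    suggestions = {}
    for pair, values in grouped.items():
        count = reduce_phase(values)
        if count is not None:
            suggestions[pair] = count
    return suggestions


# Hypothetical sample data: each user maps to a list of friends.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol"],
}
suggestions = people_you_may_know(graph)
```

In this sample, alice and dave are not connected but share two friends, as do bob and carol, so those are the two suggested pairs.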
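
As a rough illustration of what such a job looks like (this is a sketch, not the code from the linked tutorial), the following counts hits per requested path. It assumes the logs are in Common Log Format, simulates the shuffle in plain Python, and uses hypothetical function names and made-up sample lines.

```python
import re
from collections import defaultdict

# Assumes Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
)


def map_line(line):
    """Emit (path, 1) for each parseable log line; skip malformed lines."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return
    _host, _method, path, _status, _size = match.groups()
    yield path, 1


def reduce_counts(path, values):
    """Sum the per-line counts for one path."""
    return path, sum(values)


def count_hits(lines):
    # Simulate the shuffle step by grouping mapper output by key.
    grouped = defaultdict(list)
    for line in lines:
        for path, value in map_line(line):
            grouped[path].append(value)
    return dict(reduce_counts(path, values) for path, values in grouped.items())


# Made-up sample input, including one malformed line.
SAMPLE = [
    '10.0.0.1 - - [01/Jan/2013:10:00:00 -0800] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [01/Jan/2013:10:00:01 -0800] "GET /about.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2013:10:00:02 -0800] "GET /index.html HTTP/1.1" 304 -',
    'not a log line',
]
hits = count_hits(SAMPLE)
```

The same mapper could emit the status code or the client host as the key instead of the path to answer different questions about the same logs.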

Running Hadoop on virtualized hardware is often a contentious issue. This topic was recently discussed at the London VMware User Group.

Oracle Solaris Zones is a virtualization system for Solaris. They've published a how-to for configuring a Hadoop cluster with zones.

DataFu is a set of UDFs (user-defined functions) for Pig. Built and open-sourced by LinkedIn, it has a number of very useful functions for a data discovery and analytics workflow. The Hortonworks blog has some examples of what you can do with DataFu.

Amazon Web Services (AWS) is a common prototype environment for Hadoop, but some folks end up running production data stacks on Hadoop in AWS. Soren Macbeth shares experience from running EMR, CDH, and M3 in AWS.

Storm has been called "real-time mapreduce" and Trident is a high-level framework for Storm (similar to Pig on MapReduce). This tutorial details processing data with Storm/Trident and storing data in Splout SQL, which claims to be "a web-latency sql spout for Hadoop".

Revolution Analytics has posted a video and slides for their "Using R with Hadoop" webinar. The presentation covers RHadoop (which allows you to interact with Hadoop from the R console) and using Hive from R via JDBC (among other things).

Another great answer on Quora by Allen Wittenauer about choosing the right number of reducers for your MapReduce job. If you're not following Allen on Quora, you probably should be.
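
For context, and separate from Allen's answer, the Hadoop MapReduce tutorial of this era offers a rule of thumb: set the number of reduces to 0.95 or 1.75 times (number of nodes * mapred.tasktracker.reduce.tasks.maximum). A minimal sketch of that arithmetic follows; the function name and the 10-node cluster are hypothetical.

```python
def suggested_reducers(nodes, reduce_slots_per_node, factor=0.95):
    """Apply the rule of thumb: factor * (nodes * reduce slots per node).

    factor=0.95 lets all reduces start as soon as the maps finish;
    factor=1.75 gives faster nodes a second wave of reduces, which
    balances load better at the cost of more framework overhead.
    """
    if nodes < 1 or reduce_slots_per_node < 1:
        raise ValueError("nodes and reduce_slots_per_node must be positive")
    total_slots = nodes * reduce_slots_per_node
    return max(1, int(total_slots * factor))


# Hypothetical cluster: 10 nodes with 2 reduce slots each.
single_wave = suggested_reducers(10, 2, factor=0.95)
two_waves = suggested_reducers(10, 2, factor=1.75)
```

Treat the result as a starting point only; the right value also depends on data skew and on how much output each reducer has to write.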

Sentric did an interesting pilot program where they collected data from Wi-Fi access points and then did analysis in Hive and Impala to answer questions about people visiting retail stores (e.g. how many unique visitors). This is just part 1, and I look forward to more technical details in the follow-up to this post.

Evernote moved from a MySQL-based warehouse to a hybrid Hadoop-ParAccel (a columnar SQL storage engine) based solution. They hook up JasperReports to ParAccel for reporting. This post contains details including info on their machine configurations.

Apache Whirr is a great way to quickly launch a cluster in the cloud. This blog post details bringing up a CDH4 cluster with Ganglia for monitoring.


DataStax is beefing up Cassandra for the enterprise in their latest release, DataStax Enterprise Edition 3.0. This version adds a number of encryption and authentication improvements, such as object-level permissions and client-to-node encryption.

The Axemblr team was busy this week releasing both version 0.2.0 and 0.2.1 of their cloud provisioner. New features in this release include: "support for custom repositories, pre-configured pools (jenkins and Cloudera CDH3 including Cloudera Manager), improved test suite and reduced code duplication."

Hortonworks has released their "Sandbox", which is a virtual machine image for running HDP, their Hadoop stack. Of note, it includes HCatalog and supports both VMware and VirtualBox.

Whoops is a Python library for WebHDFS. Version 0.1 was just released.

Factual has open-sourced their data workflow management software, Drake (which is akin to 'make' for data). Drake has support for HDFS and is written in Clojure.

Klout has open-sourced "Brickhouse", their collection of Hive UDFs. The UDFs include better JSON support, cardinality estimation, bloom filters/counters, and a UDF for bulk puts to HBase. Brickhouse requires Hive 0.9 or later.

LinkedIn has pushed a major new version of Azkaban, its scheduling and workflow management system, to GitHub. The new version uses MySQL as a data store (vs. the filesystem), and includes user authentication/authorization and many other features.

Attunity has released their Managed File Transfer (MFT) for Hadoop. The enterprise software is for moving bits in and out of HDFS.

Industry News

Joyent, the cloud provider, has introduced Hadoop support. Word is that they're using Chef for orchestration.

Dataguise and MapR have announced a partnership.

Cloudera folks give lots of tech talks about their stack. They've recently compiled a list of all the talks in Q1 2013. The list is global, so there's sure to be a talk nearby.

MapR has claimed a lot of big customers (10 in the Fortune 100) including several large installs.

Software Developer's Journal is releasing a special issue featuring Hadoop on Jan 30th. Presales are available now.

It sounds like Hortonworks is betting on Hive, and they're planning a Hive-YARN integration in 2013. They have a lot of other interesting items on their 2013 roadmap like Ambari, Snapshotting, and HBase.