Data Eng Weekly

Hadoop Weekly Issue #56

09 February 2014

StrataConf is this week in Santa Clara, and there were a lot of announcements in anticipation. Among the most notable, Cloudera announced new packaging for their enterprise software and support for Apache Spark 0.9.0. Spark was a hot topic this week: version 0.9.0 was released, and it was covered in an interview with Doug Cutting and on the Monash Research blog. We also get a peek inside several data pipelines this week with a post covering big data at Stripe, Tapad, Etsy, and Square as well as details on the Hootsuite log pipeline. Also don’t miss Ramya Sunil’s post on women in the Hadoop open-source ecosystem.


The Hortonworks blog has a walkthrough of building an HDP cluster with Amazon Web Services’ EC2. The tutorial starts with creating a custom AMI, describes installing password-less private keys (something you likely don’t want to do for a production system), starting the Apache Ambari server, and using Ambari to provision the cluster. The tutorial is loaded with screenshots to help you on your way.

Hootsuite is a social media service that connects with a large number of social networks. This post explores their log streaming system, which is used very heavily (“Hootsuite logs everything”). There are some interesting ideas in their system: they use only one log level, and all system logs are stored in ElasticSearch for two weeks for interactive debugging. BI logging is forwarded to Hadoop to answer BI questions, and user events are sent to both systems.
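To make the routing concrete, here is a minimal Python sketch of the fan-out described above: system logs go to ElasticSearch, BI logs go to Hadoop, and user events go to both. The category names, the `ROUTES` table, and the `dispatch` helper are all assumptions for illustration; the post doesn't say how Hootsuite tags or ships its events.

```python
from collections import defaultdict

# Hypothetical category and sink names; only the routing rule itself
# (system -> ElasticSearch, BI -> Hadoop, user events -> both) is from the post.
ROUTES = {
    "system": ["elasticsearch"],
    "bi": ["hadoop"],
    "user_event": ["elasticsearch", "hadoop"],
}


def route(event):
    """Return the list of sink names for one event, based on its category."""
    category = event.get("category")
    if category not in ROUTES:
        raise ValueError(f"unknown log category: {category!r}")
    return ROUTES[category]


def dispatch(events):
    """Group events by destination sink, preserving arrival order."""
    sinks = defaultdict(list)
    for event in events:
        for sink in route(event):
            sinks[sink].append(event)
    return dict(sinks)


if __name__ == "__main__":
    batch = [
        {"category": "system", "msg": "cache miss"},
        {"category": "bi", "msg": "plan upgraded"},
        {"category": "user_event", "msg": "post scheduled"},
    ]
    for sink, routed in dispatch(batch).items():
        print(sink, [e["msg"] for e in routed])
```

A real pipeline would hand each group to a shipper rather than collect it in memory; the point here is only that the destination is decided by event type, not by log level.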

MongoHQ, which offers MongoDB as a Service, stirred up the Hadoop community this week with a blog post entitled “You Don’t Have Big Data.” In response to this article, Pete Soderling of Hakka Labs assembled details on real-world big data pipelines at Stripe, Tapad, Etsy, and Square. Each company’s story includes interesting details: deeply nested data at Stripe, TBs per hour at Tapad, Hadoop powering Vertica at Etsy, and a large-scale reconciliation system at Square. The technical details demonstrate that some folks certainly do have big data.

The Cloudera blog has a tutorial on Apache Giraph, the graph-processing framework for Hadoop. The post details how a Giraph job runs inside of MapReduce (in the future it will use YARN), including the Bulk Synchronous Parallel model that Giraph builds on. The tutorial then goes through the steps required to set up a test deployment using the Cloudera QuickStart VM. Finally, it shows how to run Giraph’s built-in PageRank and Shortest Path implementations.
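For readers new to the vertex-centric model, the following is a rough single-process Python sketch of PageRank written in a bulk synchronous style: in each superstep every vertex reads the messages sent to it in the previous superstep, updates its value, and sends new messages along its out-edges, with a barrier between supersteps. This is not Giraph's API or its implementation; the function name, the dictionary-based graph format, and the default parameters are assumptions made for illustration.

```python
def pagerank_bsp(graph, supersteps=30, damping=0.85):
    """Run PageRank as a sequence of synchronous supersteps.

    graph maps each vertex to a list of its out-neighbors; every vertex
    must appear as a key. Vertices with no out-edges send no messages,
    so their rank mass is dropped (a simplification in this sketch).
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    inbox = {v: [] for v in graph}

    for step in range(supersteps):
        outbox = {v: [] for v in graph}
        for v, neighbors in graph.items():
            # From the second superstep on, fold in messages from the last one.
            if step > 0:
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            # Send an equal share of this vertex's rank along each out-edge.
            if neighbors:
                share = rank[v] / len(neighbors)
                for u in neighbors:
                    outbox[u].append(share)
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
    return rank


if __name__ == "__main__":
    ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(pagerank_bsp(ring))
```

In a real Giraph job the vertices are partitioned across workers and the barrier is coordinated by the framework; the loop above only mimics that ordering on one machine.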

I’m a big fan of Vagrant, a system for automating local VMs, for creating a local dev setup. In this tutorial, you’ll learn how to set up a Vagrant VM with Hadoop installed via Apache Ambari. The tutorial walks through the vagrant commands necessary to boot a local VM and the subsequent setup inside of the VM to get Ambari running.

News has a post by Ramya Sunil of Hortonworks as part of their Women in Open Source Week. Ramya has compiled data on the ratio of women committers in various Hadoop ecosystem projects. As you may have guessed, there are very few women committers and PMC members. With that said, Ramya explains that in her experience the community has been inclusive and supportive. The post wraps up with some tips for getting started with contributing to open source.

Computing has an interview with Hadoop co-founder and Cloudera chief architect Doug Cutting about the rise and future of Hadoop. Cutting talks about his surprise at the success of Hadoop, the advantages of open-source communities, YARN, Impala, and Spark. Doug praises the engineering of YARN, but he notes that Hadoop, via Impala and Spark, has been used to power diverse workloads before YARN hit GA. It’s interesting to hear Doug’s thoughts on the diversification of Hadoop workflows.

Apache Spark (incubating) has gotten a lot of press recently with Cloudera throwing their weight behind it. Curt Monash recently spoke with Cloudera and Databricks (the company commercializing Spark) about Spark and its role in the Hadoop ecosystem. As always, the post is practical and insightful, covering some of the confusion around Spark/Databricks, Spark for machine learning and data transformation, Spark Streaming, and much more.

0xdata and Hortonworks have announced that 0xdata’s flagship project, H2O, is joining the Hortonworks Partner Program. H2O is an in-memory machine learning and predictive analytics platform built on Hadoop. The release suggests that H2O will take advantage of YARN on HDP2.

Intel announced that they’ve partnered with Nutanix to bring support for the Nutanix virtualized storage platform to Intel’s distribution of Hadoop (IDH). Of note, Nutanix provides disaster recovery via snapshot replication across data centers.

Cloudera has announced a reshuffling of its product packaging around the Enterprise Data Hub strategy. The new approach simplifies the previous piecemeal offering, in which customers bought support for Hadoop and paid extra for add-on services (like HBase and Impala). The three Cloudera Enterprise options are the Basic Edition (HDFS and MapReduce), the Flex Edition (Basic + one add-on), and the Data Hub Edition (which includes HBase, Impala, Spark, and Search).

Curt Monash has another great post on various forms of SQL and Hadoop integrations. He describes the distinction between a Hadoop “connector” and SQL-on-Hadoop engines, and then goes on to describe and classify several projects/products including: Apache Hive, Cloudera Impala, the Stinger Initiative, Teradata SQL-H, Microsoft Polybase, Hadapt, and Splice Machine.

Trifacta has announced their first product, the Trifacta Transformation Platform. The product aims to reduce the amount of time that folks spend in “data wrangling” when trying to build insights from large datasets stored in Hadoop. The platform combines a spreadsheet-like UI with machine learning to build structure from a raw dataset. It reminds me of OpenRefine (the ex-Google Refine project).

The Barron’s blog has coverage of the Teradata Q4 report and conference call. In short, Teradata doesn’t seem to be losing business because folks are moving to Hadoop. With that said, one third of their top-50 customers have a production Hadoop cluster, and the other two-thirds are evaluating a deployment.

BeyeNetwork has a post that contrasts the business strategies of Hortonworks and Cloudera. The post has interesting observations about the two companies’ plans for innovation, partnerships, and open source. It also contains two one-liners (which oversimplify a lot of things but are still interesting): “Cloudera offers revolution, Hortonworks Evolution” and “Hortonworks partners, Cloudera competes.”

GigaOm has a post on how telcos are using Hadoop to improve their services. There are use cases from four companies: one that used Hadoop to analyze usage to plan infrastructure buildout, a second that analyzed records to find and fix areas of poor cell coverage, a third that used Hadoop to target network maintenance, and a fourth that used it to detect bandwidth hogs who had to be throttled.


Apache Spark (incubating) version 0.9.0 was released this week. The new release moves the codebase to Scala 2.10, includes a new graph processing framework, moves Spark Streaming out of alpha, and simplifies High Availability. The release notes contain a detailed overview of the changes and links to download pre-built packages for Hadoop 1, Hadoop 2, CDH3/4/5, and HDP 2.

Driven is a new service from Concurrent, the company behind Cascading. Driven aims to give Hadoop developers and operators better insights into performance bottlenecks and errors in Cascading workflows. The product is a web service that monitors Cascading (and derivatives like Scalding) application performance, errors, historical runs, and more. It also has visualization features to inspect the workflow graph of a Cascading job.

IBM has released a new version of their Hadoop product, InfoSphere BigInsights v2.1.1. The new release of BigInsights is packaged into both Standard and Enterprise editions. The Standard Edition includes Big SQL (SQL-on-Hadoop), BigSheets (web-based analytics and visualization), Eclipse-based development tools, and a management console. The Enterprise Edition includes those features as well as additional ones such as support for the General Parallel File System (GPFS) File Placement Optimizer and Adaptive MapReduce.

CDH 4.4+ has gained support for Apache Spark 0.9.0. The integration uses Cloudera Manager to distribute binaries, but the user must otherwise start Spark themselves. Cloudera plans to bring full support for Spark in a future release, including running Spark on YARN. The Cloudera vision blog has more details on the integration and how it fits into the Cloudera enterprise data hub.

Version 0.11.0 of the Cloudera Kite SDK includes a views API, updates to the morphlines API, and several bug fixes and improvements. The views API looks particularly interesting for working with HBase.

Apache Cassandra announced maintenance releases of Cassandra 1.2 and 2.0. Version 1.2.15 has two bug fixes, and version 2.0.5 has several bug fixes and improvements.

Apache Mahout 0.9 was released. The new version resolves over 100 issues. New features include an implementation of Multilayer Perceptron and a new Scala DSL for linear algebra.

The Scala MapReduce framework Scoobi released version 0.8.0 this week. The new release includes support for Hadoop 2 (while also retaining compatibility with CDH4), adds support for counters, and adds support for partitioned datasets.


Curated by Mortar Data


Washington State

Winter 2014 Seminar Series: Big Data Infrastructure (Tacoma, WA) - Wednesday, February 12


HBase Meetup @ Continuuity (San Francisco) - Monday, February 10

Apache Lucene: Then and Now (San Francisco) - Monday, February 10

BigDataCamp 2014 (Before StrataConf) (San Jose) - Monday, February 10


Hadoop, the Data Lake, and a New World of Analytics (Boulder) - Thursday, February 13


Houston Hadoop Meetup Series (Houston) - Wednesday, February 12

Advanced Hadoop Based Machine Learning (Austin)


Intro to Apache Hadoop (Saint Louis) - Thursday, February 13


Parkour: Hadoop MapReduce in idiomatic Clojure (Atlanta) - Tuesday, February 11

New York

Data Governance, Compliance and Security in Hadoop - with Cloudera (New York) - Monday, February 10

Lucene/Solr: The Default Search Engine for Hadoop (New York) - Wednesday, February 12

Establishing Data Infrastructure And Hadoop Within A Large Existing Ecosystem - Wednesday, February 12


On the move with Big Data - SSIS, Pig and Sqoop (Ottawa, ON) - Thursday, February 13


February Hadoop Meetup: Hadoop-as-a-Service & Zookeeper (London) - Tuesday, February 11


Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem (Budapest) - Wednesday, February 12


Hadoop 2: Taking Hadoop beyond MapReduce (Karlsruhe) - Thursday, February 13