Data Eng Weekly

Hadoop Weekly Issue #63

30 March 2014

Hortonworks announced a new round of funding this week, and Intel and Cloudera announced a major new partnership. There’s a lot of money being put into the Hadoop ecosystem, which is rapidly changing. Many of this week’s articles cover the evolving set of frameworks, such as Storm and Spark, that make up Hadoop data pipelines.


The Cloudera blog has a guest post about Spark Streaming from engineers at Sharethrough. The post walks through their migration from a batch-processing system using Scalding to a micro-batch system using Spark Streaming. With the new architecture, data is reflected in the system within seconds rather than an hour. The post goes into some of the technical details and lessons learned during their migration.
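To make the batch vs. micro-batch distinction concrete, here is a minimal sketch in plain Python. It is not the Spark Streaming API and not Sharethrough's implementation; the function names, the event format, and the sample data are all hypothetical. It only illustrates the general idea of grouping events into small fixed-width time windows and aggregating each window on its own.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, key) events into fixed-width time windows.

    Hypothetical helper for illustration: each window is identified by
    its start time, rounded down to a multiple of interval_seconds.
    """
    batches = defaultdict(list)
    for timestamp, key in events:
        window_start = int(timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(key)
    return dict(sorted(batches.items()))


def counts_per_batch(events, interval_seconds):
    """Count occurrences of each key within every window."""
    return {
        start: Counter(keys)
        for start, keys in micro_batches(events, interval_seconds).items()
    }


# Made-up sample events: (seconds since start, event type).
events = [
    (0.5, "click"),
    (1.2, "view"),
    (4.9, "click"),
    (5.1, "click"),
    (9.8, "view"),
    (3600.0, "click"),
]

# Five-second windows stand in for a micro-batch interval;
# a 3600-second window stands in for an hourly batch job.
small_windows = counts_per_batch(events, 5)
hourly_windows = counts_per_batch(events, 3600)
```

With the five-second interval, the first three events are aggregated as soon as their window closes, whereas the hourly interval holds them until the whole hour has been collected.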

Spotify has introduced Storm into their backend to complement batch processing with Hadoop. This talk gives some insight into their deployment, how it fits into their data pipeline, details on some of the features they’re powering with Storm, and more.

The Hortonworks blog has a post on the recently released Apache Storm 0.9.1-incubating. The article details the new Netty-based messaging transport, added support for Windows, and a switch to Maven for builds. It also covers what to expect in the next release of Storm.

Cloudera has integrated Apache Sentry, the fine-grained authorization system for Hadoop, with Cloudera Search, the Apache Solr integration with CDH. A post on the Cloudera blog details the authorization and authentication layers in Cloudera Search as well as how secure impersonation is done from Hue.

The MapR blog has a post about several key terms related to big data. It covers the difference between data stream management systems (DSMS) and database management systems (DBMS), batch processing vs interactive mode, and real-time vs low-latency. It also talks about the very overloaded ‘streaming’ term in the Hadoop ecosystem.

Hortonworks is planning to ship Apache Falcon (incubating), which is a data management and governance system for Hadoop, with HDP 2.1. They’ve published a post describing what Falcon does in depth. The post also includes tutorials on building example pipelines (using Pig) and implementing cross-cluster replication.

Packt has a post from the authors of the book “Storm Blueprints: Patterns for Distributed Real-time Computation” on running Storm on YARN. The post gives a brief overview of Hadoop, focusing on how Storm complements MapReduce for real-time processing. It then talks about the architecture of Storm on YARN.

The Databricks blog has a post describing a new feature recently added to Apache Spark called Spark SQL. Whereas Shark uses Spark as a backend to Hive, Spark SQL provides a mechanism to invoke distributed SQL from a Spark job and perform additional processing on the data using Spark’s RDDs. It also enables persisting of Spark RDDs to Hive. The post has a detailed overview of the system and its optimization framework called Catalyst.

Doing anything interesting with large data tends to be a tall task. In addition to compute horsepower, there is a lot of infrastructure required to do something non-trivial. A post on Datanami explores the under-respected task of data cleaning, which often ends up being a large part of the data pipeline. The article includes interviews with some folks in industry about the importance of scrubbed data.


AMD has migrated 276 TB from Oracle to Hadoop. The Register has an interview with AMD’s CIO in which he explains the motivation behind the switch.

Answering the question “What is Hadoop?” is becoming increasingly difficult (it’s a question I ask every week as I’m evaluating articles for this newsletter). The Gartner blog has a post exploring this topic, including how each of the vendors has taken on a different set of components for its distribution.

Hortonworks announced their Series D round of funding, which totals $100 million. The new round was led by BlackRock and Passport Capital and also includes capital from existing investors. Hortonworks says that they'll be using the money to scale engineering efforts, global operations, and their ecosystem.

There’s been a lot of discussion on the Apache Mahout mailing list about the future of the project. GigaOm has an article summarizing the outcome of the discussion: the Mahout community has decided to support Apache Spark and the H2O framework rather than MapReduce.

On the heels of $160M in funding announced last week, Cloudera and Intel announced a deal in which Intel is investing a rumored $90M+. In addition to the investment, Intel is dropping their own distribution and will work with Cloudera on CDH. The Cloudera blog has detailed their thoughts on the partnership, and SiliconANGLE has commentary about the industry-wide implications of the deal.


Curated by Mortar Data



Big Data Automation with In-Memory Computing (San Mateo) - Monday, March 31

Big Data World: MongoDB Momentum, Oracle NoSQL & More! (Redwood City) - Wednesday, April 2

Apache Shark & Storm Use Cases (Santa Clara) - Thursday, April 3

Building a Future-Proof Data Warehouse & New Rules of Data Storytelling (Palo Alto) - Thursday, April 3

Deep Learning: Theory, Practice and Predictions! (San Francisco) - Thursday, April 3

Big Data Analytic Topic - Cloudera presentation on SPARK (Irvine) - Friday, April 4

Machine Learning on Big Data (Mountain View) - Saturday, April 5


Spring 2014 Seminar Series: Big Data Infrastructure (Tacoma) - Wednesday, April 2


Big Data Technologies - HBase and Data Processing at HubSpot (Portland) - Thursday, April 3


Advanced Hadoop Based Machine Learning (Austin)


An Intro to Apache Hadoop (with an Emphasis on SQL) by Cloudera (Phoenix) - Wednesday, April 2


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, March 31

New York

Workshop for Beginners I: Getting started with Hadoop (New York) - Friday, April 4


Real-time Big Data Analytics using Aerospike NoSQL, Storm and Hadoop (Boston) - Tuesday, April 1


SHUG 11 Anomaly Detection Ted Dunning (Stockholm) - Monday, March 31


SAS Meetup at Hadoop Summit Amsterdam (Amsterdam) - Tuesday, April 1

Apache HBase - Birds of a Feather Session (Amsterdam) - Tuesday, April 1

Pre Hadoop Summit meetup: Andrew Wang, Chris Wensel and Doug Cutting (Amsterdam) - Tuesday, April 1

Apache Hadoop YARN - Birds of a Feather Session (Amsterdam) - Wednesday, April 2


22nd meetup - Cloudera on HBase and Sqoop (Leuven) - Friday, April 4


ParisDataGeeks April Second Time @Criteo (Paris) - Friday, April 4


Bangalore Baby Hadoop Meetup (Bangalore) - Saturday, April 5