Data Eng Weekly

Hadoop Weekly Issue #72

01 June 2014

Apache Spark 1.0 was released this week, and there are a number of posts about Spark, including one describing how eBay is starting to use it. Hadoop Summit is this week in San Jose, and anticipation is building: two posts on the Hortonworks blog cover Discardable Memory and Materialized Queries, which will be presented at the summit. I’m sure there will be a lot of great presentations; please forward them my way as I won’t be attending.


The SequenceIQ blog has a post about building a command-line tool for Apache Ambari using Spring Shell and the Ambari REST API. The code for ambari-shell is available on GitHub, a binary is available as a Docker image, and ambari-shell is slated for inclusion in Ambari 1.6.1. The post has a tour of the features of the tool.

Discardable Memory and Materialized Queries (DMMQ) are proposed enhancements to HDFS and Hadoop query engines (such as Pig, Hive, and Cascading) that aim to take better advantage of the RAM available in a Hadoop cluster. My understanding of the features (in brief summary form) is that systems can build materialized queries (in the ‘materialized view’ DB sense) that become available to a query optimizer. Another optimization system looks at query load to discard materialized queries that are under-used. The article has much more detail, and there is a comment from the author comparing DMMQ with Spark’s RDDs.
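To make the build-then-discard idea concrete, here is a toy sketch in plain Python. It is not the design from the Hortonworks post: the class `MaterializedQueryCache`, its methods, and the `min_hits` threshold are all hypothetical names invented for illustration, and a real system would work against HDFS and a query optimizer rather than a dictionary.

```python
from collections import Counter


class MaterializedQueryCache:
    """Toy in-memory store of materialized query results (hypothetical)."""

    def __init__(self):
        self._results = {}      # query name -> materialized rows
        self._hits = Counter()  # query name -> times the result was reused

    def materialize(self, query, rows):
        # Store a precomputed result so later queries can reuse it.
        self._results[query] = list(rows)

    def lookup(self, query):
        # Stand-in for the optimizer: reuse a materialized result if one exists.
        if query in self._results:
            self._hits[query] += 1
            return self._results[query]
        return None

    def discard_underused(self, min_hits):
        # Drop results reused fewer than min_hits times; return what was dropped.
        dropped = [q for q in self._results if self._hits[q] < min_hits]
        for q in dropped:
            del self._results[q]
            self._hits.pop(q, None)
        return dropped


cache = MaterializedQueryCache()
cache.materialize("daily_clicks", [("2014-05-31", 120)])
cache.materialize("weekly_signups", [("2014-W22", 45)])
cache.lookup("daily_clicks")
cache.lookup("daily_clicks")
dropped = cache.discard_underused(min_hits=1)
```

In this sketch the never-reused result is the one that gets discarded, while the result that served two lookups stays available.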

Related to DMMQ, Discardable Distributed Memory (DDM) is a proposed new feature of HDFS for implementing DMMQ. Data in DDM is discardable and lazily backed by an HDFS file. It might be used to store intermediate outputs from Hive or Spark RDDs. The full post has a lot more details on the architecture and links to relevant background references.

Lots of folks love Spark at first glance, but stories from early adopters describe several gotchas that crop up as you try to do non-trivial tasks with it. This post looks at many of the problems and types of debugging & tweaking that need to be done to make Spark scale. There’s also a good discussion of other tools and probable fixes in newer versions of Spark in the comments.

The Cloudera blog has a post describing the implementation details of Spark-on-YARN. It contains an overview of the various daemons and data flow paths that exist during the lifetime of a Spark application, and contrasts these with those of a MapReduce job running on YARN. It discusses two options for running a Spark job on YARN (yarn-cluster and yarn-client) as well as the tradeoffs between them. If you’re looking to deploy Spark on a YARN cluster, these details will be very valuable.

Puppet is a configuration management tool for provisioning compute instances. The latest episode of the weekly Puppet podcast covers the CDH module for Puppet, which can be used to deploy the CDH stack.

As I’ve said before, I’m always a bit wary of vendor benchmarks because a vendor typically won’t publish results unless they’re favorable for their own system. Regardless, it’s still interesting to read about how folks benchmark and what kinds of results they’re seeing with various systems. In this experiment, Cloudera tried out Cloudera Impala, Hive-on-Tez, Shark, and Presto.

BOSH is a tool for provisioning distributed systems with support for several cloud providers and IaaS tools like AWS EC2, Google Compute Engine, and OpenStack. A post on the Pivotal Blog details how to use BOSH to provision a Hadoop cluster on Google Compute Engine in less than three minutes.

eBay has started using Apache Spark on their Apache YARN Hadoop cluster. A post on the eBay tech blog describes some of the things that they’re using Spark for, including k-means clustering and Shark (the Hive-on-Spark implementation). There’s a code snippet detailing how to use the k-means clustering library in Spark, and there’s some background on what Spark is and how it works.
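For readers who haven’t met k-means before, here is a minimal single-machine sketch of the algorithm (Lloyd’s iteration) in plain Python. It is not the snippet from the eBay post and does not use Spark’s k-means library; the function name `kmeans`, the fixed starting centroids, and the sample points are assumptions chosen to keep the example deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

A distributed implementation follows the same two steps, with the assignment step spread across the cluster and the per-cluster means aggregated afterwards.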

RHadoop is a system for accessing data in HDFS and HBase from R as well as for running MapReduce jobs from R. This tutorial details how to set up Hadoop, HBase, R, and RHadoop on a Mac using Homebrew. There’s a fair amount of configuration required on the R side of things once all the pieces are in place, and the tutorial caps off with running WordCount as a MapReduce job using RHadoop.
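If WordCount is new to you, the shape of the job can be sketched in a few lines of plain Python. This is not the RHadoop code from the tutorial: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are made up for illustration, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every whitespace-separated word.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    # Sum the ones emitted for a single word.
    return word, sum(counts)


lines = ["hadoop hbase hadoop", "R hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

The map and reduce functions are the only parts a job author normally writes; the grouping in between is handled by the MapReduce framework.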

Java Code Geeks has a post describing how to index a data set in ElasticSearch using Hive. The post is the fourth in a series revolving around end-to-end processing of search click data. The post also shows some examples of using the ElasticSearch functionality of Spring Data to run queries against the ElasticSearch cluster.


Hadoop Summit is this week in San Jose, and the Hortonworks blog has some details on the event. There are a number of meetups taking place in addition to the conference (see the events section below).

Last week there were two articles that touched on running Hadoop in the cloud. Whereas they focussed more on performance (particularly HBase) and price, this article takes a more holistic view of why we might start seeing more Hadoop deployments in the cloud. For instance, if you already have all of your data in a blob store like Amazon S3, it might not make sense to copy all of your data to the data center.

Trifacta announced a series C financing round of $25 million. Trifacta’s system aims to increase analyst and developer productivity by pre-processing raw data. Their new round will help them expand marketing and sales efforts.

Forbes Contributor Dan Woods has written a piece that raises some frank and tough questions about Cloudera’s business model. The two main questions raised in the article are 1) Does Cloudera want to position itself as a replacement for or complement of the enterprise data warehouse? and 2) How will Cloudera differentiate itself (and keep customers)? I find it hard to believe Cloudera would raise so much money without good answers to these questions, but there definitely seems to be some miscommunication in the press.

Videos from Berlin Buzzwords have been posted, and there are a number of talks about components in the Hadoop ecosystem. Speakers include Roman Shaposhnik of Pivotal (on Apache Giraph), Steve Loughran of Hortonworks (on YARN application development), Ted Dunning of MapR (on deep learning on time series), and Mark Miller & Wolfgang Hoschek of Cloudera (on integrating search and Hadoop). Seems like it was a great conference with a number of high-quality talks.


Cloudera released CDH 4.7 and Cloudera Search 1.3. CDH 4.7 contains a number of bug fixes to HBase, HDFS, Hue, Hive, and more. Search 1.3.0, which requires CDH 4.7, is also a bug fix release.

Apache Spark has hit version 1.0. With the new release, the project is guaranteeing API compatibility across the 1.x release series. Version 1.0 also adds a number of new features and improvements, including Spark SQL, better support for secure Hadoop, and updates to MLlib (for machine learning), Spark Streaming, and GraphX (graph processing).

Apache Ambari 1.6.0 was released. This version of the Hadoop cluster management software brings support for “blueprints,” which provide reusable templates for configuring clusters. The Hortonworks blog has more details on this feature as well as the other new features of the release: Stacks and PostgreSQL support.

HP announced the next version of their Vertica Analytics Platform, and it includes a number of new features to integrate with Hadoop. In addition to directly supporting data stored on HDFS, the new version can read data stored in Parquet and Avro files.

PivotalR is a new open-source tool from Pivotal to integrate R with a SQL backend running on a Hadoop cluster. It supports MADlib for running machine learning algorithms directly in the DB. The source code is available on GitHub, and it currently supports Greenplum, Pivotal HD / HAWQ, and PostgreSQL backends.

Hive_test is a tool for writing tests against an embedded Hive using an embedded Derby DB. It also provides Maven tools for integrating with Hadoop.


Curated by Mortar Data



"Mathematical Shape of Big Data" @ Hadoop Summit (San Jose) - Monday, June 2

A Leap Forward for SQL on Hadoop & Lightning Talks @ Hadoop Summit (San Jose) - Monday, June

Meet the AWS Elastic MapReduce Team at the Hadoop Summit (San Jose) - Wednesday, June 4

Accelerate Big Data Application Development with Cascading and HDP (San Jose) - Monday, June 2

Mining Big Data with Apache Spark (San Francisco) - Thursday, June 5

Hadoop Summit 2014 Oozie Meetup (San Jose) - Tuesday, June 3

Apache Ambari Birds of Feather Session (San Jose) - Thursday, June 5

Hadoop 2 Changes Everything. What’s Next? (La Jolla) - Thursday, June 5

Big Data Analytics by Alteryx (Irvine) - Friday, June 6


Make the Elephant Fly: Real-Time Data Performance (Denver) - Wednesday, June 4


Introduction to Hadoop, Part 1 (Tempe) - Wednesday, June 4


Big Data, Big Challenges: To Hadoop or Not to Hadoop? (Cambridge) - Thursday, June 5


Big Data Kickoff in Vancouver (Downtown) (Vancouver) - Tuesday, June 3

Big Data Kickoff in Vancouver (Burnaby) (Vancouver) - Wednesday, June 4


A Deep Dive into Apache Drill - Fast Interactive SQL on Hadoop (Munich) - Thursday, June 5


Cluster de Hadoop con BigSQL (Madrid) - Friday, June 6


Bangalore Baby Hadoop Meetup (Bangalore) - Saturday, June 7