Data Eng Weekly

Hadoop Weekly Issue #77

06 July 2014

I was expecting a dearth of content to match the short week in the US for July 4th. But with Spark Summit this week in San Francisco, there were a number of partnerships, new tools, and other announcements. Both Databricks and MapR announced influxes of cash this week, and there was a lot of discussion about the future of Hive given a joint announcement by Cloudera, Databricks, IBM, Intel, and MapR to build a new Spark backend for Hive. In addition to that, Apache Hadoop 2.4.1 was released, Apache Pig 0.13.0 was released, and Flambo, a new Clojure DSL for Spark, was unveiled.

Technical

Pivotal HD and HAWQ support Parquet files natively in HDFS. This tutorial shows how to build a Parquet-backed table with HAWQ and then access the data stored in HDFS using Apache Pig.

Spark Summit was this week in San Francisco. Slides from the presentations (there are over 50) have been posted on the summit website. In addition to keynotes, there are three tracks—Applications, Developer, and Data Science.

This article proposes an alternative to the Lambda Architecture. For those not familiar, the Lambda Architecture combines a batch and a real-time processing layer to build course-correcting streaming applications. The alternative, from Jay Kreps (who builds data infrastructure using Kafka and Samza at LinkedIn), is to use the stream-processing framework itself to backfill data by replaying the log (thus performing the role of the batch layer in the Lambda Architecture). The article discusses the trade-offs and benefits of the Lambda Architecture vs. using a stream-processing framework for everything.
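The core idea — reusing the same stream-processing code for live updates and for backfill by replaying the log — can be illustrated with a minimal Python sketch (the in-memory "log" and the page-view job here are hypothetical; a real deployment would replay from Kafka with a framework like Samza):

```python
# Minimal sketch of stream reprocessing replacing the batch layer.
# The in-memory log and the page-view counting job are hypothetical.

def process(event, state):
    """The single stream-processing job: count page views per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1
    return state

def live(log, state, event):
    """Append to the log and update the serving state incrementally."""
    log.append(event)
    return process(event, state)

def reprocess(log):
    """Backfill: replay the whole log through the *same* code path,
    producing a fresh output table that can be swapped in atomically."""
    state = {}
    for event in log:
        state = process(event, state)
    return state

log, state = [], {}
state = live(log, state, {"user": "alice"})
state = live(log, state, {"user": "alice"})
state = live(log, state, {"user": "bob"})
assert reprocess(log) == state  # replay reproduces the live view
```

Because backfill and live processing share one code path, there is only one implementation to keep correct — the trade-off Kreps weighs against the Lambda Architecture's dual batch/streaming code bases.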

The altiscale blog has a post on event transport for Hadoop. It gives an introduction to the problem that systems like Apache Flume and Apache Kafka are solving—namely moving data from applications to durable storage in Hadoop. The post also talks about the processing models of Flume and Kafka and the different tradeoffs of the two.

Altiscale has the first two parts of a three-part blog series on Apache Oozie. The first covers how to use wildcards in path expansion for Oozie datasets (there are several gotchas). The second covers using Oozie to run Hadoop streaming jobs (written in Ruby and Python). They show how to dump the environment (useful for debugging), how to configure Oozie to support custom Ruby gems in streaming jobs, and how to build a simple MultipleTextOutputFormat subclass to support multiple outputs from streaming jobs.

Pivotal has posted benchmark numbers of their HAWQ system for SQL on Hadoop. The analysis used a 10-node cluster running RHEL 6.2. They compared Impala 1.1.1, Presto 0.52, Hive 0.12, and HAWQ 1.1. Pivotal HAWQ shows an average 6x performance improvement over Impala and a 21x speedup over Hive (like most vendor benchmarks, the results should be taken with a grain of salt). The post also touts the SQL compliance of HAWQ, which allows it to support many more TPC-DS queries than other systems.

This article contains an overview of YARN and YARN schedulers with a focus on HPC audiences. After an intro to the YARN architecture, the post describes 11 types of scheduling options familiar to users of HPC systems, many of which aren't yet available in YARN. After that, it dives into the details of the YARN capacity and fair schedulers.
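As an example of what configuring the capacity scheduler looks like, a minimal capacity-scheduler.xml might split the cluster between two queues (the "prod" and "dev" queue names here are illustrative):

```xml
<!-- Minimal capacity-scheduler.xml sketch; queue names are illustrative. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <!-- prod jobs are guaranteed 70% of cluster capacity -->
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```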

This presentation discusses Twitter's experiences with running Spark at scale. For the evaluation, they built a 35-node YARN cluster with Spark 0.8.1 and compared it to Pig and Scalding. They found that Spark produced a 3-4x wall-clock speedup over Pig and a 2-3x speedup over Scalding. They mention that tuning Spark jobs requires a good understanding of the system, and that there were some limitations for productionization inside of YARN (though more recent versions of Spark aim to address these).

Cloudera, MapR, Intel, IBM and Databricks announced a partnership to build a new Spark backend for Hive (more about that below). This post discusses the technical details and motivation for the new project. One of the main motivations is to help Spark shops have a single backend in place (rather than also requiring MapReduce or Tez). The article discusses Query Planning, Job Execution, and the main design considerations of the implementation.

The Gartner blog has a post about how Hadoop development tools have been falling behind while the ecosystem concentrates efforts on SQL-on-Hadoop. It mentions four areas—development tools, application deployment, testing and debugging, and integrating with non-HDFS sources. There are some projects working on these areas, but there hasn’t been significant improvement.

News

MapR announced $110 million in financing this week. Google Capital led the round with $80 million (the other $30 million was debt financing). InfoWorld has more details on the deal, including MapR's popularity in the enterprise and its expertise in machine learning.

Databricks announced $33 million in Series B funding and a new cloud platform. The funding round was led by New Enterprise Associates (NEA). The cloud platform provides an easy way to deploy Spark on Amazon Web Services, with expansion to more cloud providers on the roadmap. It provides notebooks, dashboards, and a job launcher.

Pentaho and Databricks announced an integration between Pentaho and Apache Spark. The integration currently includes support for ETL and Reporting, and they’re working on a new backend for their Weka machine learning suite built on Spark.

Alteryx and Databricks announced a collaborative effort to work on SparkR. SparkR is a Spark backend to the R analytics system providing distributed computation.

Fortune has the story of Hadoop’s birth at Yahoo as part of the Nutch project. It features interviews with Hadoop co-founders Doug Cutting and Mike Cafarella, who say they never anticipated the demand for Hadoop, which is driving a $50 billion market. It also discusses the role of open-source in Hadoop’s success, and how Cutting is now working on updating policy for big data.

DataStax and Hortonworks announced that DataStax completed Hortonworks Certification for HDP.

Datanami has coverage of Hortonworks’ certification of Apache Spark on YARN. The article features an interview with Arun C. Murthy and Shaun Connolly of Hortonworks where they discuss the process of evaluating a new system for YARN and new features (such as node labels) they’re adding to YARN for optimizing jobs run on different systems.

Databricks and SAP announced a partnership this week. As part of the deal, Databricks will certify Spark to run on SAP HANA. The Databricks blog has more details on the partnership.

This post summarizes the highlights from this week's Spark Summit. In addition to big announcements from DataStax, Databricks, and more, the post discusses the growth of the summit (from 450 to more than 1,000 attendees), some of the keynotes, and vendor turnout.

MapReduce and Hadoop have been tied together for most of Hadoop's history. But with the introduction of YARN, MapReduce is just one of many applications that can run on a cluster. This article points out that Google's recent revelations about moving beyond MapReduce don't mean the end of Hadoop. The author also argues that Google's new Cloud Dataflow isn't meant to be a replacement for Hadoop (especially given Google's investment in MapR this week).

WANdisco, which specializes in uptime for distributed systems, announced that it has acquired OhmData, makers of the C5 database. C5 is compatible with the HBase API but provides different trade-offs and features.

Cloudera, Databricks, IBM, Intel, and MapR announced at Spark Summit a partnership to build a new Spark backend for Hive. This announcement caused a lot of confusion and speculation around the companies' product offerings—particularly around Cloudera and Impala. The Register has coverage of the initial announcement, including reactions from Hortonworks. The Cloudera blog has a post describing their vision for a future in which Cloudera Impala and Hive on Spark exist concurrently—the former for interactive queries and BI tools and the latter for everything else.

Adding to the confusion around the Hive on Spark announcement, Databricks announced that it is no longer planning to support Shark, the original project for running Hive on Spark (the new project will be a rewrite that takes advantage of the Hive API changes introduced to support Apache Tez as a backend). On top of that, they believe that Spark SQL, their system for invoking SQL queries from a Spark job, is the future of SQL on Spark. The post also acknowledges the need for Hive on Spark, which further complicates the discussion.

A post on the Hortonworks blog tells the tale of Hadoop Then, Now, and Next. It describes traditional Hadoop based on HDFS and MapReduce, the arrival of YARN (and declares that Traditional Hadoop, built on mappers and reducers, is dead) as the basis for Enterprise Hadoop, and discusses how YARN will power the future of Hadoop.

Releases

Apache Hadoop 2.4.1 was released. The new version contains a number of bug fixes, including a security fix for HDFS admin sub-commands.

Sparkling Water is a new system combining 0xdata's H2O with Apache Spark. H2O is an open-source machine learning framework for big data. It supports a number of algorithms for data science, including k-means, random forest, stochastic gradient descent, and naive Bayes. H2O previously supported running on a stand-alone cluster or on Hadoop, and Sparkling Water adds Spark as a runtime.
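For readers unfamiliar with k-means, one of the algorithms listed, here is a toy one-dimensional version in pure Python (no relation to H2O's implementation): it alternately assigns each point to its nearest centroid, then recomputes each centroid as the mean of its cluster.

```python
# Toy 1-D k-means (illustrative only, no relation to H2O's code):
# alternate between assigning points to the nearest centroid and
# recomputing each centroid as the mean of its assigned points.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centroids)

# Two well-separated groups converge to centroids near 1.0 and 9.0.
print(kmeans([1.0, 1.1, 0.9, 9.0, 9.1, 8.9], [0.0, 5.0]))
```

Distributed implementations like H2O's parallelize the assignment step across the cluster and aggregate the per-partition sums to update centroids.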

Pydoop 0.12 was released with support for YARN and CDH 4.4/4.5.

mapr-sandbox-base is a docker image for running the MapR sandbox in docker.

Apache Pig 0.13.0 was released. The release contains a number of new features and performance improvements. Among the most interesting features are a pluggable execution engine and auto-local mode.

Flambo, which was open-sourced this week by Yieldbot, is a new project that provides a Clojure DSL for Apache Spark. Flambo’s README provides examples of using the idiomatic Clojure API.

MapR announced support for new versions of Hive, HttpFS, Mahout, and Pig. All are available for MapR 3.0.3, 3.1.1, and 4.0.0 FCS.

The cassandra-driver-spark project is a new project from DataStax to integrate Cassandra with Apache Spark. With the driver, it's possible to store a Spark RDD into Cassandra with a single statement.


Curated by Mortar Data


Events

California
Unlimited Analytics in Hadoop with Actian Vector (San Francisco) - Wednesday, July 9

Deep Dive Apache Drill: Building Highly Flexible, High Performance Query Engines (Menlo Park) - Thursday, July 10

Hadoop: Past, Present and Future (Irvine) - Thursday, July 10


Texas
Extending Apache Ambari (Houston) - Wednesday, July 9


Utah
Big Data Utah Meeting @ IHC - Discussion on Architecture and Best Practices (Salt Lake City) - Wednesday, July 9


Colorado
Graph Processing with Hadoop & HBase by Brandon Vargo, Senior Platform Engineer (Boulder) - Thursday, July 10


Missouri
MapR Talks Apache Spark & Tableau's Rel.8.2 (Kansas City) - Thursday, July 10


Georgia
Hey Hadoop, Meet Apache Spark! (Atlanta) - Wednesday, July 9

Washington, D.C.

MapR: Security and Hadoop Discussion (Followed by Happy Hour and Networking) (Washington) - Thursday, July 10


Canada
Introduction to Apache Spark (Toronto) - Tuesday, July 8

SQL on Hadoop Party - Downtown Session 1 (Vancouver) - Thursday, July 10

SQL on Hadoop Party - Burnaby Session 3 (Burnaby, B.C.) - Friday, July 11


India
Hadoop by Use Case and Example (Hyderabad) - Saturday, July 12