Data Eng Weekly


Hadoop Weekly Issue #77

06 July 2014

I was expecting a dearth of content to match the short week in the US for July 4th. But with Spark Summit this week in San Francisco, there were a number of partnerships, new tools, and other announcements. Both Databricks and MapR announced influxes of cash this week, and there was a lot of discussion about the future of Hive given a joint announcement by Cloudera, Databricks, IBM, Intel, and MapR to build a new Spark backend for Hive. In addition to that, Apache Hadoop 2.4.1 was released, Apache Pig 0.13.0 was released, and Flambo, a new Clojure DSL for Spark, was unveiled.

Technical

Pivotal HD and HAWQ support the Parquet file format natively in HDFS. This tutorial shows how to build a Parquet-backed table with HAWQ and then access the data stored in HDFS using Apache Pig.

http://www.pivotalguru.com/?p=727

Spark Summit was this week in San Francisco. Slides from the presentations (there are over 50) have been posted on the summit website. In addition to keynotes, there are three tracks—Applications, Developer, and Data Science.

http://spark-summit.org/2014/agenda

This article proposes an alternative to the Lambda Architecture. For those not familiar, the Lambda Architecture is an idea of combining batch and real-time workloads to build course-correcting streaming applications. The alternative, from Jay Kreps (who builds data infrastructure using Kafka and Samza at LinkedIn), is to use the stream-processing framework to backfill data (thus performing the role of batch in the Lambda Architecture). The article discusses the trade-offs and benefits of using the Lambda Architecture vs. a stream processing framework for everything.

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
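
The reprocessing idea can be illustrated with a minimal Python sketch. It is not taken from the article: an in-memory list stands in for a retained Kafka log, and the names `run_job`, `LOG`, `v1`, `v2`, and `serving` are hypothetical. It only shows the shape of the approach, in which a new version of the job replays the log from the first offset into a fresh output table, and readers are switched over once it has caught up.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical event type: (key, value). A real system would read these
# from a retained log such as a Kafka topic.
Event = Tuple[str, int]


def run_job(log: List[Event],
            transform: Callable[[int], int],
            start_offset: int = 0) -> Dict[str, int]:
    """Replay the log from start_offset and build a fresh output table."""
    table: Dict[str, int] = {}
    for key, value in log[start_offset:]:
        table[key] = table.get(key, 0) + transform(value)
    return table


# The retained log (assumed to keep the full history we want to reprocess).
LOG: List[Event] = [("a", 1), ("b", 2), ("a", 3)]


def v1(value: int) -> int:
    """Original processing logic."""
    return value


def v2(value: int) -> int:
    """Changed processing logic that needs a backfill."""
    return value * 10


# The table currently being served was produced by the old code.
serving = {"current": run_job(LOG, v1)}

# Start a second instance of the job from offset 0 with the new code,
# writing to a separate table, then swap once it has caught up.
candidate = run_job(LOG, v2, start_offset=0)
serving["current"] = candidate
```

The same code path handles both live processing and backfill here, which is the point of the proposal; the cost is that the log must retain enough history to replay.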

The Altiscale blog has a post on event transport for Hadoop. It gives an introduction to the problem that systems like Apache Flume and Apache Kafka are solving, namely moving data from applications to durable storage in Hadoop. The post also talks about the processing models of Flume and Kafka and the different tradeoffs of the two.

https://www.altiscale.com/event-transport-hadoop/

Altiscale has the first two parts of a three-part blog series on Apache Oozie. The first covers how to use wildcards in path expansion for Oozie datasets (there are several gotchas). The second covers using Oozie to run Hadoop streaming jobs (written in Ruby and Python). They show how to dump the environment (useful for debugging), how to configure Oozie to support custom Ruby gems in streaming jobs, and how to build a simple MultipleTextOutputFormat subclass to support multiple outputs from streaming jobs.

https://www.altiscale.com/wildcards-oozie-2/ https://www.altiscale.com/running-streaming-jobs-oozie/
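
As a rough illustration of the environment-dumping trick, here is a minimal Python streaming mapper. It is a sketch and not the code from the Altiscale posts: the function names and the `ENV` line prefix are made up for this example. It relies only on the usual Hadoop streaming contract, in which a mapper reads lines on stdin and writes tab-separated key/value pairs to stdout, while anything written to stderr ends up in the task logs.

```python
import os
import sys


def dump_environment(env, out):
    """Write every environment variable, sorted by name, one per line."""
    for name in sorted(env):
        out.write("ENV %s=%s\n" % (name, env[name]))


def map_lines(lines, out):
    """Word-count style mapper: emit 'word<TAB>1' for each word seen."""
    for line in lines:
        for word in line.split():
            out.write("%s\t1\n" % word)


def main():
    # Debug output goes to stderr so it does not pollute the mapper output.
    dump_environment(os.environ, sys.stderr)
    map_lines(sys.stdin, sys.stdout)
```

To use this as an actual mapper script, call `main()` at the bottom of the file. Comparing the dumped variables between a command-line run and an Oozie-launched run is one way to spot differences in paths or interpreter settings.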

Pivotal has posted benchmark numbers of their HAWQ system for SQL on Hadoop. The analysis used a 10-node cluster running RHEL 6.2. They compared Impala 1.1.1, Presto 0.52, Hive 0.12, and HAWQ 1.1. Pivotal HAWQ shows an average 6x performance improvement over Impala and a 21x speedup over Hive (like most vendor benchmarks, the results should be taken with a grain of salt). The post also touts the SQL compliance of HAWQ, which allows it to support many more TPC-DS queries than other systems.

http://blog.gopivotal.com/pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions

This article contains an overview of YARN and YARN schedulers with a focus on HPC audiences. After an intro to YARN architecture, the post describes 11 types of scheduling options familiar to users of HPC systems, many of which aren’t yet available in YARN. After that, it dives into the details of the YARN capacity and fair schedulers.

http://www.linuxjournal.com/content/how-yarn-changed-hadoop-job-scheduling

This presentation discusses Twitter’s experiences with running Spark at scale. For evaluation, they built a 35-node YARN cluster with Spark 0.8.1 and compared it to Pig and Scalding. They found that Spark produced a 3-4x wall-clock speedup over Pig and a 2-3x speedup vs. Scalding. They mentioned that tuning Spark jobs required a good understanding of the system, and that there were some limitations for productionization inside of YARN (but that more recent versions of Spark are aiming to address these).

http://www.slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter

Cloudera, MapR, Intel, IBM and Databricks announced a partnership to build a new Spark backend for Hive (more about that below). This post discusses the technical details and motivation for the new project. One of the main motivations is to help Spark shops have a single backend in place (rather than also requiring MapReduce or Tez). The article discusses Query Planning, Job Execution, and the main design considerations of the implementation.

http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

The Gartner blog has a post about how Hadoop development tools have been falling behind while the ecosystem concentrates efforts on SQL-on-Hadoop. It mentions four areas—development tools, application deployment, testing and debugging, and integrating with non-HDFS sources. There are some projects working on these areas, but there hasn’t been significant improvement.

http://blogs.gartner.com/nick-heudecker/dontforgetthehadoopdevelopers/

News

MapR announced $110 million in financing this week. Google Capital led the round with $80 million (the other $30 million was debt financing). InfoWorld has more details on the deal, including MapR’s popularity in enterprise and its expertise in machine learning.

http://www.infoworld.com/t/hadoop/how-much-hadoop-worth-google-80-million-245233

Databricks announced $33 million in series B funding and a new cloud platform. The funding round was led by New Enterprise Associates (NEA). The cloud platform provides an easy way to deploy Spark in Amazon Web Services with expansion to more cloud providers on the roadmap. It provides notebooks, dashboards, and a job launcher.

http://databricks.com/blog/2014/06/30/databricks-unveils-spark-based-cloud-platform.html

Pentaho and Databricks announced an integration between Pentaho and Apache Spark. The integration currently includes support for ETL and Reporting, and they’re working on a new backend for their Weka machine learning suite built on Spark.

http://databricks.com/blog/2014/06/30/application-spotlight-pentaho.html

Alteryx and Databricks announced a collaborative effort to work on SparkR. SparkR is a Spark backend to the R analytics system providing distributed computation.

http://www.alteryx.com/press-releases/alteryx-and-databricks-to-lead-development-of-apache-sparkr-for-scalable-hadoop

Fortune has the story of Hadoop’s birth at Yahoo as part of the Nutch project. It features interviews with Hadoop co-founders Doug Cutting and Mike Cafarella, who say they never anticipated the demand for Hadoop, which is driving a $50 billion market. It also discusses the role of open-source in Hadoop’s success, and how Cutting is now working on updating policy for big data.

http://fortune.com/2014/06/30/hadoop-how-open-source-project-dominate-big-data/

DataStax and Hortonworks announced that DataStax completed Hortonworks Certification for HDP.

http://hortonworks.com/blog/datastax-certified-hortonworks-data-platform/

Datanami has coverage of Hortonworks’ certification of Apache Spark on YARN. The article features an interview with Arun C. Murthy and Shaun Connolly of Hortonworks where they discuss the process of evaluating a new system for YARN and new features (such as node labels) they’re adding to YARN for optimizing jobs run on different systems.

http://www.datanami.com/2014/06/26/apache-spark-gets-yarn-approval-hortonworks/

Databricks and SAP announced a partnership this week. As part of the deal, Databricks will certify Spark to run on SAP HANA. The Databricks blog has more details on the partnership.

http://databricks.com/blog/2014/07/01/integrating-spark-and-hana.html

This post summarizes the highlights from this week’s Spark Summit. In addition to big announcements from DataStax, Databricks, and more, the post discusses the growth of the summit (from 450 to more than 1,000 attendees), some of the keynotes, and vendor turnout.

http://thomaswdinsmore.com/2014/07/03/spark-summit-2014-roundup/

MapReduce and Hadoop have been tied together for most of Hadoop’s history. But with the introduction of YARN, MapReduce is just one of the applications. This article points out that Google’s recent revelations about MapReduce don’t mean the end of Hadoop. The author also argues that Google’s new Cloud Dataflow isn’t meant to be a replacement for Hadoop (especially given Google’s investment in MapR this week).

http://www.datacenterknowledge.com/archives/2014/07/03/hadoop-mapreduce-ties-broken-dataflow-not-a-hadoop-killer/

WANdisco, which specializes in uptime for distributed systems, announced that it has acquired OhmData, makers of the C5 database. The C5 database is compatible with HBase APIs but provides different trade-offs and features.

http://gigaom.com/2014/06/30/hadoop-specialist-wandisco-acquires-hbase-like-startup-ohmdata/

Cloudera, Databricks, IBM, Intel, and MapR announced at Spark Summit a partnership to build a new Spark backend for Hive. This announcement caused a lot of confusion and speculation around the companies’ product offerings, particularly around Cloudera and Impala. The Register has coverage of the initial announcement including reactions from Hortonworks. The Cloudera blog has a post describing their vision for a future in which Cloudera Impala and Hive on Spark exist concurrently: the former for interactive queries and BI tools and the latter for everything else.

http://www.theregister.co.uk/2014/06/30/cloudera_and_co_spark/ http://vision.cloudera.com/broadening-support-for-apache-spark/

To add confusion to the announcement of Hive on Spark, Databricks announced that they’re no longer planning to support Shark, which is the original project for Hive on Spark (the new project will be a rewrite taking advantage of changes to the Hive APIs introduced in order to support Apache Tez as a backend). On top of that, they believe that Spark SQL, their system for invoking SQL queries from a Spark job, is the future of SQL on Spark. The post also acknowledges the need for Hive on Spark, which adds further complication to the discussion.

http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

A post on the Hortonworks blog tells the tale of Hadoop Then, Now, and Next. It describes traditional Hadoop based on HDFS and MapReduce, the arrival of YARN (and declares that Traditional Hadoop, built on mappers and reducers, is dead) as the basis for Enterprise Hadoop, and discusses how YARN will power the future of Hadoop.

http://hortonworks.com/blog/enterprise-hadoop-whats-next-data-management/

Releases

Apache Hadoop 2.4.1 was released. The new version contains a number of bug fixes, including a security fix for HDFS admin sub-commands.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201406.mbox/%3C95C9844E-FB00-4DFB-BECF-5C49E1A727F4%40hortonworks.com%3E http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/releasenotes.html

Sparkling Water is a new system combining 0xdata’s H2O with Apache Spark. H2O is an open-source machine learning framework for big data. It supports a number of algorithms for data science including k-means, random forest, stochastic gradient descent, and naive Bayes. H2O previously ran as a stand-alone cluster or on Hadoop, and Sparkling Water adds Spark as a runtime.

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

Pydoop 0.12 was released with support for YARN and CDH 4.4/4.5.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201407.mbox/%3C53B40660.5000903%40crs4.it%3E

mapr-sandbox-base is a Docker image for running the MapR sandbox in Docker.

https://registry.hub.docker.com/u/maprtech/mapr-sandbox-base/

Apache Pig 0.13.0 was released. The release contains a number of new features and performance improvements. Among the most interesting features are a pluggable execution engine and auto-local mode.

http://mail-archives.apache.org/mod_mbox/pig-user/201407.mbox/%3CCAB2zpW9cqbeMbuVFg7WS8%3DoOwHC5Or6Xi7ELkvFB91P8U-7yxA%40mail.gmail.com%3E

Flambo, which was open-sourced this week by Yieldbot, is a new project that provides a Clojure DSL for Apache Spark. Flambo’s README provides examples of using the idiomatic Clojure API.

https://github.com/yieldbot/flambo

MapR announced support for new versions of Hive, HttpFS, Mahout, and Pig. All are available for MapR 3.0.3, 3.1.1, and 4.0.0 FCS.

http://www.mapr.com/blog/apache-open-source-projects-release-update

The cassandra-driver-spark project is a new project from DataStax to integrate Cassandra with Apache Spark. With the driver, it’s possible to store a Spark RDD into Cassandra with a single statement.

https://github.com/datastax/cassandra-driver-spark

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Unlimited Analytics in Hadoop with Actian Vector (San Francisco) - Wednesday, July 9
http://www.meetup.com/SF-Data-Warehouse-Group/events/188742072/

Deep Dive Apache Drill: Building Highly Flexible, High Performance Query Engines (Menlo Park) - Thursday, July 10
http://www.meetup.com/Hadoop-Talks/events/180632322/

Hadoop: Past, Present and Future (Irvine) - Thursday, July 10
http://www.meetup.com/Orange-County-Java-Users-Group-OCJUG/events/192610022/

Texas

Extending Apache Ambari (Houston) - Wednesday, July 9
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/188066532/

Utah

Big Data Utah Meeting @ IHC - Discussion on Architecture and Best Practices (Salt Lake City) - Wednesday, July 9
http://www.meetup.com/BigDataUtah/events/191685552/

Colorado

Graph Processing with Hadoop & HBase by Brandon Vargo, Senior Platform Engineer (Boulder) - Thursday, July 10
http://www.meetup.com/Graph-Nerds-of-Boulder/events/192207712/

Kansas

MapR Talks Apache Spark & Tableau's Rel.8.2 (Kansas City) - Thursday, July 10
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/182908202/

Georgia

Hey Hadoop, Meet Apache Spark! (Atlanta) - Wednesday, July 9
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/181968972/

Washington, D.C.

MapR: Security and Hadoop Discussion (Followed by Happy Hour and Networking) (Washington) - Thursday, July 10
http://www.meetup.com/Hadoop-DC/events/187536342/

CANADA

Introduction to Apache Spark (Toronto) - Tuesday, July 8
http://www.meetup.com/TorontoHUG/events/191210182/

SQL on Hadoop Party - Downtown Session 1 (Vancouver) - Thursday, July 10
http://www.meetup.com/Big-Data-Developers-in-Vancouver/events/189972172/

SQL on Hadoop Party - Burnaby Session 3 (Burnaby, B.C.) - Friday, July 11
http://www.meetup.com/Big-Data-Developers-in-Vancouver/events/189972822/

INDIA

Hadoop by Use Case and Example (Hyderabad) - Saturday, July 12
http://www.meetup.com/hyderabad-scalability/events/182572882/