Data Eng Weekly

Hadoop Weekly Issue #92

19 October 2014

With Strata + Hadoop World this week, there were a number of partnership announcements and software releases. Among them, Cloudera and Hortonworks released new versions of their distributions, MapR is bundling MapR-DB with their community edition, and Pivotal announced plans for the Tachyon project. There are also several good technical posts this week covering Sqoop, Kafka, Presto, Hive, and Scala as a language for data processing. I tried to cover the key news from the week but likely missed some stories given the Strata + Hadoop World tsunami. Please let me know if there’s something you think should be in next week’s newsletter.

Technical is a new blog focussing on data ingestion into Hadoop. I recommend catching up on all the posts published so far. They cover Flume v. Kafka, the design and features of Sqoop2, the Kafka High-level consumer, and a recap of this week’s Kafka Meetup in NYC.

This newsletter has had a lot of coverage of the work done by the folks at SequenceIQ on dockerizing Hadoop. In fact, they’ve been up to so much that it can be hard to see the whole picture through a series of posts. The Hortonworks blog has a guest post in which SequenceIQ summarizes their platform—Cloudbreak for provisioning clusters and Periscope for SLA enforcement and autoscaling.

This post describes some of the roadblocks in setting up the latest version of Apache Sqoop (1.99.3) and how to get past them. It serves as a pre-walkthrough to the Sqoop in 5 minutes tutorial from the official Sqoop documentation.

Apache Phoenix has gained momentum recently as a SQL engine for HBase. The Hortonworks blog has some notes on integrating Phoenix with Hive, which can also do SQL over data stored in HBase (but with an emphasis on batch as opposed to OLTP). Plans include a unified SQL layer which can delegate to either Phoenix or Hive, a shared metadata repository, and a shared transaction manager.

The Scala language has been adding adopters for years, especially as several popular distributed systems are written in Scala (e.g. Kafka and Spark). This post discusses three reasons that Scala should be your go-to language for data engineering / processing at scale.

The Hortonworks blog describes Ozone, an object store that it plans to add to HDFS. An object store (Amazon S3 is probably the best known example) has different requirements than a file system, such as support for large numbers of objects (many more than the number of files HDFS can support), a simple REST API, and cross-datacenter replication.

This post is a brief intro to the Hadoop metrics framework. Specifically, it includes snippets showing both how metrics are registered and how they are exported (via FileSink and web services).

Cloudera introduced Cloudera Director this week for running Hadoop clusters in the cloud. The AWS big data blog has a post describing how to build a cluster in AWS with Cloudera Director and CloudFormation. The post describes two possible topologies in an AWS Virtual Private Cloud, how to configure the cluster, how to deploy it, and how to terminate the cluster.

The Qubole blog has a guest post by MediaMath on their experiences with Presto, the big data SQL framework from Facebook. The post includes a performance comparison of Presto vs Hive on data (presumably real data from MediaMath, not synthetic data) stored in Amazon S3. Results show that Presto is ~3x faster than Hive on average, and 5x faster when caching (a Qubole-only speedup) is enabled.

The Cloudera blog has two posts on new features of CDH 5.2, which was released this week (more on the release below). The first covers Impala, which has gained support for several analytics functions, two new datatypes (VARCHAR and CHAR), support for spilling to disk when the query doesn’t fit in RAM, and more. The second covers Apache Sentry, which adds the GRANT keyword (to allow a user to grant privileges to other users) and the REVOKE keyword (to remove those privileges).


The DBMS2 blog has a post about Cloudera’s product offering. It serves as a glossary of all the products and buzz words surrounding Cloudera’s products. The post is pre-Strata + Hadoop World, so it doesn’t include any newly announced products (such as Cloudera Director).

This post discusses the history and architecture of the Apache incubator project, Flink (formerly Stratosphere). The post argues that Flink is in a better position than most big data query engines because it contains a cost-based optimizer for unstructured data and can unify real-time processing with analysis of historic data. In terms of real-time, the post compares Flink with Spark Streaming (which only does micro-batch).
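To make the micro-batch distinction concrete, here is a minimal sketch in plain Python. It uses neither the Flink nor the Spark API: the function names `per_record` and `micro_batch` are hypothetical, and batching by a fixed event count is a simplifying assumption (Spark Streaming forms its batches by time interval).

```python
from typing import Callable, Iterable, Iterator, List


def per_record(events: Iterable[int],
               handle: Callable[[int], int]) -> Iterator[int]:
    """Record-at-a-time: each event is handled as soon as it arrives."""
    for event in events:
        yield handle(event)


def micro_batch(events: Iterable[int], batch_size: int,
                handle_batch: Callable[[List[int]], int]) -> Iterator[int]:
    """Micro-batch: events are buffered and handled one small batch at a time.

    Batching by count is an illustrative assumption, not how any
    particular engine does it.
    """
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield handle_batch(batch)


events = [1, 2, 3, 4, 5]
streamed = list(per_record(events, lambda e: e * 10))  # one result per event
batched = list(micro_batch(events, 2, sum))            # one result per batch
```

In the first function a result is available after every event; in the second, nothing is emitted until a batch fills up, which is where the added latency of micro-batching comes from.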

MapR announced that they’re planning to integrate Apache Drill, the data exploration platform, with Apache Spark. Given recent news related to Spark (e.g. efforts to get Hive running on Spark), this is another vote for Spark as the successor to MapReduce.

This post opens with an observation that I struggle with every week as I find content for this newsletter: “it’s getting hard to pinpoint what, exactly, Hadoop is.” It points out that all the moving pieces and flexibility of Hadoop can make it difficult to deploy and operate. This in turn is a big opportunity for folks selling to enterprises.

Cloudera has started “Cloudera Labs” for incubation of projects inside of Cloudera Engineering. The initial set of projects includes Kafka, Hive-on-Spark, Impyla (a Python client for Impala), and Oryx (an implementation of the lambda architecture).

The DBMS2 blog has a post-Strata + Hadoop World article on Cloudera’s announcements this week. Key observations include the large number of business partnerships announced by Cloudera this week and that they’re becoming more cloud friendly.

The number of partnerships and announcements this week from Cloudera is a bit overwhelming. Many are covered elsewhere in the newsletter, but the full list is indexed in the Cloudera press center.

Pivotal’s distribution, Pivotal HD, has included support for Spark since May. They’ve announced plans to take their commitment to in-memory computing a step further by partnering with UC Berkeley’s AMPLab to further develop the Tachyon in-memory distributed file system.

Datanami has coverage of recent releases by big data software vendors in which MapReduce is replaced with next generation processing systems. Of the companies profiled, four have moved to Spark while one has moved to Tez. Regardless of whether Spark or Tez is winning, it’s clear that MapReduce is becoming less common.

A new Gartner research note on comparing Hadoop distributions has been published. While the full report is behind a paywall, this post describes the note’s key findings and recommendations. They include: vendor lock-in isn’t a large concern, Gartner expects new Hadoop ecosystem technologies soon, and Hadoop is becoming the de facto system for cluster management.

“Time Series Databases” is a new book written by some folks at MapR and being published by O’Reilly. The book looks at open-source tools for time series data, specifically OpenTSDB and Grafana. It also covers using MapR-DB as a backend to OpenTSDB. MapR is sponsoring a free download (behind an email-wall).

This article considers the pros and cons of various ways to build an analytics platform with Hadoop. Options include Hadoop as a source of truth from which a data warehouse is populated, a parallel data warehouse, Hadoop on an appliance, and analytics directly from Hadoop. The post also includes suggestions for successfully using Hadoop as an analytics platform.


Mortar, makers of the Pig-as-a-Service platform, have announced integration with Luigi. Luigi is an open-source workflow management tool originally written at Spotify. Mortar’s introductory blog post explains some of the advantages of Luigi, details the integrations they’ve built for it, and links to a tutorial for getting started.

VMware’s vSphere Big Data Extensions (BDE) 2.1 includes integration with Cloudera Manager and Apache Ambari for provisioning Hadoop clusters. After provisioning VMs, BDE makes API calls to the management software to build and configure the cluster. There’s much more information about the integration in the blog post below.

Protegrity Avatar is a new system for data protection in HDP. It supports encryption at rest and fine-grained access controls for Hive, Pig, HBase, and MapReduce.

Cloudera has released Cloudera Enterprise 5.2. The announcement highlights several improvements in the release—security (including the fruits of joint work with Intel), data management & governance, cloud deployment, and more. The release includes new versions of HBase (0.98.6), Apache Spark (1.1), Impala (2.0), and several other components. Apache Kafka integration is also available via Cloudera Labs.

Hortonworks released HDP 2.2, the next release of their distribution. Release highlights include phase 1 of an effort to improve the performance of, and add (simple) transactions to, Hive, as well as Spark on YARN, the inclusion of Kafka, Apache Ranger (previously Argus) for cluster security, and support for cloud backup. There’s a much more complete overview of the release, which features new versions of every component of the distribution, on the Hortonworks blog.

MapR announced this week that they are including MapR-DB within the MapR Community Edition. MapR-DB implements the HBase API but is built with a different architecture (which leverages the MapR FileSystem).

Microsoft announced this week that Azure HDInsight, a Hadoop-as-a-Service system, is adding support for Apache Storm. The integration is available in preview form starting now. Also, they expect to land support for HDP 2.2 on HDInsight in November.

Actian announced a free community version of their Actian Analytics Platform. Actian’s SQL-in-Hadoop system stores data in HDFS but doesn’t interoperate with the rest of the Hadoop ecosystem. The community version is free for an unlimited number of nodes and up to 500GB of data.

Rackspace has announced the OnMetal Cloud Big Data Platform, which is used to run a bare-metal Hadoop/Spark cluster. This is an interesting product that lies between a dedicated cluster and one running in the cloud on virtualized hardware.

Pivotal announced new versions of GemFire XD and SQLFire. GemFire XD is a distributed database that runs atop Pivotal HD. Both releases include improved integration with HDFS.


Curated by Mortar Data



Storage Solutions for Big Data with Hadoop Architect Sameer Tiwari (Palo Alto) - Tuesday, October 21

Hadoop Effortlessly: A Data Inventory Is Key to Data Self-Service (Sunnyvale) - Thursday, October 23

Data Science Camp @ Bay Area ACM (San Jose) - Saturday, October 25


Data Science Using Big R for in-Hadoop Analytics (Las Vegas) - Sunday, October 26


Drill Down into Apache Drill! Plus, Pinsight Media + Hadoop and Hive Use Case! (Overland Park) - Thursday, October 23


Resource Management in Modern Hadoop Clusters (Saint Louis) - Tuesday, October 21


Welcome to the Nashville Cloudera User Group (Nashville) - Thursday, October 23


Hands-on MapReduce and Spark Programming by Roger Ding (McLean) - Wednesday, October 22


Escape from Hadoop: Spark Streaming, Cassandra, Scala & Akka, with Helena Edelson (Philadelphia) - Tuesday, October 21


Apache Spark in Four Parts (Annapolis Junction) - Tuesday, October 21


Real-Time Analytics in Hadoop, and Hadoop in 2015 (Saint Petersburg) - Wednesday, October 22

New York

Index-Based SQL-on-Hadoop: An Architectural Comparison of Tools (New York) - Monday, October 20


ADAM, Spark, and Tachyon (Cambridge) - Monday, October 20


HBase, What's It All About? (Colchester) - Friday, October 24


Hadoop Continued: Hive and Spark + Experiences (Vienna) - Tuesday, October 21


Disruptive Applications and Hadoop... on the Cloud (Vancouver, BC) - Tuesday, October 21

Connecting Visual Analytics Tools to Enterprise Big Data with Spark SQL (Vancouver, BC) - Thursday, October 23


"Apache Spark 101" with Paweł Szulc (Wroclaw) - Tuesday, October 21


Scala.IO (Paris) - Wednesday, October 22 and Thursday, October 23


High-Availability Hadoop and Apache Cassandra (Sydney) - Wednesday, October 22


Big Data/Data Science Meetup (Cluj-Napoca) - Thursday, October 23


Shanghai Spark Meetup, with Jason Dai (Shanghai) - Saturday, October 25

MLlib and Distributed Machine Learning (Beijing) - Sunday, October 26