Data Eng Weekly

Hadoop Weekly Issue #98

07 December 2014

There’s a lot to cover from the past two weeks, including several new releases: Apache Hadoop 2.6.0, Apache Ambari 1.7.0, and updates from Cloudera, MapR, and Hortonworks. Technical posts from LinkedIn, Pinterest, and Spotify all give a glimpse into the data infrastructure of those companies. There’s also coverage of the Hortonworks IPO and Apache Drill, which graduated from the Apache incubator this week.


LinkedIn has published a technical overview of Gobblin, their data ingestion framework. LinkedIn uses Gobblin to ingest data from Kafka, Databus (database change logs), and external partners in a unified fashion. It’s responsible for things like compaction and privacy compliance. The post promises that in the future Gobblin will be open-sourced.

The Cloudera blog has a post about BigBench, which is a specification-based benchmark for big data systems. Parts of BigBench are derived from TPC-DS, including the data model and ~1/3 of the queries. The post also details the upcoming BigBench 2.0 and includes some experimental results from running BigBench against 3 different server configurations.

The Hortonworks blog has a deeper look at features of the recently-released Hive 0.14. There are a number of notable new features, including ACID transactions, a cost-based optimizer, and temporary tables.

Apache Pig 0.14.0 was recently released with some exciting new features. This post looks at several of those features and improvements: Pig on Tez, ORCStorage, predicate pushdown for load functions, auto-shipping of UDF dependencies, and refactoring of jar artifacts.

The Amazon Web Services blog has a post covering how to build a data ingestion pipeline using Data Pipeline and Elastic MapReduce. The tutorial assumes that web servers publish data to an S3 bucket, after which it describes how to run Pig jobs to clean and filter the data and to generate reports using Hive.

This post covers how to use the Oozie workflow system and the Sqoop data transfer framework to import data from MySQL to Hive. The post has a 3-part walkthrough and a troubleshooting section that describes several common errors.

The Cloudera Impala team has written a paper describing Impala. It looks at the use cases Impala aims to solve, gives an overview of the system architecture and main components, includes benchmarks with different file formats and compression, describes Impala’s integration with YARN, and provides a comparison to other SQL-on-Hadoop systems as well as a commercial analytics database engine.

This in-depth tutorial looks at using Microsoft Azure’s HDInsight to run a Storm and HBase cluster to perform sensor analysis. The system consumes data from Event Hub and uses ASP.NET SignalR and D3.js to build a dashboard.

Apache Kafka 0.8.2 is currently in beta release with a final release planned for late this year. This post looks at several new features and improvements of the release, which include a new producer API, topic deletion, offset management in Kafka rather than ZooKeeper, automated leader rebalancing, controlled shutdown, stronger durability guarantees, and connection quotas.

Spotify has written a post about how they’ve been migrating many of their MapReduce jobs from Hadoop Streaming with Python to Apache Crunch. Apache Crunch is based on Google’s FlumeJava, which has a number of nice properties (type-safety, Avro support, simple testing, etc.). The post looks at how Spotify uses Crunch and introduces a new open-source project called crunch-lib, which contains a number of high-level operations.
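For readers who haven’t written a Hadoop Streaming job in Python, here is a minimal word-count sketch of the style of job being migrated away from. It is not Spotify’s code: the function names (`mapper`, `reducer`, `run_local`) are made up for illustration, and the shuffle/sort step that Hadoop performs between map and reduce is only simulated locally with `sorted`. Note that every record is an untyped, tab-separated line of text, which is part of why a typed API like Crunch can be appealing.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word.

    A real streaming mapper would read lines from stdin and print to stdout.
    """
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key.

    Hadoop Streaming hands the reducer its records sorted by key, so
    consecutive records with the same key can be grouped together.
    """
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Simulate map -> sort -> reduce in a single process (illustration only)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["Crunch on Hadoop", "hadoop streaming"]):
        print(record)
```

In an actual streaming job the mapper and reducer would be separate scripts passed to the Hadoop Streaming jar; the local runner above exists only to make the sketch easy to try.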

One of the goals of YARN is to be a general-purpose compute framework for many applications. There have been a few examples so far, but they’re still mostly compute frameworks (e.g. Spark, Storm). This post looks at a new project from LucidWorks for running SolrCloud on YARN. It describes the YARN application lifecycle, how to launch a Solr cluster, and how to shut one down. It’s a great introduction to building a long-running application on YARN.

Among the many SQL-on-Hadoop frameworks is IBM’s Big SQL 3.0. This post compares Big SQL to Cloudera Impala 1.4 and Apache Hive 0.13 in two areas: how much effort is required to port the SQL queries to each engine, and performance. With the caveat that every vendor will try to show their system in the best light, the post claims that Big SQL is 3.6x faster than Impala (5.4x faster than Hive) for a single user at 10TB scale and 2.1x faster than Impala (8.5x faster than Hive) with 4 concurrent streams.

Amazon Web Services has posted some benchmarks for Hadoop workloads using Elastic MapReduce (it looks at 6-node clusters, which is a good start but not a comprehensive overview). They compare the current-generation instances to previous-generation instances, as well as running against data stored in S3 and on an instance-backed HDFS. Current-generation instances are faster and cheaper, which makes them more cost-effective in most cases. The new instance types have a fraction of the instance storage, though.

As Hadoop deployments in the cloud become more economical and performant, we can expect to see more folks deploying Hadoop in the cloud in some shape or form. The Hortonworks blog has two posts about a hybrid cloud—the first details common use cases such as backup (to a blobstore like S3 or Azure Blob storage), development, and burst/overflow. The second describes how you can use the Microsoft Azure cloud towards these ends.

Pinterest describes their analytics system which is built on MySQL and HBase with the Flask web framework. For scalability, the system uses HBase with co-processors and secondary indexes. The post gives a high-level overview of how the HBase storage/compute works as well as how they build rolling-window data sets with Cascading jobs.
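To make the rolling-window idea concrete, here is a small sketch of a trailing-window sum over daily counts. This is only an illustration in plain Python, not Pinterest’s Cascading code; the function name `rolling_sums` and the input shape (one `(day, count)` pair per day, ordered by day, with no gaps) are assumptions.

```python
from collections import deque


def rolling_sums(daily_counts, window):
    """Return (day, trailing-window sum) pairs.

    Assumes daily_counts is a list of (day, count) pairs ordered by day,
    with exactly one entry per day.
    """
    recent = deque()  # counts currently inside the window
    total = 0
    results = []
    for day, count in daily_counts:
        recent.append(count)
        total += count
        if len(recent) > window:
            # Drop the oldest day once the window is full.
            total -= recent.popleft()
        results.append((day, total))
    return results


if __name__ == "__main__":
    counts = [("12-01", 1), ("12-02", 2), ("12-03", 3), ("12-04", 4)]
    for day, total in rolling_sums(counts, window=3):
        print(day, total)
```

Keeping a running total and subtracting the day that falls out of the window avoids re-reading the whole window for every output row.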


ZDNet has coverage of Hortonworks’ IPO. The company will trade on Nasdaq under “HDP,” and 6 million shares of stock are expected to be offered at between $12 and $14. The article has some more details on the IPO and the company based on the S-1 filing.

InfoWorld has an analysis of the Hortonworks IPO and the Hadoop industry as a whole. The reporting is based on the data in Hortonworks’ S-1 filing and previous statements about the company’s finances. The article discusses the role that open-source software plays in the revenue and success of any Hadoop vendor.

Apache Drill, the schema-free SQL system which can query data on lots of different systems, has graduated from the Apache incubator. The Apache blog has more details on the project, and GigaOm research has more background on the SQL-on-Hadoop ecosystem and how Drill fits into it.

If you’re looking for some more information on Apache Drill, the MapR blog describes several use-cases that customers are targeting for Drill. These include self-service data (e.g. integration with Tableau or MicroStrategy), data agility (plugging into many data sources), interactive query response time, and ubiquity (integration with several systems such as Spark and MongoDB).

The edX massive online course organization is offering two courses for Apache Spark. “Introduction to Big Data with Apache Spark” will be taught by UC Berkeley professor Anthony Joseph and will start on February 23rd. “Scalable Machine Learning” will be taught by UCLA assistant professor Ameet Talwalkar and will start on April 14th.


Apache Hadoop 2.6.0 was released this week. The new release includes a number of improvements, bug fixes, and new features. These include a (beta) key management server, improved support for heterogeneous storage tiers, (beta) transparent encryption at rest, support for long-running services in YARN, support for rolling upgrades, (beta) support for running applications in Docker containers, and much more. The Hortonworks blog has more details on several of the new features as well as a look forward to the Hadoop 2.7 release.

On the heels of the Apache Hadoop 2.6.0 release, Hortonworks has announced HDP 2.2, based on Hadoop 2.6.0 and with new versions of all core components. This release also adds several new systems which were previously in technical preview: Apache Spark, Apache Slider, Apache Kafka, and Apache Ranger. Other major features of the release are support for rolling upgrades, automated cloud backup to Azure and S3, and phase 1 of (Hive 0.14.0).

If you’re looking to try out Hadoop 2.6.0, there’s a new Docker image for the release. This post from SequenceIQ shows how to start the container and run an example Hadoop job.

MapR has announced upgrades to several systems included in its distribution. The new versions include Impala 1.4.1, Spark 1.1.0, Pig 0.13, Hive 0.13, and Sqoop 1.4.5.

DataStax has announced DataStax Enterprise 4.6, which packages an enterprise-ready Apache Cassandra together with management software. The new version includes Apache Spark streaming analytics and security integration.

Apache Ambari 1.7.0 was released this week. It’s a pretty epic release resolving over 1,600 tickets. The Hortonworks blog has details on improvements and new features, which include improvements across operations, extensibility, and the core platform.

Cloudera announced several patch-level releases this week. Cloudera Enterprise 5.2.1 includes fixes to Oozie, YARN, Impala, Cloudera Manager, and Cloudera Navigator. Several previous releases (Cloudera Enterprise 5.1.4/5.0.5 and some CDH4 releases) were all patched to fix the POODLE vulnerability in SSL. Finally, new ODBC and JDBC drivers for CDH 5.2 were released with support for Hive 0.13, Impala 2.0, and more.

The presto-marathon-docker project provides tools for running Presto (the open-source SQL-on-Hadoop/S3/etc. engine from Facebook) inside of Docker and using Mesos to build an on-demand cluster.


Curated by Mortar Data



Databricks Spark Meetup (Playa Vista) - Thursday, December 11

December SF Hadoop Users Meetup (San Francisco) - Wednesday, December 10


Apache Drill Intro: Data Exploration and Analytics on Hadoop (Houston) - Wednesday, December 10


Let's Talk about Hadoop! (Minnetonka) - Wednesday, December 10


Join Us @Chadoopers (Chattanooga) - Thursday, December 11


Hadoop, Pig & Hive (Harrisburg) - Tuesday, December 9

New York

Go Lightning Talks: Apache Kafka; Solving Problems with Bosun (New York) - Tuesday, December 9


Show & Tell: Winter Is Coming Edition (Boston) - Wednesday, December 10

Spark + Cassandra (Waltham) - Wednesday, December 10


What's the Scoop on Hadoop? (Ottawa) - Wednesday, December 10


Building Data Pipelines (Bristol) - Tuesday, December 9


Mobile Beacons and #IoT with SQL on Hadoop Featuring RSA Analytics (Melbourne) - Tuesday, December 9


Intro to Apache Spark: Paweł Szulc (Warsaw) - Tuesday, December 9

Lightning Talks (Warsaw) - Wednesday, December 10


A Machine Learning Pipeline for Event Detection on Spark (Goteborg) - Wednesday, December 10


How to Think in MapReduce (Cluj-Napoca) - Wednesday, December 10


Apache Flink and Generating Query Suggestions on Hadoop (Hoofddorp) - Thursday, December 11


Apache Spark (Buenos Aires) - Thursday, December 11

IRELAND Apache Spark and the Current Big Data Landscape (Dublin) - Thursday, December 11


Big Data and Real-time Analytics with Spark (Bangalore) - Friday, December 12

Practical MapReduce Programming Mini-Course (Chennai) - Saturday, December 13

Hadoop Hackathon (Gurgaon) - Saturday, December 13