Data Eng Weekly

Hadoop Weekly Issue #107

08 February 2015

Apache Hive released version 1.0.0, Apache Kafka released version 0.8.2, and Yahoo open-sourced their Kafka management web tool. There was a lot of industry action, with Cloudera making an acquisition and DataStax acquiring Aurelius, the makers of TitanDB. In addition to all of that, there are plenty of great technical articles.


Hortonworks has published a new round of benchmarks for Apache Hive 0.14, which is the first version with a cost-based optimizer (CBO). On the TPC-DS benchmark, they see an average speedup of 3x across all queries. In addition to those numbers, they've also analyzed the effect of the CBO on query plans (e.g., whether the join order was modified or a predicate was pushed down).

The Parquet file format has gained a lot of traction since being announced as a joint project between Cloudera, Twitter, and others. Parquet is a column-oriented format, which means that the values of each column are stored together rather than row by row. This post gives an overview of the building blocks of a Parquet file: row groups and column chunks. From there, the post gives three guidelines for working with Parquet files.
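
As a rough illustration of row groups and column chunks, here is a toy Python sketch. It is not the Parquet encoding itself (real files also have pages, encodings, compression, and footer metadata); the function name `to_row_groups` and the sample records are made up for this example.

```python
def to_row_groups(rows, columns, row_group_size):
    """Split rows into row groups; inside each group keep one chunk per column."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        batch = rows[start:start + row_group_size]
        # A "column chunk" here is simply the list of one column's values
        # for the rows that belong to this row group.
        groups.append({col: [row[col] for row in batch] for col in columns})
    return groups


# Hypothetical sample records, stored row by row.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": 8},
]

layout = to_row_groups(rows, ["user", "clicks"], row_group_size=2)
```

With this layout, a query that only needs `clicks` can read the `clicks` chunk of each row group and skip the `user` values entirely, which is the main appeal of a columnar format.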

"The morning paper" is a blog which summarizes a new CS paper every weekday morning. This week, the blog highlighted five papers from the 2015 Conference on Innovative Data Systems Research (CIDR). Selections include Liquid, LinkedIn's system for unifying nearline and offline big data systems (built using Kafka, Samza, and Hadoop), and Impala, Cloudera's open-source SQL system for Hadoop.

A lot of companies store analytics data as JSON, since it's an easy-to-use and ubiquitous format. Spark SQL has embraced this, and offers built-in support for JSON. This post looks at programmatically loading a JSON file into Spark SQL (which can infer the schema by scanning the dataset), how JSON data types map to SQL, and more.
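
To give a feel for what inferring a schema by scanning the dataset involves, here is a small pure-Python sketch. It does not use Spark and is not Spark SQL's actual algorithm; the type names, the widening rules, and the helper names (`sql_type`, `merge_types`, `infer_schema`) are assumptions chosen for illustration, and nested objects and arrays are simply treated as strings.

```python
import json


def sql_type(value):
    """Map one parsed JSON value to an illustrative SQL type name."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, plus nested values in this simplified sketch


def merge_types(old, new):
    """Combine the type seen so far with a newly observed type."""
    if old is None or old == "NULL":
        return new
    if new == "NULL" or new == old:
        return old
    if {old, new} == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"  # widen integers to floating point
    return "STRING"  # fall back to the most general type on any other conflict


def infer_schema(lines):
    """Scan newline-delimited JSON records and return {field: type name}."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            schema[key] = merge_types(schema.get(key), sql_type(value))
    return schema


# Hypothetical sample records, one JSON object per line.
records = ['{"a": 1, "b": "x"}', '{"a": 2.5, "c": true}', '{"b": null}']
schema = infer_schema(records)
```

Because every record has to be inspected before the final type of a field is known, inference of this kind costs a full pass over the data.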

The Hortonworks blog has a post describing new features in HDP 2.2 and YARN to support long-running applications. Areas of focus include fault tolerance in the face of ApplicationMaster failure, security (since delegation tokens expire after 24 hours), log handling, service registry/discovery, and resource isolation/scheduling. In addition, Apache Slider (incubating) is used to reduce the amount of effort required to deploy an existing distributed application in YARN. Apache HBase, Accumulo, and Storm are all supported via Slider on YARN in HDP 2.2.

MapR has a blog post describing the differences between MapReduce v1 and MapReduce on YARN. The post walks through the various steps in submitting a job in both frameworks, describes the fair and capacity schedulers, compares the two frameworks, and more.

This post describes how to use the haversine formula to calculate the great-circle distance between two points on Earth using Impala/Hive. The formula makes use of trigonometric and algebraic functions, which can be embedded in a SQL query.
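
For reference, here is a minimal Python sketch of the haversine formula itself. The post expresses the same arithmetic with SQL functions; this version only illustrates the math. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    a = min(1.0, a)  # guard against floating-point rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Two points on the equator that are 90 degrees of longitude apart come out a quarter of the Earth's circumference apart, which makes a handy sanity check.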

KOYA is a project to support running Apache Kafka on Apache Hadoop YARN. Since the project was announced a few months ago, the team has decided to use Apache Slider (incubating) to develop the YARN application. The Hortonworks blog has more details on this decision and plans for the project (which is targeting an initial release in Q2 2015).

The AWS big data blog has a post on using the Accumulo bootstrap action to install Apache Accumulo on an Amazon EMR cluster. The post has a walkthrough that starts a cluster, creates an Accumulo table, inserts and tags data in the table, and illustrates cell-based access controls at query time.

This blog has a post with a quick tip for using Sqoop to dump data from a JDBC database to the local file system. This makes use of a local MapReduce job and the local filesystem implementation.

HiveServer2 began using ZooKeeper for locking in order to support concurrency. This post describes the implementation, how it's being improved to scale to more clients in upcoming releases of Hive, and several failure scenarios that the implementation addresses.

The MapR blog has a post detailing counters in Hadoop MapReduce. It describes the four types of counters: file system, job, framework, and custom. For each, it describes some of the key counters and how they can be interpreted to debug or improve a MapReduce job.

Parsely, makers of analytics software for publishers, have written about "Mage," the system that powers their analytics engine. Mage is built on Apache Kafka and Apache Storm and implements the lambda architecture (Apache Spark is used for batch processing). The post describes how data flows through the system, the scale of their system, and more.

While Hue is predominantly a web-based interface for Hadoop, it also includes an API and a command-line interface. This post gives an introduction to the command-line tools, which can be used to update passwords, run tests, and shut down Hive queries.


O'Reilly Radar has published a new book called "Women in Data" which profiles 15 industry leaders. The interviews share personal stories and also explore a number of topics related to gender diversity in the big data industry. The eBook is free (behind an email wall).

Datanami has two posts on Hadoop and high-performance computing (HPC, which is often used in science applications). The first looks at some of the shortcomings of Hadoop that keep it from really taking off in HPC. These are things like the network layer (using TCP/REST/RPC), the immaturity of schedulers, and HDFS' semantics and performance. The second post looks at some of the integrations that are driving Hadoop and HPC to converge, including InfiniBand, GPU technology, and the cloud.

Cloudera has acquired a maker of tools for doing a meta-analysis of analytics database queries. Their tools can analyze and profile database queries, which are then used to generate optimized schemas for systems like Impala.

DataStax, makers of enterprise software for Cassandra, have acquired Aurelius, who is behind the TitanDB open-source project. TitanDB is a graph database that supports multiple storage backends, including Cassandra and HBase, and has Hadoop integration for analytics of graph data.

Commercial vendors (aside from Pivotal) haven't put a whole lot of momentum behind Tachyon, the in-memory file system from the AMPLab group at UC Berkeley. But BlueData, who makes a platform for Big Data tools like Hadoop, Impala, Hive, and Spark, is investing in supporting Tachyon. Datanami has more about BlueData, the use cases for Tachyon, and how they plan to integrate it.

Informatica and Hortonworks announced the availability of end-to-end visual data lineage of all operations performed through Informatica. Informatica announced a similar integration with Cloudera last year.

GigaOm has coverage of some news on the Hadoop distribution front. In short, it sounds like Pivotal will be scaling back its Hadoop development and/or announcing a more formal partnership with Hortonworks or IBM. The news comes after some notable departures at Pivotal and a round of layoffs. An announcement is expected from Pivotal on February 17th.


Last week, MapR announced a number of updates to their distribution. Hadoop, Hue, Flume, Hive, HBase, Impala, and Storm were all updated to new versions.

Apache Hive 1.0.0 was released this week. The release was previously known as version 0.14.1, but the community decided to rebrand it as 1.0.0 to reflect the maturity of the project. The Hortonworks blog has a detailed look at the history of Hive (the initial release was almost 6 years ago!) while Cloudera has a look at the future of the project.

Cloudera released two new versions of their distribution, CDH, this week. CDH 5.2.3 includes a number of fixes, including important fixes for Avro, HDFS, HBase, Hive, and Impala. CDH 5.3.1 includes fixes to Impala, Hive, and YARN (including fixes for HA).

Version 0.8.2 of Apache Kafka was released this week. The new version contains a number of new features and improvements, including a new Java producer API, Kafka-based offset management, delete topic support, improved configurability of consistency/availability, support for Scala 2.11, and LZ4 compression.

Yahoo! is a big user of Kafka; they have one cluster that does over 20Gbps at peak. To manage Kafka, they've built a web-based tool called Kafka Manager. The tool supports management of multiple clusters, replica election, replica re-assignment, topic creation, and more. It's built with Scala and the Play framework. The code is on GitHub.

Hivemall, the machine learning library for Hive, released version 0.3.0 this week. The new version includes an implementation of matrix factorization.


Curated by Mortar Data



Spark After Dark, by Chris Fregly of Databricks (Santa Monica) - Tuesday, February 10

Python, Sparkling Water, & H2O (Mountain View) - Wednesday, February 11

Couchbase as Operational and Light Analytics to Hadoop (San Diego) - Wednesday, February 11

Do NoSql Like SQL: Introduction to Apache Drill (Woodland Hills) - Wednesday, February 11

Building Real-world Machine Learning Apps with PredictionIO and Spark MLlib (San Francisco) - Thursday, February 12

Deeplearning4j on Spark and Data Science on the JVM with nd4j (San Francisco) - Thursday, February 12


Moneyballing: How to Use Data to Win at Fantasy Football (Portland) - Tuesday, February 10


Better Together: Dato and Spark (Bellevue) - Tuesday, February 10


MapR at Big Data Utah (Salt Lake City) - Wednesday, February 11


HBase/NoSQL Design Patterns (Houston) - Wednesday, February 11

Process & Visualize Data with Hadoop/Hive & Tableau (Addison) - Wednesday, February 11

Getting Started on Hadoop: A Hands-on Experience (Arlington) - Thursday, February 12


Transitioning Compute Models: Hadoop MapReduce to Spark (Chicago) - Thursday, February 12


Netflix, Pig, and Hadoop: Are You Surus? (Chattanooga) - Thursday, February 12

North Carolina

Hive on Spark (Durham) - Tuesday, February 10


From 0 to Streaming: Using Cassandra with Spark (Baltimore) - Wednesday, February 11


Discover the Mesosphere Datacenter Operating System (Paris) - Monday, February 9


Lightning Fast Big Data Analytics with Apache Spark (Edegem) - Wednesday, February 11


Meet Hortonworks (Oslo) - Tuesday, February 10


Apache Hive Workshop (Cluj-Napoca) - Thursday, February 12


Big Data and Product Management at eBay (Tel Aviv-Yafo) - Tuesday, February 10


A Use Case in Hadoop Executed in Apache Spark: Let Us See If It Is 100x Faster (Hyderabad) - Saturday, February 14

Session on MapReduce with Python and Amazon EMR (Pune) - Saturday, February 14


Apache Spark 101 (Melbourne) - Monday, February 9

John Mallory, EMC CTO for Analytics, plus Hadoop 101 with MongoDB Integration (Sydney) - Thursday, February 12