Data Eng Weekly

Hadoop Weekly Issue #39

13 October 2013

Cascading, the CDK, and Summingbird all had releases this week, and Apache Drill had its first milestone release. This week's newsletter also features articles about integrating HUE with Sqoop and Hive with HBase. And among many other great articles, there's a really interesting one detailing the design of a new Hive extension that will support updates, deletes, and inserts into Hive tables. Enjoy!


The Pivotal blog has a walkthrough of using Flume, HDFS, and HAWQ (their SQL-on-Hadoop implementation) to analyze Twitter data. The post covers setting up Flume to ingest tweets, creating an external table in HAWQ using PXF (the Pivotal Extension Framework), and using PXF's JsonResolver to parse the tweets at query time. It also shows how to create a view on top of the table and includes several example queries over the dataset.

g33ktalk has posted the video of Eric Sammer's talk on the Cloudera Development Kit (CDK) at the Big Data Gurus meetup in September. The talk covers the motivation for CDK, the CDK data module library, morphlines (for ETL), and the future of the CDK.

As anyone who has written MapReduce or other JVM code for a Hadoop framework knows, classpath conflicts with Hadoop are easy to hit. Hadoop's dependencies tend to be older, stable versions of libraries, so bundling a newer version of a library that Hadoop also depends on can cause runtime errors. The Kiji blog has an overview of how to configure Hadoop (there are a few places and options) to put user code jars first on the classpath in order to resolve these kinds of issues.

In the fifth post in their series on Apache Tez, the Hortonworks blog covers Tez's support for "dynamic graph reconfiguration." In particular, the post covers two use-cases that offer a lot of opportunity for optimization: adjusting the number of reduce tasks and determining when to begin slow-start reduce tasks (reduce tasks that start before all mappers are finished). The post also covers the key parts of the Tez API that make dynamic graph reconfiguration possible.

Pig's cross join produces too much data for all but the smallest of datasets, so it's usually not a good idea to use it. The Mortar blog has an example of when cross join is a poor choice, and they show how to compute the same output with a normal join.
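To make the cost difference concrete, here is a minimal sketch in Python (not Pig, and not the example from the Mortar post). The function names and the sample data are hypothetical. It compares pairing every row with every other row and filtering afterwards against an ordinary key-based join that only touches matching rows.

```python
from collections import defaultdict
from itertools import product


def cross_then_filter(left, right):
    """Pair every left row with every right row, then keep matching keys."""
    return [(l, r) for l, r in product(left, right) if l[0] == r[0]]


def hash_join(left, right):
    """Equi-join: index the right side by key, then probe once per left row."""
    index = defaultdict(list)
    for r in right:
        index[r[0]].append(r)
    return [(l, r) for l in left for r in index.get(l[0], [])]


# Hypothetical sample data: (user_id, name) and (user_id, page).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
clicks = [(1, "home"), (3, "cart"), (3, "pay")]

# Both approaches produce the same rows, but the cross join considers
# len(users) * len(clicks) candidate pairs before filtering.
same_output = cross_then_filter(users, clicks) == hash_join(users, clicks)
candidate_pairs = len(users) * len(clicks)
```

With two inputs of a million rows each, the cross join would consider a trillion candidate pairs, while the keyed join only does work proportional to the inputs plus the matches.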

DataStax Enterprise supports MapReduce (and frameworks built atop MapReduce) on data stored on Cassandra via the Cassandra File System. The DataStax blog has some tips on tuning a Cassandra cluster for MapReduce. The recommendations cover both the Cassandra and MapReduce parts of the system.

HUE, the UI for applications in the Hadoop system, has recently added support for importing data from a relational database to HDFS (and vice versa). It uses Sqoop 2 for this, which exposes a JSON/REST API for launching jobs. In this post, the HUE blog walks through importing data from a MySQL database to HDFS.

A new project to add support for updates, inserts, and transactions to Hive is covered on the Hortonworks blog. The proposed implementation uses the Hive Metastore to maintain transaction ids, writes delta files within Hive partitions on HDFS using ORC files, and supports two types of compaction to prevent too many files from accumulating in Hive data directories. The post covers all of these decisions in detail and also talks about why it's not implemented on top of HBase.
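The base-plus-delta idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the design from the Hortonworks post: in-memory dicts stand in for ORC files, and every name here (`apply_delta`, `read`, `compact_deltas`, `compact_into_base`, the `txn_id` and `ops` fields) is hypothetical. The two compaction functions are one plausible reading of "two types of compaction": one folds deltas together, the other rewrites the base.

```python
def apply_delta(rows, delta):
    """Return a copy of rows with one delta's operations applied in order."""
    merged = dict(rows)
    for op, row_id, value in delta["ops"]:
        if op == "delete":
            merged.pop(row_id, None)
        else:  # "insert" and "update" both write the latest value
            merged[row_id] = value
    return merged


def read(base, deltas):
    """Merge-on-read: replay deltas over the base in transaction-id order."""
    rows = dict(base)
    for delta in sorted(deltas, key=lambda d: d["txn_id"]):
        rows = apply_delta(rows, delta)
    return rows


def compact_deltas(deltas):
    """Fold many small deltas into a single delta, leaving the base untouched."""
    ordered = sorted(deltas, key=lambda d: d["txn_id"])
    if not ordered:
        return []
    ops = [op for d in ordered for op in d["ops"]]
    return [{"txn_id": ordered[-1]["txn_id"], "ops": ops}]


def compact_into_base(base, deltas):
    """Rewrite the base with every delta applied, leaving no deltas behind."""
    return read(base, deltas), []


# Hypothetical partition contents: row id -> value, plus two deltas.
base = {1: "a", 2: "b"}
deltas = [
    {"txn_id": 2, "ops": [("delete", 2, None)]},
    {"txn_id": 1, "ops": [("update", 1, "A"), ("insert", 3, "c")]},
]
```

Readers see the same rows before and after either compaction; compaction only changes how many files have to be opened to produce them.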

The IBM developerWorks blog has an in-depth post covering Hive and HBase. The post starts with an overview of Hive and HBase, shows examples of loading CSV data into Hive and HBase, and finally covers how to create a Hive table for an HBase table. It concludes with an overview of when HBase or Hive is likely the right solution for various use-cases.

The Gilt Groupe tech blog presents experimental results comparing Teradata Aster and Apache Hive 0.11 (running in Hortonworks' HDP 2.0). The experiment compares an 8-node Aster cluster and a 9-node Hadoop cluster on 900 million rows of data (in the 100GB range). The post concludes that Hive performs pretty well for many queries, but Aster is better for others and more stable.

Congratulations to the authors of "Apache Hadoop YARN: Yet Another Resource Negotiator" for winning best paper at the 2013 ACM Symposium on Cloud Computing. The paper is co-authored by 16 folks from Hortonworks, Microsoft, InMobi, Yahoo, and Facebook.


The Cloudera Development Kit (CDK) reached version 0.8.0 this week. The release contains a new Dataset Repository URI feature, much wider support for exporting internal metrics (to JMX, slf4j, HTTP, and CSV), and upgrades to several dependencies (including Parquet, which was updated to version 1.2).

Cascading 2.2 was released. New features include better handling of small files and more generalized aggregator support via the Comparable interface. The release also improves resource usage during filesystem operations and dynamic classpath management, among other changes.

The Cascading SDK 2.2 was also released this week. It includes Lingual (the SQL engine for Hadoop), Cascalog (the Clojure DSL for Cascading), and Scalding (the Scala library for Cascading). In addition, a Vagrant-based deployment of Hadoop with Cascading 2.2 pre-installed is available for testing.

Summingbird 0.2.2 (version "Multi-sum madness") was released. The new version contains a number of incremental improvements, as well as some new features such as streaming left join and a flatMapKeys function. Details of all 13 resolved issues are on the release page.

htuple is a new library from Alex Holmes, author of "Hadoop in Practice," to help ease the burden of secondary sorting in vanilla Java MapReduce. It uses a Tuple and ShuffleUtil builder API to support generic secondary sorting.
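For readers unfamiliar with the pattern, secondary sorting can be sketched as follows. This is a plain-Python simulation of the general technique, not htuple's API. The `secondary_sort` function, its `num_partitions` parameter, and the sample records are all hypothetical. The idea is to partition and group on the natural key while sorting on a composite of the natural key and a secondary field, so each reducer sees a key's values already in order.

```python
from itertools import groupby


def secondary_sort(records, num_partitions=2):
    """Simulate a shuffle in which each key's values arrive pre-sorted."""
    partitions = [[] for _ in range(num_partitions)]
    for natural_key, secondary, value in records:
        # Partition on the natural key only, so one key's records stay together.
        target = hash(natural_key) % num_partitions
        partitions[target].append(((natural_key, secondary), value))

    output = {}
    for part in partitions:
        # Sort on the full composite key: natural key first, then secondary field.
        part.sort(key=lambda kv: kv[0])
        # Group on the natural key only, as a grouping comparator would.
        for natural_key, group in groupby(part, key=lambda kv: kv[0][0]):
            output[natural_key] = [value for _, value in group]
    return output


# Hypothetical records: (natural key, secondary sort field, value).
records = [
    ("smith", 3, "c"),
    ("jones", 1, "x"),
    ("smith", 1, "a"),
    ("smith", 2, "b"),
]
result = secondary_sort(records)
```

In real MapReduce code, the three commented steps correspond to a custom partitioner, a sort comparator, and a grouping comparator, which is the boilerplate a library like this aims to reduce.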

Apache Drill's first milestone release (1.0.0-m1) came out this week. Apache Drill is a low-latency SQL-on-Hadoop project based upon Google's Dremel. Details of the functionality and features in the release will likely be covered by some of the contributors soon (I didn't have any luck uncovering a good resource), but it's encouraging to see such a young project making its first release.


Curated by Mortar Data

Tuesday, October 15

Minneapolis Area Lucene/Solr Meetup (Minneapolis, MN)

Solr + Hadoop = Big Data Search (Durham, NC)

Panel: Making Sense of Big Data (New York, NY)

MapReduce 101 (São Paulo, Brazil)

Wednesday, October 16

40th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale, CA)

Big Data Use Cases in High Tech (Mountain View, CA)

Hadoop Users Pro Group Pittsburgh October Meetup (Pittsburgh, PA)

The Art and Craft of Big Data Analytics (Denver, CO)

Hadoop Adventures at Spotify (+) Hue – the open source Apache Hadoop UI (Stockholm, Sweden)

Impala: A Modern, Open-Source SQL Engine for Hadoop (Dallas, TX)

Data Lessons Learned at Scale (Washington, DC)

Hadoop Based Machine Learning Course (Austin, TX)

Introduction to Digital Signal Processing in Hadoop (New York, NY)

Thursday, October 17

Hadoop Ecosystem (Munich, Germany)

St. Louis Hadoop Users Group Meetup (St. Louis, MO)