Data Eng Weekly

Hadoop Weekly Issue #67

27 April 2014

There were a large number of releases announced this week, including Apache Hive, Ambari, and Knox. Hortonworks announced that HDP 2.1 is GA as well as an expanded partnership with Concurrent (makers of Cascading). In addition, there are plenty of good technical posts, including ones covering Apache Spark, MapReduce v2, and running HBase on AWS.


This presentation gives an overview of Hadoop, motivates why traditional MapReduce is hard to write (using an inverted index as an example), and gives a tour of Spark, which the presenter suggests will replace it. The tour of Spark includes an overview of the compute model, details on Shark & Spark SQL, and a brief intro to MLlib & GraphX (two Spark libraries).
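For readers who have not seen the inverted index example, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not taken from the presentation and uses no Hadoop or Spark API; the function names and the two sample documents are made up for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, doc_ids):
    """Collapse the grouped document IDs into a sorted posting list."""
    return word, sorted(doc_ids)

def inverted_index(docs):
    """Run the three phases in-process over a dict of doc_id -> text."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())

# Hypothetical sample input.
docs = {"d1": "spark runs on hadoop", "d2": "hadoop runs mapreduce"}
index = inverted_index(docs)
```

In a real MapReduce job each phase is a separate class plus driver code, which is part of the boilerplate the presentation argues against.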

The Hortonworks blog has a tutorial covering the Cascading SDK. The post walks through the various concepts and primitives in Cascading as well as an implementation of word count using the Java API.

Google recently added an implementation of the Hadoop FileSystem API for Google Cloud Storage. The MapR blog has a tutorial that explains how to bring up a MapR cluster with support for Google Cloud Storage in Google Compute Engine.

The Cloudera blog has an in-depth article about migrating from MapReduce v1 to MapReduce on YARN. It discusses things like resource allocation in YARN, changes to logging, and changes to concurrency. For someone familiar with MapReduce but just getting started with YARN, this is a valuable resource for understanding key differences.

The MapR blog has a status update on the Apache Drill project. Drill is a system for large-scale interactive SQL on Hadoop. Heading towards a 1.0 release, Drill has recently gained integration with Hive and HBase among many other features.

The MySQL Performance Blog details the process of exporting data from MySQL to HDFS for analysis with Cloudera Impala. The post walks through exporting data to CSV, copying the data into HDFS, and creating an external table in Impala. It then goes into optimizing file formats for query latency, including a few performance numbers from a six-node cluster.
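As a rough illustration of the last step, the sketch below builds an Impala `CREATE EXTERNAL TABLE` statement for CSV files that have already been copied into an HDFS directory (for example with `hdfs dfs -put`). The helper name, table name, columns, and path are all hypothetical and are not taken from the post; the statement would be run through impala-shell or any other Impala client.

```python
def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build DDL for an Impala external table over delimited text files.

    columns is a list of (name, sql_type) pairs; hdfs_dir is the HDFS
    directory that already holds the exported files.
    """
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{hdfs_dir}';"
    )

# Hypothetical table definition and HDFS path.
ddl = external_table_ddl(
    "ontime",
    [("flight_date", "STRING"), ("carrier", "STRING"), ("flight_num", "INT")],
    "/data/ontime",
)
print(ddl)
```

Because the table is external, dropping it in Impala leaves the files in HDFS untouched, which suits data that is re-exported from MySQL on a schedule.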

The HubSpot dev blog has a post with a number of tips for running HBase on AWS. The post discusses running on c1.xlarge instances, which only have 7 GB of RAM. Tips include tuning the number of regions per RegionServer, optimizing memory (and MSLAB), making the best use of caching and batching, and controlling load from MapReduce.

The Hortonworks blog has a detailed post on installing and configuring a Hadoop cluster on Windows. It uses an MSI and a .NET console application to install across the cluster, uninstall, and add/remove nodes. The code for the project is available on GitHub.

The fourth post in a series on building a Data Lake with the Pivotal stack details setting up a Pivotal HD cluster with HDFS integration. Full of screenshots, the post walks through using the Pivotal management software to start and configure the nodes in the cluster.

The MSDN blog has some examples of using the HCat API to retrieve details on jobs running in the cluster. The post includes scripts to do so written in both PowerShell and node.js.


Hortonworks and Concurrent announced an expanded partnership. As part of the agreement, Hortonworks will deliver Cascading with HDP, and an upcoming release of Cascading will support Apache Tez.


Apache Hive 0.13 was released this week. The new release contains improvements to speed, scale, SQL support, and more. Specifically (and covered in much more depth on the Hortonworks blog), the release includes a new cost-based optimizer, a faster query planner, subquery support for IN and NOT IN, and improvements to HiveServer2 (SSL encryption, PAM authentication, and more).

Apache Ambari 1.5.1, the latest version of the Hadoop cluster management software, was released this week. The Hortonworks blog has details on the new release, which includes features like maintenance mode, rolling restarts, bulk host operations, and decommissioning of nodes.

Impyla is a Python library for Cloudera Impala. It includes tools for writing Impala UDFs in Python using Numba. Impyla also offers integration with pandas and MADlib.

The Hortonworks blog has a post on the recently released Apache Knox version 0.4.0. Knox provides a secure gateway to a Hadoop cluster via a REST API. This release includes enhancements such as extended integration with Apache Shiro for determining group membership and an audit log of all gateway activity.

Kite SDK (formerly Cloudera Development Kit) released version 0.13.0 this week. The new version includes a new command-line interface with tools for converting CSV data to Avro and CRUD operations for Kite datasets. There are also a number of updates to the morphlines library.

On the heels of the Apache Hive, Tez, and Knox releases, Hortonworks announced general availability of version 2.1 of the Hortonworks Data Platform (HDP). In addition to those components, the release includes support for Apache Accumulo, Phoenix, Storm, Solr, Falcon, and the Cascading SDK.

Oracle’s Big Data Appliance 3.0 shipped this week. It includes Cloudera's CDH 5.0 pre-installed and configured, Apache Sentry pre-configured, and support for Apache Spark.

Apache Gora 0.4 was released. Gora is a framework for in-memory data modeling for big data, supporting a wide range of data stores such as column stores, document stores, RDBMSs, and more. The Gora project recently announced that Apache Giraph uses it as its data persistence abstraction.

Parquet 1.4.2 was released this week. It includes a number of bug fixes and improvements, including a better strategy for generating splits.

Phantom is an asynchronous type-safe Scala DSL for Cassandra. It supports data modeling, querying, automated schema generation, time series, composite keys, secondary indexes, and more. Phantom was developed and open-sourced by newzly.

Apache Twill 0.2.0-incubating was released this week. The new release includes a number of bug fixes, improvements, and new features. Of note, it adds support for Hadoop 2.3.0 and a new TwillRunnable that runs bundled jars without a need to worry about dependency conflicts with Twill itself.


Curated by Mortar Data



California

SF: Next Generation Hadoop Architecture with Roman Shaposhnik (San Francisco) - Tuesday, April 29

Unsupervised Learning and Multinomial Logistic Regression with Apache Spark (San Francisco) - Thursday, May 1

Apache Tez - A Modern Processing Engine for Hadoop 2 (Santa Clara) - Tuesday, April 29

Spark - The steroid for Hadoop (San Ramon) - Wednesday, April 30

SoCal Edison's Hadoop Program (Irvine) - Thursday, May 1

Learn how to secure, govern and explore Big Data in Hadoop (Mountain View) - Thursday, May 1


Texas

How to Stop Worrying and Start Modeling Big Data with Better Algorithms and H2O (Houston) - Monday, April 28

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, April 30


Indiana

Big Data & Analytics Developer Day (Indianapolis) - Wednesday, April 30


Pennsylvania

Impala - Straight from the Antelope's Mouth (Philadelphia) - Tuesday, April 29

North Carolina

April CHUG: Moving Customer Analytics to Hadoop (Charlotte) - Wednesday, April 30

New Jersey

Secure Ingestion to Visualization - Dataguise - Architecture and Demo (Flemington) - Tuesday, April 29

New York

NYC Spark Users/Potentials Meet & Greet (New York) - Wednesday, April 30

Intermediate Workshop I: Integrate R with Hadoop (New York) - Thursday, May 1


Toronto Hadoop User Group Monthly Meetup (Toronto) - Wednesday, April 30

Mississauga Big Data Analytics Meetup 6 in Mississauga (Mississauga) - Sunday, May 4


Workshop by Syncsort: Make your ETL on Hadoop Smarter & Faster (Paris) - Tuesday, April 29


Exist global talking about their big data project (Perth) - Wednesday, April 30