Data Eng Weekly

Hadoop Weekly Issue #78

13 July 2014

This week was fairly low-volume (at least in recent memory), but there are some good technical articles covering Hive, the Kite SDK, Oozie, and more. Also, the videos from HBaseCon were posted, and there were a number of ecosystem project releases.


The Pivotal blog has a post on setting up Pivotal HD, HAWQ (for data warehousing), and GemFire XD (for in-memory data grid) inside VMs using Vagrant. The four-node virtual cluster is set up with a single command, and the blog has more info on the configuration and the tools installed as part of the setup.

Datanami has a post about how Concur, which provides expense reporting software, is implementing Hadoop. They’re running a 40-node CDH cluster and currently using it for classification of expense report items and personalized recommendations. The post is full of anecdotes about their Hadoop rollout that will be useful for anyone in a similar situation.

The Cloudera Kite SDK provides tools and APIs for working with the components of the Hadoop ecosystem. One of these tools is Morphlines, which aims to streamline ETL. This two-part article talks about how to use Morphlines to validate records from a text file and save them into a Hive table. It goes through the Morphlines configuration file options and describes the steps of the process.

The Qubole blog has an article on best practices when working with Apache Hive. It covers how to organize your data on the file system (partitioning and bucketing), choosing serialization formats, configuration parameters to get the most out of Hive (parallel execution and vectorization), and more.
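To make the partitioning and bucketing idea concrete, here is a minimal sketch (not taken from the Qubole article) of how rows might map to a partition directory and a bucket file. The function names, the `bucket_NNNNN` file naming, and the plain modulo standing in for Hive's hash function are all assumptions for illustration.

```python
from collections import defaultdict


def partition_path(table_root, partition_cols, row):
    """Build a Hive-style partition directory, e.g. root/dt=2014-07-13."""
    parts = ["{}={}".format(col, row[col]) for col in partition_cols]
    return "/".join([table_root] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket for an integer key (modulo is an illustrative stand-in)."""
    return value % num_buckets


def layout(rows, table_root, partition_cols, bucket_col, num_buckets):
    """Group rows by the (hypothetical) file they would land in."""
    files = defaultdict(list)
    for row in rows:
        directory = partition_path(table_root, partition_cols, row)
        bucket = bucket_for(row[bucket_col], num_buckets)
        files["{}/bucket_{:05d}".format(directory, bucket)].append(row)
    return dict(files)


rows = [
    {"dt": "2014-07-13", "user_id": 7},
    {"dt": "2014-07-13", "user_id": 12},
    {"dt": "2014-07-14", "user_id": 7},
]
result = layout(rows, "/warehouse/events", ["dt"], "user_id", 4)
for path in sorted(result):
    print(path, len(result[path]))
```

A query that filters on `dt` only needs to read one directory, and rows with the same `user_id` always fall in the same bucket number, which is what makes bucketed joins and sampling cheaper.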

This post covers PigPen, which is a MapReduce library for Clojure open-sourced by Netflix. It walks through some background on Hadoop, Apache Pig (which serves as the execution engine for PigPen), and Clojure. It also gives a brief introduction to Cascading and related projects (such as Pattern, Lingual, and Driven), and how these compare to the Pig-based stack that Netflix uses. Finally, it goes through some examples of PigPen jobs.

In the third part of their series on Apache Oozie, Altiscale has a number of tips for working with the workflow engine. The six tips mostly cover aspects of submitting and running jobs with Oozie.

Hortonworks has curated a list of presentations covering Hadoop operations from the recent Hadoop Summit. Slides and videos for each presentation are available via the Summit archive.

The Cloudera blog has a post on analyzing time-series data with Apache Crunch. The article covers generating Avro-serialized time-series data from SequenceFiles (including the event time-series Avro schema), doing some simple analysis with the Crunch API (e.g. finding min, max, and counts), and doing a cross-join for multivariate analysis. The code for the post is available on GitHub.
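As a rough illustration of the kind of simple analysis mentioned (min, max, and counts per series), here is a small sketch in plain Python. It does not use the Crunch API, and the `Event` and `Stats` record shapes are assumptions made up for this example rather than the schema from the post.

```python
from collections import namedtuple

# Hypothetical record shapes, not the Avro schema from the Cloudera post.
Event = namedtuple("Event", ["series", "timestamp", "value"])
Stats = namedtuple("Stats", ["minimum", "maximum", "count"])


def summarize(events):
    """Compute min, max, and count of values for each series name."""
    stats = {}
    for event in events:
        current = stats.get(event.series)
        if current is None:
            stats[event.series] = Stats(event.value, event.value, 1)
        else:
            stats[event.series] = Stats(
                min(current.minimum, event.value),
                max(current.maximum, event.value),
                current.count + 1,
            )
    return stats


events = [
    Event("cpu", 1, 0.5),
    Event("cpu", 2, 0.9),
    Event("mem", 1, 3.0),
]
summary = summarize(events)
for name in sorted(summary):
    print(name, summary[name])
```

In a distributed setting the same per-key reduction would run as a group-by-key followed by an aggregation; the single-pass loop here only shows the logic.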

The Databricks Cloud was announced at the Spark Summit last week. This post highlights some of the interesting features of the product, including dashboarding and real-time processing. According to the post, the Databricks Cloud aims to make it easy to build products from data.


Recordings of presentations from HBaseCon were posted. There are talks from four tracks—operations, features & internals, ecosystem, and case studies.

The Gartner blog has a post analyzing the rise of Apache Spark, which a number of vendors are jumping to support. It talks about how Spark tends to be easy to integrate (if a Hadoop integration was already done), and also how companies don’t want to be slow to adopt Spark (as many were for Hadoop).

This week, Cloudera announced a partnership with Capgemini, and Hortonworks announced a partnership with Accenture. In both agreements, Capgemini and Accenture will help customers deploy their partner's Hadoop distribution. A post on SiliconAngle talks about how these types of partnerships show that Hadoop is maturing as an enterprise product.

Actian, makers of the Actian Analytics Platform for SQL on Hadoop, announced a number of partnerships including one with Hortonworks.


InformationWeek has an article on the recently announced DataStax Enterprise 4.5 release. In addition to Spark support, the release has improved support for joining data between a Cassandra cluster and a Hadoop cluster (DataStax says they don’t aim to solve data warehousing and are happy to leave that to Hadoop).

Jumbune is a profiler and debugger for Hadoop MapReduce. It offers per job, per job flow, and cluster-wide analysis tools. It was recently open-sourced under the LGPLv3 license by Impetus Technologies.

Scoobi, the Scala library for building MapReduce jobs, released version 0.8.5 this week. The maintenance release includes a number of improvements and some bug fixes.

Spring for Apache Hadoop 2.0.1 was released. It bumps versions of several dependencies, including Apache Hadoop to 2.4.1.

Version 1.0.0 of Cloudera Oryx, a system for real-time machine learning and predictive analytics, was released. The release contains several new endpoints and bug fixes.

Cloudera Enterprise 5.0.3 was released. There are a number of fixes to the CDH stack, including Flume, HBase, HDFS, Hue, Oozie, YARN, and Solr.

ProtectFile for Hadoop is new enterprise encryption software from SafeNet. ProtectFile offers encryption at rest for HDFS and includes automation tools for deployment.

Pentaho 5.1, which was released in June, added support for Hadoop YARN. It also includes integrations with MongoDB, and has a Data Science Pack which integrates with R and Weka. This post from InformationWeek has many more details on the new release.


Curated by Mortar Data



California

Cloudera & Lucidworks: SolrCloud Failover, Testing, and Integration with Hadoop (Palo Alto) - Tuesday, July 15

46th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) - Wednesday, July 16

Hadoop Ask Me Anything (Palo Alto) - Wednesday, July 16

OC Big Data Monthly Meetup #3 (Irvine) - Wednesday, July 16

July SF Hadoop Users Meetup (San Francisco) - Wednesday, July 16

Hey Big Data, Meet Apache Spark, by Marco Vasquez of MapR (Santa Monica) - Wednesday, July 16


Colorado

In-Memory Computing Principles (Denver) - Monday, July 14


Texas

Extending Apache Ambari (Houston) - Thursday, July 17

Hadoop and Big R (Irving) - Saturday, July 19


Nebraska

Shawn Hermans Presents Big Data (Omaha) - Thursday, July 17


Missouri

Apache Cassandra (Saint Louis) - Tuesday, July 15


Illinois

Deep Learning: Theory, Practice and Predictions with H2O (Chicago) - Wednesday, July 16


Georgia

Beyond MapReduce: In-Memory Analysis with Spark and Shark (Atlanta) - Tuesday, July 15

North Carolina

Triad Hadoop Users Group (Winston Salem) - Thursday, July 17

New York

Introduction to Apache Mesos (New York) - Monday, July 14

A Leap Forward for SQL on Hadoop (New York) - Monday, July 14


Massachusetts

Boston Spark User Group July Presentation Night (Cambridge) - Tuesday, July 15


Singapore

Technical Workshop - Revolution Analytics and Cloudera (Singapore) - Monday, July 14


Germany

Couchdoop and Other Consumer Use Cases from the Hadoop Ecosystem (Munich) - Thursday, July 17


Poland

Hadoop 2.0 Processing Framework (Krakow) - Friday, July 18


India

Hadoop Map-Reduce with Cascading (Hyderabad) - Saturday, July 19

Big Data Meetup (Bangalore) - Saturday, July 19

Hadoop Meetup (Bangalore) - Saturday, July 19