Data Eng Weekly

Hadoop Weekly Issue #62

23 March 2014

Cloudera and Platfora both reported new rounds of funding this week, and MapR and Jaspersoft as well as Cloudera and Trifacta announced new partnerships. In addition, Pivotal introduced a new version of their distribution, Pivotal HD, and Microsoft announced the general availability of the newest version of HDInsight, which includes Apache Hadoop 2.2. With several interesting technical articles, this week’s newsletter should have something for everyone.


In a tutorial that combines Apache Pig, Cloudera Impala, and Microsoft Power BI, you’ll load a dataset describing on-time performance of flights in the US over the last 30 years. The data describing each flight is joined with carriers, planes, and airports in a Pig job. Next, Pig is used to do simple aggregate analysis. Finally, the tutorial walks through hooking up Microsoft Power BI to data retrieved through Cloudera Impala in order to do more advanced analysis.

Apache Spark is a new framework for running distributed computations that is gaining a lot of traction. Among the reasons: Spark programs are much easier to write than traditional MapReduce, and the API is easy to pick up, especially for developers experienced with functional programming. This post tours the Spark API, which is available in Java, Scala, and Python, makes heavy use of closures, includes a REPL shell, and more.
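To give a feel for the closure-heavy style the post describes, here is a minimal sketch in plain Python. It does not use Spark at all; the standard-library `filter`, `map`, and `reduce` only stand in for the kind of transformations a Spark program chains together, and the helper name `make_threshold_filter` and the sample data are made up for illustration.

```python
from functools import reduce


def make_threshold_filter(min_length):
    """Return a predicate that closes over min_length."""
    def keep(word):
        # min_length is captured from the enclosing scope (a closure).
        return len(word) >= min_length
    return keep


lines = ["spark makes closures easy", "map filter reduce"]

# Split every line into words (similar in spirit to a flatMap step).
words = [word for line in lines for word in line.split()]

# Keep only the longer words, using the closure as the predicate.
long_words = list(filter(make_threshold_filter(6), words))

# Sum the lengths of the remaining words with a map followed by a reduce.
total_chars = reduce(lambda a, b: a + b, map(len, long_words), 0)

print(long_words, total_chars)
```

The point of the sketch is only that small functions capturing local variables are passed to generic transformations; in a distributed framework those functions would be shipped to worker nodes rather than run in a single process.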

The Cloudera blog has a post on using Parquet with several parts of the CDH distribution—Impala, Hive, Pig, and MapReduce. The post gives an overview of the Parquet file format, which provides efficient columnar storage. It then covers reading and writing data stored as Parquet in each of the ecosystem components. The post also has some operational details of configuring Pig and Oozie as well as Parquet interoperability.
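As a rough illustration of why columnar storage helps analytic queries, the sketch below pivots a few row-oriented records into per-column lists in plain Python. This is a toy model, not Parquet's actual on-disk format (which also involves encoding and compression); the field names and the helper `to_columns` are hypothetical.

```python
# Row-oriented records, as they might arrive from an ingest pipeline.
rows = [
    {"carrier": "AA", "delay": 12},
    {"carrier": "DL", "delay": 0},
    {"carrier": "AA", "delay": 45},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field now touches only that column's values.
average_delay = sum(columns["delay"]) / len(columns["delay"])

print(columns)
print(average_delay)
```

Because all values of a field sit together, a query that needs only `delay` can skip the other columns entirely, and similar values stored side by side tend to compress well.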

The MSDN blog has a post about using Apache Flume with HDInsight, Windows Azure's Hadoop-as-a-Service offering. The post uses Azure Blob storage as the destination for data flowing through Flume. The Azure storage is accessed via a virtual drive mounted on a Windows machine. The post walks through the Flume configuration for ingesting data from the Twitter firehose and includes an example MapReduce job that counts tweets by source once the data is in Azure storage.
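The counting job is a classic map/reduce pattern. The following is a single-process Python sketch of that pattern, not the code from the MSDN post; the `source` field name, the sample tweets, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map phase: emit (source, 1) for each tweet."""
    yield tweet["source"], 1


def reduce_counts(source, counts):
    """Reduce phase: sum all the counts emitted for one source."""
    return source, sum(counts)


def run_job(tweets):
    # Shuffle step: group the mapped values by key.
    grouped = defaultdict(list)
    for tweet in tweets:
        for key, value in map_tweet(tweet):
            grouped[key].append(value)
    # Apply the reducer once per key.
    return dict(reduce_counts(key, values) for key, values in grouped.items())


tweets = [{"source": "web"}, {"source": "mobile"}, {"source": "web"}]
counts = run_job(tweets)
print(counts)
```

In a real MapReduce job the framework performs the grouping across machines; here a dictionary plays that role.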

In the second post in a series, the MortarData blog has more tips on translating from SQL to Pig Latin. The overview encompasses a number of common queries, such as self-join and OVER… PARTITION BY. Since most developers have at least a bit of exposure to SQL, this is a good reference for getting started with Pig.
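For readers less familiar with OVER… PARTITION BY, the sketch below shows in plain Python what a per-partition row number computes: rows are grouped by a key and then numbered within each group. It is not taken from the Mortar post and is neither SQL nor Pig Latin; the column names and the helper `rank_within_partition` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def rank_within_partition(rows, partition_key, order_key):
    """Number rows within each partition, highest order_key first."""
    ranked = []
    # groupby needs its input sorted by the grouping key.
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        by_order = sorted(group, key=itemgetter(order_key), reverse=True)
        for position, row in enumerate(by_order, start=1):
            ranked.append({**row, "rank": position})
    return ranked


rows = [
    {"dept": "a", "salary": 10},
    {"dept": "b", "salary": 5},
    {"dept": "a", "salary": 30},
]
ranked = rank_within_partition(rows, "dept", "salary")
for row in ranked:
    print(row)
```

The group-then-iterate shape of this code is close to how such queries are usually restructured for Pig, where a grouping step is followed by per-group processing.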

insideBIGDATA has a brief recap of last week’s Los Angeles HUG, which featured a talk by streaming video company Hulu on their data platform. Hulu uses MapReduce, HBase, and Hive as well as a home-grown custom language that translates to Java MapReduce jobs. The summary also includes a link to a video of the presentation.

The Simba blog has a post on a performance regression in HiveServer2 vs. HiveServer (in the server itself, not the data processing). Work to address this regression has been completed and will be part of the Hive 0.13 release. This post speaks about the problem and presents some benchmarks demonstrating the effect and recent improvements.

The Hortonworks blog has a post on configuring Hadoop to use LDAP to determine group membership. It walks through the steps required to enable LDAP, which include modifications to core-site.xml and a restart of the NameNode and YARN ResourceManager. The post concludes with details on the trade-offs of this approach vs. OS-based group mapping (which is not supported on Windows).


Hortonworks shares how they think about Enterprise Hadoop, which encompasses a number of projects in the Hadoop ecosystem. They bucket components into five areas: Data Management, Data Access, Data Governance & Integration, Security, and Operations. The post describes each of these areas in detail and how the various software components fit into them.

A second post on the Hortonworks blog details how many companies adopt Hadoop, progressing from a specialized analytics tool to a full-blown “Data Lake.” The post suggests how to use Hadoop to supplement existing data warehousing tools, and how the new possibilities introduced by ‘schema-on-read’ make it a destination for all of an enterprise’s data.

Cloudera announced a new $160 million round of financing. The round was led by T. Rowe Price, “three top-tier public market investors,” Google Ventures, and an affiliate of MSD Capital, L.P. Cloudera’s last round was in December 2012, and there has been a lot of speculation that Cloudera would go public in 2014.

MapR and Jaspersoft announced an integration that allows Jaspersoft’s Business Intelligence software to run on the MapR Distribution. The integration includes support for MapR running on Amazon Elastic MapReduce.

Cloudera and Trifacta announced a partnership. Trifacta’s software helps to speed up the data munging portion of data analysis, and the new partnership includes “joint development, certification, and solution collaboration with customers.”

Certified on Spark is a new program from Databricks, the company offering commercial support for Apache Spark. Databricks is certifying applications built atop Apache Spark to ensure that they work across distributions, which often have slightly different versions of Spark and the Hadoop stack.

Network World has a post surveying nine of the major players in the Hadoop industry. The article, which is based on details from the Forrester Wave report, includes a short overview of each of the companies with details on customer-base, differentiation, and more.

Platfora, whose analytics platform runs on data stored in Hadoop, has raised a new round of financing totaling $38 million. A post on GigaOm has a good overview of Platfora’s system, which is an intelligence and visualization tool built to handle all types of data stored in Hadoop.

GigaOm’s Structure Data conference took place in New York this week. eWeek has boiled down a lot of the content from day one into five key takeaways. Several cover how Hadoop is still difficult to deploy and operate, and others are forward-looking: the big vendors will be challenged by Hadoop, but legacy systems aren’t going away anytime soon. There’s a lot of good content about the present and future of Hadoop as a data platform.


Pivotal HD 2.0 was released. Notable updates include rebasing on Apache Hadoop 2.2, integration with GemFire XD for in-memory real-time data storage and computation, and GraphLab for graph analysis in R, Python, and Java.

Microsoft announced that Windows Azure’s HDInsight supporting Hadoop 2.2 has exited public preview and is generally available. Hadoop 2.2 was in public beta for just over a month, and it supports YARN and the Hive improvements from Phase 2 of the Stinger initiative.

Continuuity has open-sourced Loom, their cluster management software. In an introductory post, Continuuity describes the evolution of their cluster software deployment from scripts to Chef recipes to the conception of Loom. Loom targets deployment of any type of distributed system, from an application server to a thousand-node Hadoop cluster, running on a cloud provider or a virtualization platform. While Loom goes open-source, Continuuity is also selling enterprise support.

Version 0.12.1 of the Cloudera Kite SDK was released. The bug fix release addresses issues with the Hive MetaStore and Crunch jobs running on large datasets.

Apache Phoenix 2.2.3-incubating was released. Phoenix is a SQL query engine built atop Apache HBase that provides JDBC access to HBase tables. The release includes several bug fixes.


Curated by Mortar Data



The First Meetup of Los Angeles Big Data Users Group (Los Angeles) - Wednesday, March 26

How to Avoid the 7 Deadly Misconfigurations When Running Hadoop in Production (Mountain View) - Thursday, March 27

MapR Hadoop Distribution Architecture: Why MapR Did What It Did (San Ramon) - Thursday, March 27


Big Data Hadoop Meetup (Denver) - Tuesday, March 25


Advanced Hadoop Based Machine Learning (Austin) - Wednesday, March 26


Introduction to YARN (St. Paul) - Thursday, March 27


March Big Data Analytics MapR and Core Analytics LLC (Chicago) - Monday, March 24


Big Data with Hadoop (Indianapolis) - Wednesday, March 26


Hadoop Meetup Update (Akron) - Thursday, March 27


Hadoop, the Data Lake, and a New World of Analytics (Atlanta) - Tuesday, March 25

North Carolina

March CHUG (Charlotte) - Wednesday, March 26

HBase Lessons Learned in Production (Durham) - Thursday, March 27

Washington, D.C.

Introducing Hadoop 2.0 on Windows; Hadoop Security (Washington) - Monday, March 24


MongoDB Hadoop Connector (Annapolis Junction) - Tuesday, March 25

Cloudera Invasion: Continuous Integration (Overview + Jenkins) and Impala (Baltimore) - Tuesday, March 25

New York

Storm at Spotify (New York) - Tuesday, March 25

Workshop for Beginners I Getting started with Hadoop (New York) - Thursday, March 27


Apache Spark for Data Science (Amsterdam) - Thursday, March 27


In last week’s newsletter, I said that the Oracle R Advanced Analytics for Hadoop project was free software. It is paid software, but Oracle provides an evaluation version for development.