Data Eng Weekly

Hadoop Weekly Issue #64

06 April 2014

It was a busy week full of news and releases thanks to Hadoop Summit EU, which took place in Amsterdam last week. Hortonworks, IBM, and Cloudera announced new versions of their distributions, and several ecosystem projects, including Oozie and Tez, had new releases. Tajo, the SQL-on-Hadoop system, graduated from the Apache incubator, and there are a number of technical posts covering many different Hadoop-related topics.


As a YARN application, Apache Tez is easy to deploy: it only requires putting a few jars and config files in HDFS. This post further explores the deployment details, including how the setup makes it easy to do rolling upgrades. It also goes through the details of other key design aspects of Tez: failure handling and global optimization.
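To make the rolling-upgrade idea concrete, here is a minimal sketch, not taken from the post itself. It assumes each Tez version is uploaded to its own HDFS directory and that clients select a version through the `tez.lib.uris` property. The namenode address, the `/apps` base directory, and the helper names are hypothetical.

```python
def tez_lib_uris(namenode, version, base_dir="/apps"):
    """Build a tez.lib.uris value for one Tez version.

    Assumes (hypothetically) that the Tez jars live under
    <base_dir>/tez-<version>/ and their dependencies under lib/.
    """
    root = f"{namenode}{base_dir}/tez-{version}"
    return f"{root}/,{root}/lib/"


def client_config(namenode, version):
    """Return the client-side setting that pins a job to one Tez version."""
    return {"tez.lib.uris": tez_lib_uris(namenode, version)}


# Two versions can sit side by side in HDFS; each client picks one.
old = client_config("hdfs://nn.example.com:8020", "0.3.0")
new = client_config("hdfs://nn.example.com:8020", "0.4.0")
```

Because the jars are resolved per job from HDFS, switching a client from `old` to `new` would not require touching the cluster nodes, which is the property that makes upgrades gradual.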

Apache Falcon, which is a data processing and management system for Hadoop, includes support for notifications to non-Hadoop systems. This tutorial walks through setting up a Hadoop and Oozie cluster with Falcon, configuring ActiveMQ, and configuring Falcon to route to Camel/Karaf. The post is quite detailed and full of good information.

Slides from a recent talk by CTO Andrew Montalenti describe a data platform that includes Kafka and Storm for real-time processing as well as EMR and Pig for batch processing. The slides have a number of very useful insights from the company's real-time log processing systems, such as details of running a Storm cluster and info on how other folks are using Storm and Kafka.

InfoWorld has done a deep dive on HBase and Cassandra, analyzing scalability, reliability, flexibility, and operations. Over the course of three articles, they cover topics such as the HBase and Cassandra data models and implementations. The series includes a list of pros and cons for each as well as a detailed “showdown” between Cassandra and HBase.

Hortonworks announced HDP 2.1 (more details below), and they've published a number of tutorials related to new and improved features. Using the Hortonworks Sandbox (which runs in a local VM), you can try out the new features of Apache Hive (including Tez support), the newly integrated Apache Falcon (for data governance), Apache Storm (for stream processing), and Apache Knox (for perimeter security).

This tutorial covers using Mortar, the Hadoop-as-a-Service system, to analyze data from the Health and Human Services DocGraph dataset. Applications built with Mortar use Pig, and the tutorial includes a number of Pig script examples.

The MapR blog has a post for setting up PAM authentication for Hive. PAM auth affects JDBC and ODBC sessions, requiring that they provide a login and password. The post includes some details that are specific to MapR’s setup, but the instructions should be broadly applicable to any Hive deployment.
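As a rough illustration of what PAM authentication involves, the sketch below generates the two HiveServer2 properties that are usually set to enable it, and the JDBC URL a client would then open with a login and password. It is not taken from the MapR post. The PAM service names, host name, and helper function names are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def pam_properties(pam_services):
    """HiveServer2 properties that switch authentication to PAM."""
    return {
        "hive.server2.authentication": "PAM",
        # Comma-separated PAM service names; the values are site-specific.
        "hive.server2.authentication.pam.services": ",".join(pam_services),
    }


def to_hive_site_xml(props):
    """Render properties as a hive-site.xml <configuration> fragment."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


def hive_jdbc_url(host, port=10000, database="default"):
    """JDBC URL for HiveServer2; credentials are passed separately."""
    return f"jdbc:hive2://{host}:{port}/{database}"


props = pam_properties(["login", "sshd"])  # hypothetical service names
xml_fragment = to_hive_site_xml(props)
url = hive_jdbc_url("hs2.example.com")  # hypothetical host
```

With such a configuration in place, a JDBC or ODBC client connecting to `url` would be expected to supply a user name and password that the named PAM services can verify.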

Episode 20 of the All Things Hadoop podcast was a discussion on YARN and NextGen Hadoop with Bikas Saha and Arun Murthy. They talk about some of the benefits of and improvements in YARN as well as what’s coming in the next couple of Hadoop releases.

Episode 21 of the All Things Hadoop podcast features an interview with HBase PMC members Michael Stack, Lars Hofhansl, and Andrew Purtell. In the podcast, they speak about the highlights of the HBase 0.96 and 0.98 releases, HBaseCon, and the road to version 1.0.

Datasalt has a post on the state and features of various SQL on Hadoop implementations. All in all, the post covers 16 different systems, from open-source to proprietary, from MPP to OLTP. It has a quick blurb on each as well as a table that breaks down support for Batch, Interactive, Point-querying (i.e. sub-second reads to power a front-end), and Operational SQL (read/write support and transactions).


Apache Tajo, the SQL-on-Hadoop data warehousing project, graduated from the Apache incubator. Tajo was started at Korea University and joined the Apache incubator a year ago. Its last release was in November.

Cloudera announced details of their recent round of financing, disclosing that the total investment from their last round was $900 million and that Intel’s investment gives them an 18% stake in Cloudera. In an interview on the GigaOm Structure Show podcast, Cloudera CEO Tom Reilly gave more details about the investment, including that a sizable chunk (around 40%) of the investment will go to existing shareholders.

MapR and Elasticsearch Inc announced an integration this week. MapR’s search product is based on Lucidworks Solr, but this gives customers another alternative for indexing and searching data stored in a MapR cluster in real-time.

Hortonworks and Cleo announced a partnership to bring Cleo Managed File Transfer (MFT) to HDP. Cleo MFT is a high-performance, scalable, and secure file transfer system that is used in a number of industries.

InfoWorld has an article comparing the licensing approaches taken by Hortonworks and Pivotal, which it describes as “radically different.” On one hand, Hortonworks HDP contains all Apache Licensed code whereas Pivotal code is commercial and requires a per-core license (with some interesting shuffling capabilities). The article goes much deeper into the details of the two approaches.

SiliconAngle also has a post about the duality of Hortonworks and Pivotal. This article covers two main points. First, Pivotal is doing a lot to drive down the price of core Hadoop (this article has some more specifics on pricing than InfoWorld’s). Second, it argues that the pace of innovation in the open-source, community-driven HDP shows that it will outpace proprietary, vendor-driven distributions in the future.

Hadoop Summit Europe was last week in Amsterdam, and the keynotes from both days have been posted on YouTube. Speakers included folks from Hortonworks, Microsoft, Teradata, Forrester Research, SAP, and more.

Computing has an article about ING bank’s road to Hadoop. They’re moving parts of their DW stack from EMC, IBM, and Oracle to Hortonworks’ HDP 2. The post has some insight on their development and release process, including that they get more inspiration from tech companies than from other banks. They also find the velocity of open-source software much faster than that of the proprietary systems they’re used to. There’s also some interesting insight into how they filter for malware as bits enter Hadoop.

The Gartner blog has a post on some high-level take-aways from this week’s Hadoop Summit. There are some thoughts on Hadoop 2.0 (don’t bother with Hadoop 1.0 any more), analytics vendors (and how they’re hedging their bets with Hadoop), Hadoop security, the overwhelming number of components in the Hadoop ecosystem, and more.

The O’Reilly blog has a “fun facts” post about HBase. It includes details on many companies using HBase and how they’re using it (emphasizing that it’s widely used outside of advertising).

As part of HDP 2.1, Hortonworks is adding Solr to their distribution. Hortonworks and LucidWorks are teaming up to offer the LucidWorks edition of Solr as the reference architecture. In June, Solr will be available via a sandbox and will be directly integrated into the next release of HDP.


IBM released version 2.1.2 of their BigInsights distribution. The new release is based on new versions of several Hadoop ecosystem components, such as Hadoop 2.2, HBase 0.96, and Hive 0.12. The release also includes a new system called BigR for integrating R with BigInsights to use Hadoop for distributed execution. In addition, it includes a number of enhancements to operational tasks (such as HBase backup) and much more.

Apache Cassandra 1.2.16 was released. This release is a bug fix release for the 1.2.x series containing over 20 fixes.

Hadoop RDMA 0.9.9 was released this week. The project provides high-performance implementations for clusters running with InfiniBand and RoCE (RDMA over Converged Ethernet), which are both widely used in scientific computing clusters. This release is based on Hadoop 1.2.1.

Apache Oozie 4.0.1 was released. The patch-release contains a number of bug fixes and minor improvements (including a bump to Hadoop 2.3 for the hadoop-2 build profile).

Hortonworks has announced HDP 2.1 and a technical preview of the new version of their distribution. HDP 2.1 includes new versions of a number of components, and adds Apache Tez, Storm, Solr, Falcon, and Knox to the growing list. This release also contains the highly anticipated Hive 0.13, which includes improvements from the last phase of the Stinger initiative.

CDH 5 and Cloudera Manager 5 were released this week. The CDH release includes updates to nearly every component in the stack as well as the addition of Spark to the core distribution. Cloudera Manager 5 has a number of new features including extensibility of services deployed by CM.

Version 1.4.0 of Parquet MR, the MapReduce support for the Parquet columnar storage format, was released. This release includes a number of bug fixes, support for protocol buffers, and new file inspection tools.

Scalding 0.9.1 was released. The release contains a number of improvements, including an improved join implementation, support for Avro and Parquet, support for Hadoop counters, a new Matrix API, and more. A post on the Twitter blog has a detailed overview of the new features and improvements.

Kiji has released the “Ebi” BentoBox 2.0.1. This version brings all components up to Scala 2.10, and KijiExpress is now based on Scalding 0.9.1.

Apache Tez 0.4.0-incubating was released. This release includes over 75 closed tickets.

Cascalog, the Clojure library for Hadoop, released version 2.1.0 last week. The new version includes an updated version of the Kryo serialization framework, automatic push down of projections, a fix for vector arguments, and much more.


Curated by Mortar Data



SF:Accessing External Hadoop Data Sources using Pivotal Xtension Framework (PXF) (San Francisco) - Tuesday, April 8

April SF Hadoop Users Meetup (San Francisco) - Wednesday, April 9

Apache Spark - Making Sense of Big Data Faster and Easier (Palo Alto) - Wednesday, April 9

SoCal Edison's Hadoop Program (Irvine) - Thursday, April 10


Houston Hadoop Meetup Series (Houston) - Wednesday, April 9

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, April 9


Hands-on Hadoop with MapReduce, Hive and Impala (Milwaukee) - Tuesday, April 8


Hadoop Data Warehousing with Hive (Miami) - Tuesday, April 8

New York

Hadoop Workshop I: get started (New York) - Monday, April 7

Interactive Graphics at iHeartRadio using Hadoop, R, and Shiny (New York) - Wednesday, April 9


Vermont Hadoop Meetup (Burlington) - Wednesday, April 9


Luigi - Big data, little boilerplate (Warsaw) - Monday, April 7


Building a Hadoop Warehouse with Impala (Munich) - Tuesday, April 8

April Meetup in Karlsruhe (Karlsruhe) - Thursday, April 10


Workshop by Syncsort: Make your ETL on Hadoop Smarter & Faster (Paris) - Tuesday, April 8


Hadoop & Bedrock use case for the Enterprise (Tel Aviv-Yafo) - Tuesday, April 8


April Hadoop Meetup: Data Warehousing with HBase, Sqoop, and Impala (London) - Thursday, April 10

Data Science Toolbox, Apache Spark for Data Science (London) - Thursday, April 10


Hue presentation by Enrico Berti (Singapore) - Thursday, April 10


Big Data and Hadoop (Delhi) - Sunday, April 13