Data Eng Weekly

Hadoop Weekly Issue #58

23 February 2014

I only found a few technical articles this week (please send anything my way if I missed it!), but there were several releases from the Hadoop ecosystem to read up on and try out. Of note, HBase 0.98 was released with an impressive set of new features and performance improvements. Elsewhere, the Apache Spark project passed a vote to graduate from the incubator, MapR announced that they’re expanding into Korea, and Intel announced a partnership with Big Data Partnership in Europe.


Hue is a web application for interacting with Hadoop. It’s included with distributions from many of the leading Hadoop vendors. Hue’s authentication layer is pluggable, supporting OpenID, OAuth, LDAP, and more. A post on the Cloudera blog walks through the steps necessary to configure Hue to use LDAP as the authentication backend.

The Hortonworks blog has a post demonstrating how to extract data from Teradata to Hadoop, do some analysis in Hadoop, and import the data back into Teradata. The tutorial uses Apache Oozie to drive the workflow, which consists of Apache Sqoop jobs and Apache Hive jobs. Hue is used to build the Oozie workflow. Most of the post is not specific to Teradata, so it’ll be useful if you’re looking to do something similar with any Sqoop-supported database.

Episode 18 of the All Things Hadoop podcast features an interview with Marcel Kornacker, the architect of Cloudera Impala. Before joining Cloudera, Marcel was a tech lead on Google’s F1 database system, so he has a lot of experience building large-scale distributed systems.

DataInformed has an article about how YARN makes it possible to run new tools and software (including enterprise software) on Hadoop. For instance, in addition to removing the MapReduce handcuffs, YARN allows execution of non-Java binaries. The article points out that it’ll probably take a while for many legacy software applications to work well on YARN, but it’s already opening the door for new kinds of systems. In particular, the article demonstrates a simple graphical data flow program that solves the word count problem using software from RedPoint Global.
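For reference, the word count problem mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps; it doesn't use Hadoop, YARN, or RedPoint's software, and the function names (`map_phase`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Group the pairs by word and sum the counts in each group."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines):
    """Hypothetical helper: run the map step, then the reduce step."""
    return reduce_phase(map_phase(lines))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

On a real cluster the two phases run on different machines with a shuffle in between; here they are chained in a single process to keep the sketch short.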


Last Sunday, a unanimous vote was passed to graduate Spark from the Apache incubator. Apache Spark is a computation framework that’s seen as an evolution of the MapReduce paradigm with enhancements such as in-memory data sets to support interactive querying, native support for Java, Scala, and Python, and a faster backend to Hive.

The number of partnerships and integrations relating to Hadoop continues to grow. This week, Big Data Partnership announced a partnership with Intel to deliver and implement the Intel distribution to European customers. With this and other recent announcements, Intel’s Hadoop products seem to be picking up steam.

Without a doubt, Apache Hadoop would be very different (and likely much less successful) if it weren’t open source. Hadoop was hatched at Yahoo, and they have contributed a lot to the project. The open-source bet seems to have worked well for Yahoo, and they continue to contribute to the ecosystem even after spinning out many engineers to build Hortonworks. This post acknowledges and thanks the people who made Yahoo’s Hadoop bet successful.

MapR announced that they’re opening a new office in Seoul, South Korea. MapR is seeing lots of demand for their offering in Korea, and they have appointed Daniel Kim to help expand their business in the region. As part of the news, they’ve also announced a partnership with LG CNS to provide consulting services for the MapR distribution in Korea.

Merv Adrian has a follow-up to a recent post on Hadoop security. The key point is that while Hadoop doesn’t have all of the security features expected in an enterprise system, there are lots of companies offering add-on software for Hadoop to provide them. The post walks through some of the companies that offer Hadoop security.


Apache HBase 0.98.0 was released. The new release includes a number of interesting features like reverse scans, cell-level ACLs, encryption at rest, and MapReduce over snapshots. It also includes performance improvements, bug fixes, and API cleanups. The release is wire compatible with HBase 0.96, and direct upgrades are available starting at version 0.94. Slides from the 0.98 release manager, Andrew Purtell, have more details on several of the new features.
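To give a feel for what a reverse scan does, here's a rough sketch in plain Python. It doesn't use the HBase client API; it just assumes row keys are kept in sorted order (as HBase does) and walks backwards from a starting key. The function name and the `limit` parameter are hypothetical.

```python
import bisect


def reverse_scan(sorted_keys, start_key, limit=None):
    """Yield keys <= start_key in descending order, at most `limit` of them."""
    # bisect_right returns the index just past the last key <= start_key.
    i = bisect.bisect_right(sorted_keys, start_key)
    emitted = 0
    while i > 0 and (limit is None or emitted < limit):
        i -= 1
        yield sorted_keys[i]
        emitted += 1


if __name__ == "__main__":
    keys = ["row-01", "row-03", "row-05", "row-07"]
    print(list(reverse_scan(keys, "row-05", limit=2)))
```

A typical use would be fetching the most recent few entries when row keys sort by time, without having to scan forward from the beginning.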

Apache Storm (incubating) released version 0.9.1. This is the first release of Storm since it joined the Apache incubator in September. Key changes in this release include a switch from 0MQ to Netty as the default transport, a move to the Maven build system, and improved support for Windows.

The 0.9 release of Apache Mahout was formally announced this week. The MapR blog has an overview of the fixes and new features in the release.

hdp-accumulo is a Vagrant configuration for building a local VM running Hortonworks’ HDP 2.0 and Apache Accumulo 1.5.0. The components (such as ZooKeeper and Accumulo) are accessible from the host machine, making this a very easy way to bootstrap a local environment for testing or development.

Amazon has announced an integration between its Hadoop-as-a-Service platform, Elastic MapReduce, and its streaming data system, Kinesis. The new connector allows you to process data from the Kinesis 24-hour buffer using MapReduce and other standard tools. For example, you can now use Hive to write SQL queries against windows of the streaming dataset.
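As a rough idea of what querying "windows" of a stream means, the sketch below buckets timestamped records into fixed-size windows and counts them. It's plain Python rather than Hive or the EMR connector, and the record shape (`(epoch_seconds, payload)` tuples) and the function name are assumptions made for this example.

```python
from collections import Counter


def count_by_window(records, window_seconds=60):
    """Count records per fixed-size time window.

    `records` is an iterable of (epoch_seconds, payload) tuples.
    Returns a dict mapping each window's start time to its record count.
    """
    counts = Counter()
    for ts, _payload in records:
        # Round the timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[window_start] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
    print(count_by_window(sample))
```

In SQL terms this is roughly a GROUP BY on the rounded-down timestamp.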


Curated by Mortar Data



Big Data is a Big Deal - Running Hadoop, Functional Code, Machine Learning (San Francisco) - Monday, February 24

Extending Hadoop for Fun and Profit with Milind Bhandarkar (San Francisco) - Tuesday, February 25

Etsy's journey: JRuby to Scalding. What happens when a technology chooses you (San Francisco) - Tuesday, February 25

Deploying Predictive Models in Hadoop (Irvine) - Tuesday, February 25

Real-time Hadoop - See the Elephant Fly (Berkeley) - Wednesday, February 26

Architecture Options for Machine Learning (Sunnyvale) - Friday, February 28


Free hands-on Big Data, Hadoop, Cloud and Mobile event at dev@pulse (Las Vegas) - Monday, February 24


Apache Sentry: Enterprise-grade Security for Hadoop (Chicago) - Thursday, February 27


CinJUG - Hadoop (Cincinnati) - Thursday, February 27

North Carolina

2014 Big Data Hadoop Predictions (Charlotte) - Wednesday, February 26

Apache Sentry: Enterprise Security for Hadoop (Durham) - Thursday, February 27


Hadoop Use Cases (Munich) - Tuesday, February 25

SWEDEN

SHUG 10. Adding Search to the Hadoop Ecosystem (Stockholm) - Monday, February 24

Graph-parallel machine learning (Stockholm) - Tuesday, February 25

BELGIUM

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing (Liege) - Thursday, February 27


There was a typo in the link for the DataStax and WibiData integration from last week’s newsletter. The correct link is: