Data Eng Weekly

Hadoop Weekly Issue #54

26 January 2014

Hortonworks announced this week that HDP 2.0 for Windows is GA, which brings YARN to Windows. This week’s issue also contains two articles about Hadoop security, a topic that’s been discussed a lot in recent weeks. Software maintainers were quite busy the past week or so, too: I’ve highlighted nine releases in this issue. Overall, there’s quite a bit of interesting content this week, so enjoy!


When optimizing or debugging Hadoop, it can be really useful to understand the underlying architecture. This post covers HDFS’s block replication and placement policy. After walking through the default placement algorithm, it goes through an example of writing files and inspecting blocks on a cluster spread across three racks. The post covers using hadoop fsck to find block locations, which is a really useful tool for administering HDFS.
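
For readers who want a feel for the placement policy before reading the post, here is a minimal Python sketch of the commonly described HDFS default for a replication factor of three: the first replica goes on the writer's node, the second on a node in a different rack, and the third on a different node in the same rack as the second. The function name, the topology dictionary, and the node names are all hypothetical, and the sketch assumes the writer runs on a datanode and every rack has at least two nodes. The real NameNode logic also weighs free space and load, which is ignored here.

```python
import random


def choose_replica_nodes(writer, topology, rng=random):
    """Pick three datanodes for one block (simplified illustration).

    topology maps a rack name to the list of datanodes on that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Replica 1: the node the writer is running on.
    first = writer

    # Replica 2: a node on any rack other than the writer's.
    remote_racks = [rack for rack in topology if rack != rack_of[first]]
    second_rack = rng.choice(remote_racks)
    second = rng.choice(topology[second_rack])

    # Replica 3: same rack as replica 2, but a different node.
    third = rng.choice([n for n in topology[second_rack] if n != second])

    return [first, second, third]


# Hypothetical three-rack cluster, mirroring the setup the post describes.
topology = {
    "/rack1": ["dn1", "dn2"],
    "/rack2": ["dn3", "dn4"],
    "/rack3": ["dn5", "dn6"],
}
replicas = choose_replica_nodes("dn1", topology, random.Random(0))
print(replicas)
```

The result always spans exactly two racks, which is the trade-off the policy makes between surviving a rack failure and limiting cross-rack write traffic.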

In this post, the author explains his experience getting started with Apache Spark (incubating). After initially being underwhelmed, the author realizes that the power of Spark is hidden by the simple API. The post covers the Spark data model, approach to fault tolerance, Spark streaming, and more.

Hortonworks has published a tutorial on loading data directly from HDFS to Microsoft Excel using Microsoft Power Query for Excel. In the tutorial, data is aggregated via Hive queries and stored in an Excel-compatible text file. Next, data is pulled into Excel via the WebHDFS REST API. The tutorial has screenshots of the entire process, including some impressive visualizations towards the end.
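
As a rough illustration of what a WebHDFS request looks like, the Python sketch below builds the URL for an operation such as OPEN or LISTSTATUS. The host name and file path are made up, and the port of 50070 is an assumption based on the usual NameNode HTTP default in Hadoop 1.x and 2.x; the tutorial's own cluster settings may differ. No network call is made.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    query = {"op": op}
    if params:
        query.update(params)
    # quote() keeps "/" intact, so the HDFS path maps directly onto the URL.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Hypothetical host and file path, for illustration only.
url = webhdfs_url("namenode.example.com", "/user/hive/output/part-00000", "OPEN")
print(url)
# http://namenode.example.com:50070/webhdfs/v1/user/hive/output/part-00000?op=OPEN
```

Any HTTP client, Excel's Power Query included, can fetch a URL of this shape, which is what makes WebHDFS convenient for tools outside the Hadoop ecosystem.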

IBM developerWorks has a post on Hadoop security. The post outlines the current state of security for Hadoop. It focuses on the features provided by the Apache Sentry (incubating) project, which provides fine-grained authorization to data accessed via Hive or Cloudera Impala. After that, the article tours some of the other security projects in the ecosystem.

On the heels of the announcement of the Google Cloud Storage Connector for Hadoop, the Google Cloud Platform blog has a post touting its performance. Written by guest author Mike Wendt of Accenture Technology Labs, the performance evaluation spans three queries and shows that MapReduce jobs running on data stored in GCS are faster than the same jobs running on data stored in HDFS. It’s unclear if optimal HDFS configuration was used (the full details of the experiment are behind a paywall), but this is a promising result regardless.

Oscar Boykin from Twitter recently spoke on the history, patterns and future of the Scalding library for Hadoop. Scalding is a Scala DSL for Cascading that was started at Twitter in 2011 and open-sourced in early 2012. It’s used by the Summingbird project and the KijiExpress framework. The slides propose some interesting possibilities for the future—including a Spark backend for Scalding.

Cloudera has compiled some resources for building User Defined Functions (UDFs) for Cloudera Impala. The 1.2.x release of Impala supports scalar UDFs and UDAFs but not UDTFs or window functions. In addition to supporting Hive Java UDFs, Impala supports UDFs written in C++. This post walks through the sample code for and process of building a new UDF. It also presents a performance comparison between a Java and C++ UDF.

Java Magazine has an article by Tom E. White (author of Hadoop: The Definitive Guide) on the Kite SDK for Hadoop. The article covers how the Kite SDK integrates with all the various components in the Hadoop stack. It walks through serializing data with Avro, managing datasets with Kite, ingesting data with Flume, computing with Crunch, and querying the data using Cloudera Impala.

Databricks, the company providing commercial support for Apache Spark, has a post on Spark and Hadoop. The post starts by explaining that “Spark is intended to enhance, not replace, the Hadoop stack.” The rest of the post is devoted to explaining the integration—running Spark on Hadoop 1.0 inside of MapReduce, inside of Hadoop 2.0 on YARN, or standalone but integrated with HDFS.

The Wajam blog has a post on their efforts to improve the efficiency of their offline analytics platform. The post describes the pieces and use cases of their analytics platform. In order to improve the runtime speed and ease of constructing queries, they have introduced a preprocessing step to sessionize data around individual user searches.
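
To make the idea of sessionizing concrete, here is a generic Python sketch that groups each user's events into sessions, starting a new session whenever the gap between consecutive events exceeds a threshold. This is not Wajam's implementation: their post sessionizes around individual searches, while the function name, the (user, timestamp) event shape, and the 30-minute gap used here are assumptions chosen purely for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_seconds=1800):
    """Split (user, timestamp) events into per-user sessions on inactivity gaps."""
    sessions = []
    # Sort by user, then by time, so each user's events are contiguous and ordered.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in user_events:
            # A long enough pause closes the current session.
            if current and ts - current[-1] > gap_seconds:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions


# Made-up events: timestamps are seconds since an arbitrary start.
events = [("alice", 0), ("alice", 600), ("alice", 5000), ("bob", 100)]
sessions = sessionize(events)
print(sessions)
# [('alice', [0, 600]), ('alice', [5000]), ('bob', [100])]
```

Doing this grouping once, ahead of time, means later queries can work on whole sessions instead of re-deriving them from raw events each run.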


Hadoop Summit Europe co-host Hortonworks has announced the agenda for Hadoop Summit Europe 2014, which takes place in Amsterdam in April. Talks span five tracks, which are detailed in the post on the Hortonworks blog.

SearchCIO has summarized a recent webinar on big data given by Gartner analysts Merv Adrian and Nick Heudecker. During the webinar, they made some observations about the strengths and promises of Hadoop 2.0/YARN as well as its weaknesses. The article also summarizes their analysis of seven Hadoop distributors. Merv has more about the weaknesses (security) on the Gartner blog.


Twitter Summingbird 0.3.2 was released. Summingbird is a framework for supporting hybrid streaming and batch computation (e.g. online with Apache Storm and offline with MapReduce). This release includes bug fixes and some new features.

Hortonworks Hoya, the project for running Apache HBase and Accumulo on YARN, released version 0.10.1 this week. Key changes in this release include setting the YARN queue, updates to the freeze and exists commands, a package rename in preparation for incubating in Apache, and a mechanism for specifying the JVM heap size of the launched daemons. There are full details on the GitHub release page.

Stratosphere is a distributed computation framework with similar goals and features to Spark. The 0.4 release includes a Scala programming interface (in addition to Java), iterative algorithms, Spargel (a Pregel-inspired graph processing API), and much more. It also runs atop YARN (Apache Hadoop 2.2).

Ovum has a post by Principal Analyst Tony Baer on the Intel distribution, which was recently updated. The post covers two of the main updates in this release: improvements to hardware-based encryption (bringing it to HBase, MapReduce, Hive and Pig) and a new machine learning library. The release also adds cell-based access control to HBase. Ovum’s post also covers the question of “What is Intel doing in the Hadoop business?”

Rubydoop 1.1.1 is a bug fix release to solve a race condition with the proxy layer.

Avro 1.7.6 was released. This release includes a number of bug fixes, improvements, and new features. Of note, Avro is OSGi-ready, has gained an XZ codec, and includes new docs for the MapReduce API.

Hortonworks announced that their distribution, HDP 2.0, is now GA for Windows. This brings YARN to Windows Server 2008 R2 and Windows Server 2012 R2. HDP 2.0 for Windows is the entire Hadoop stack, including Hive with phase 2 of the Stinger initiative, HBase 0.96, and more. The first link has more details on the release and the second is a walkthrough on installing HDP on Windows.

HDFS Explorer is a Windows application for browsing HDFS using a Windows explorer-like UI. It uses WebHDFS to access the file system.

MapR has announced that updated releases of several Hadoop ecosystem projects are now certified to run on MapR 3.1.0. The list includes HBase, Oozie, Flume, Hive, and more.


Curated by Mortar Data

Monday, January 27

Talking Identity -- Hadoop and SCIM (San Francisco, CA)

Tuesday, January 28

How LinkedIn uses Scalding for Data Science (San Francisco, CA)

Hadoop 2가 왜 중요한가 (Why Hadoop 2 Matters) (Seoul, South Korea)

Wednesday, January 29

Winter 2014 Seminar Series: Big Data Infrastructure (Tacoma, WA)

Agile Data Science by Russell Jurney (Palo Alto, CA)

Paco Nathan: Data Workflows for Machine Learning (Seattle, WA)

Advanced Hadoop Based Machine Learning (Austin, TX)

Thursday, January 30

DataPotluck X Hadoop and Elastic MapReduce with Q Ethan McCallum (Chicago, IL)

SHUG 9. Dealing with Dirty Data (+) Text analytics in IBM's big data solutions (Stockholm, Sweden)

HBase Meetup @ Apple (San Francisco, CA)

Hadoop (2.2.0) Intro (By Popular Demand) with YARN aka New MapReduce in Focus (Saint Augustine, FL)

Data Evolution on HBase using Kiji by Adam Kunicki of Wibidata (Los Angeles, CA)

Friday, January 31

HANA & Hadoop in Half-time (Palo Alto, CA)

2013 Presentations & Hands-on Labs, see you in 2014! (San Jose, CA)

Saturday, February 1

Bangalore Hadoop Meetup (Bangalore, India)