Data Eng Weekly

Hadoop Weekly Issue #76

29 June 2014

Google made news this week by proclaiming that MapReduce is dead at Google—there are two reactions in this week’s issue. And with that in mind, there are several good posts covering non-MapReduce projects in the Hadoop ecosystem—Accumulo, HDFS, Storm, Spark, and more. Apache Storm also released a new version this week, and there were announcements from Hortonworks, IBM, and RainStor about their Hadoop-related products.


Apache Accumulo, the distributed key-value store, supports bulk loading of data in its native format, RFile. Loading data as RFiles, which can be generated via MapReduce jobs, is much more efficient than loading the same data one record at a time. The Sqrrl blog talks about some tools they’ve built to load data using RFiles from data stored in JSON and CSV.

A post on the Cloudera blog talks about extended attributes, which are a precursor to encryption at rest and other filesystem features. Extended attributes come in four flavors (user, trusted, system, and security), and they are a mapping of String -> byte[]. The feature is slated for the Hadoop 2.5 release, and there are a number of new HDFS command-line options and API changes to support them.
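As a rough illustration of the String -> byte[] mapping, here is a toy in-memory model in Python. It is not the HDFS implementation or API: the XAttrStore class and its methods are hypothetical, and the "<namespace>.<key>" naming convention is an assumption made for the sketch.

```python
# Toy in-memory model of extended attributes; NOT the HDFS implementation.
# XAttrStore and its method names are hypothetical, for illustration only.
VALID_NAMESPACES = {"user", "trusted", "system", "security"}


class XAttrStore:
    """Maps (path, namespaced attribute name) to a raw bytes value."""

    def __init__(self):
        self._attrs = {}  # path -> {attribute name: bytes value}

    def set_xattr(self, path, name, value):
        # Assumed convention: names look like "<namespace>.<key>".
        namespace, _, key = name.partition(".")
        if namespace.lower() not in VALID_NAMESPACES or not key:
            raise ValueError(f"name must look like '<namespace>.<key>': {name!r}")
        if not isinstance(value, bytes):
            raise TypeError("xattr values are raw bytes")
        self._attrs.setdefault(path, {})[name] = value

    def get_xattr(self, path, name):
        return self._attrs[path][name]

    def list_xattrs(self, path):
        return sorted(self._attrs.get(path, {}))


store = XAttrStore()
store.set_xattr("/data/report.csv", "user.checksum", b"\x9f\x12")
store.set_xattr("/data/report.csv", "user.origin", "crm-export".encode("utf-8"))
print(store.list_xattrs("/data/report.csv"))
```

Because values are opaque bytes, callers decide on the encoding (here UTF-8 for the text value).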

This post describes how to set up a local core-Hadoop dev environment with IntelliJ. Mostly, the process seems to just work, but there are a few tips to customize the environment and work around an issue with missing projects on the classpath.

Hortonworks has posted (without a registration-wall) the slides and recording of a recent webinar on Apache Storm. This post has answers to some questions asked during the webinar. They cover how Storm fits together with HBase, Flume, Hadoop, Spark, and more.

Hortonworks also posted a video of their webinar on advanced security on HDP. Again, there is a lot of good information in the webinar Q&A text included in the write-up. It adds some details about XA Secure (which was recently acquired by Hortonworks), Apache Knox (for perimeter security), the role of Active Directory/LDAP, encrypting data at rest, and more.

The GigaOm Structure Show podcast features an interview with Databricks co-founder and CTO Matei Zaharia. The interview, for which there are some highlights posted, covers Apache Spark (of which Matei is one of the creators). It covers the genesis of Spark as an effort to build a better computing framework, the flexibility and improved programming model of Spark, and more.

This presentation, from the East Bay Java User Group, covers building a Hadoop-based application for clickstream analysis. The talk walks through a high-level design, which includes things like deduplication and sessionization, data storage with Avro, dataset partitioning in HDFS, and data ingestion with Flume. For each component, there’s a discussion of alternatives (e.g. Flume vs. Kafka) and why a particular alternative was chosen.
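For readers unfamiliar with the two processing steps, here is a minimal single-machine sketch in Python. It is not taken from the presentation: the event layout (user, timestamp in seconds, URL), the function names, and the 30-minute inactivity threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not from the talk


def deduplicate(events):
    """Drop exact repeats of (user, timestamp, url), keeping the first occurrence."""
    seen = set()
    unique = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group each user's clicks into sessions, splitting on gaps longer than `gap`."""
    by_user = defaultdict(list)
    for user, ts, url in events:
        by_user[user].append((ts, url))

    sessions = defaultdict(list)  # user -> list of sessions (each a list of clicks)
    for user, clicks in by_user.items():
        clicks.sort()
        current = [clicks[0]]
        for prev, click in zip(clicks, clicks[1:]):
            if click[0] - prev[0] > gap:
                sessions[user].append(current)
                current = []
            current.append(click)
        sessions[user].append(current)
    return dict(sessions)


events = [
    ("alice", 0, "/home"),
    ("alice", 0, "/home"),       # duplicate delivery of the same click
    ("alice", 600, "/products"),
    ("alice", 4000, "/home"),    # more than 30 minutes later: new session
    ("bob", 100, "/home"),
]
result = sessionize(deduplicate(events))
print({user: len(user_sessions) for user, user_sessions in result.items()})
```

At Hadoop scale the same grouping would typically be expressed as a job keyed by user, with the per-user sort and gap check done in the reduce step.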

Datanami has a case study of T-Mobile, which recently switched from a petabyte Netezza appliance to Hadoop with RainStor. The post covers T-Mobile’s scaling challenges (they have a 2.5x increase in data every 18 months), the security considerations that T-Mobile addressed (including an isolated network), and their choice of RainStor for SQL and encryption/compression.


This post talks about how we’re in the third wave of Hadoop. According to the article, the first wave was the early adopters that had new types/volumes of data, the second wave created a number of new projects/products and companies offering Hadoop support, and the third is a movement to use Hadoop as a database rather than deploying individual MapReduce jobs.

Hortonworks and IBM announced that IBM InfoSphere Guardium is certified with HDP 2.1. Guardium provides real-time monitoring, alerting, and reporting for audit logging and mitigating data breaches.

The Gartner blog has a post on defining Hadoop. A couple of years ago, the definition was limited to six projects, but now as many as fifteen are supported by commercial distributions. And there are more projects likely to be included in that list as time goes on.

Cloudera, Dell, and Intel announced a new Dell In-Memory Appliance for Cloudera Enterprise. The appliances are optimized for Apache Spark, Apache Solr, and other memory-intensive workflows. Cloudera mentions that more memory is necessary given the push to use Hadoop for real-time analytics rather than batch processing.

This post on Big Data as a Service (BDaaS, the more general version of Hadoop as a Service) tries to answer the question “What are the different types of BDaaS available?” It covers Core BDaaS (e.g. Amazon EMR), Performance BDaaS (e.g. Altiscale), Feature BDaaS (e.g. Qubole), and Integrated BDaaS.

Continuuity Founder and CEO Jonathan Gray has written a post about Hadoop Summit. He’s identified a few trends—the push towards enterprise support (notably work on security), the balancing act of Hadoop and the traditional EDW, and the fragmentation of Hadoop (vendors supporting different stacks, competing projects, and more). He also mentions that Hadoop needs to be simplified, which is a theme that seems to be gaining traction in some areas (e.g. newer programming models).

Google’s Urs Hölzle made the news this week when he proclaimed “We don’t really use MapReduce anymore” at Google I/O. While many folks were surprised by the announcement, this post explores why it’s not that surprising. Rather, with the research on systems like Dryad (from 2007) and MPP database products, it’s a little surprising that MapReduce is still so prevalent in Hadoop.

In another post triggered by the Google/MapReduce news, the author discusses why MapReduce gained popularity as a processing framework by contrasting it with MPI. With the MapReduce primitives, you can solve a lot of problems. But to solve more complex problems, the Hadoop platform needs something more.
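To make "the MapReduce primitives" concrete, here is a minimal in-process sketch in Python of the map, shuffle, and reduce steps applied to word counting. It is an illustration only, not Hadoop's API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(_key, values):
    return sum(values)


lines = ["storm and spark", "spark on yarn", "storm on yarn"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

Anything that fits this shape (a per-record function plus a per-key aggregation) is easy to express; iterative or multi-stage computations need these steps chained repeatedly, which is where the model starts to strain.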


Hortonworks has officially classified Apache Spark as “YARN Ready.” The project is still available as an HDP 2.1 Tech Preview, and Hortonworks has some suggestions for deployment (e.g. multiple Spark deploys on a single YARN cluster if you have many concurrent Spark users).

SequenceIQ has posted a new Docker image for Apache Hadoop 2.4 to the official Docker registry. Their post describes how to build the image for yourself and includes instructions for some simple testing.

Hortonworks announced that HDP Advanced Security, which is based on the XA Secure acquisition, is now available as an add-on download for HDP 2.1. As part of the announcement, Hortonworks also reiterated their commitment to submitting the software to the Apache Incubator.

structor is a project for building Hadoop VMs using Vagrant. While there are several solutions available for doing so, this setup also includes support for building a secure Hadoop cluster using Kerberos.

Apache Storm, the stream processing framework, released version 0.9.2. The new version includes improvements to the Netty transport and the Storm UI, a new Kafka Spout, and more.

hRaven, a tool for collecting metadata about MapReduce jobs, released version 0.9.15. The new version includes updates to several components, instrumentation of REST API calls, and other improvements.

RainStor, which provides interactive SQL-on-Hadoop, released version 6 this week. The new release, which is certified for Cloudera 5 and Hortonworks HDP 2.1, includes a new archive application, and integration with Apache Ambari and HCatalog.

IBM released IBM InfoSphere BigInsights V3.0 this week. The new release includes a new Big SQL component for low-latency SQL-on-Hadoop (including SQL 2011 support), and an updated version of Solr. The release is available in three editions—quick start, standard, and enterprise.


Curated by Mortar Data



Spark Summit 2014 (San Francisco) - Monday, June 30


Introduction to Pig with Live Demonstration (Tempe) - Wednesday, July 2

North Carolina

NC State and IBM Discussion of Hadoop Usage Patterns (Durham) - Monday, June 30

Washington, District of Columbia

Elasticsearch-DC meetup with a Federal Agency Presenting (Washington, D.C.) - Monday, June 30


Big Data June Meetup (Mannheim) - Monday, June 30


BigData/HadoopSG meetup (Singapore) - Tuesday, July 1


Big Data Mining and Graph Processing (Sydney) - Thursday, July 3


SQL for Hadoop (Zurich) - Thursday, July 3


Big Data Beyond Hadoop: Spark and Message Queuing Systems (Bangalore) - Friday, July 4

Hadoop by Use Case and Example (Hyderabad) - Saturday, July 5