Data Eng Weekly

Hadoop Weekly Issue #83

17 August 2014

The big news this week was the Apache Hadoop 2.5.0 release. There are also a number of interesting technical articles covering Apache Hadoop HDFS, Apache Drill, and several other ecosystem projects. Also, there's an interesting post on profiling MapReduce jobs (which is typically quite challenging) with Riemann.


The Cloudera blog has a post on the motivation and design for HDFS caching, which was implemented as part of the Apache Hadoop 2.3.0 release. Cloudera recommends its use in CDH 5.1 to speed up Impala and other applications. Data is stored in cache by sending a cache directive to the NameNode, which keeps track of which files are cached where. This design allows applications to take advantage of locality of cached data (and enables zero-copy reads).
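The bookkeeping described above can be illustrated with a toy registry. This is a minimal Python sketch of the idea, not the actual HDFS API — the class and method names here are hypothetical:

```python
# Toy sketch of NameNode-style cache-directive bookkeeping.
# All names are hypothetical illustrations, not the real HDFS API.

class CacheRegistry:
    """Tracks which paths are cached and on which DataNodes."""

    def __init__(self):
        self._directives = {}   # path -> cache pool name
        self._locations = {}    # path -> set of DataNode ids

    def add_directive(self, path, pool):
        """A client sends a cache directive; the registry records it."""
        self._directives[path] = pool

    def report_cached(self, path, datanode):
        """A DataNode reports that it has cached the blocks for a path."""
        if path in self._directives:
            self._locations.setdefault(path, set()).add(datanode)

    def cached_locations(self, path):
        """Schedulers can prefer these nodes for cache locality."""
        return self._locations.get(path, set())

registry = CacheRegistry()
registry.add_directive("/warehouse/sales", pool="impala")
registry.report_cached("/warehouse/sales", "datanode-3")
print(registry.cached_locations("/warehouse/sales"))  # {'datanode-3'}
```

Applications can then schedule reads on the nodes returned by the locality lookup, which is what enables the cache-local (and zero-copy) read path.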

MapR is one of the biggest proponents of Apache Drill, so it’s interesting to hear their take on the recent 0.4.0 developer preview. This post talks about Drill’s agility (it can run queries directly over datasets without the need for a metastore), flexibility (its internal data model is JSON-like, allowing for nested data types), and familiarity (the query language is SQL). MapR also has pre-configured packages of Drill for their distribution.
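The "schema on the fly" idea — resolving fields in nested, JSON-like records at query time, with missing fields treated as null — can be sketched in a few lines of Python (this illustrates the data model only, not Drill's engine or SQL syntax):

```python
# Illustration of querying nested, JSON-like records without a declared
# schema, in the spirit of Drill's data model (not Drill's actual engine).
import json

def get_path(record, dotted):
    """Resolve a dotted path like 'user.address.city' in a nested dict."""
    value = record
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return None   # schema discovered on the fly; missing fields are null
        value = value[key]
    return value

rows = [json.loads(line) for line in [
    '{"user": {"name": "ada", "address": {"city": "london"}}}',
    '{"user": {"name": "alan"}}',   # no address: still queryable
]]
cities = [get_path(r, "user.address.city") for r in rows]
print(cities)  # ['london', None]
```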

IPython notebooks are a popular tool for data scientists, particularly when sharing data exploration tooling. Given that Spark has a Python API, it’s a natural (and powerful) idea to marry the two for data exploration and analysis. The Cloudera blog has a detailed tutorial on setting up IPython, pyspark, and a simple IPython notebook to interact with a Spark cluster. There is some example code on GitHub and the IPython notebook viewer.
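The kind of exploratory analysis the tutorial runs through pyspark's RDD API (flatMap, map, reduceByKey) follows a familiar pattern; here is the same word-count logic emulated over a plain Python list, so it runs without a Spark cluster (a sketch of the style, not the pyspark API itself):

```python
# The flatMap/map/reduceByKey pattern one would run via pyspark's RDD API,
# emulated over a plain list so it runs without a Spark cluster.
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words
words = chain.from_iterable(line.split() for line in lines)
# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["to"])  # 4
```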

Several months back, the Apache Mahout community announced a migration from MapReduce to Spark for the backend of core algorithms. In addition, they’re developing a Scala DSL for representing data transformations. This post looks at the Scala DSL and the rewritten (for Spark) item-based recommendation system. It also describes the command-line tool that can be used to run this system against data stored in text-delimited files.
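Item-based recommendation of the kind the Mahout post describes boils down to counting which items cooccur in users' interaction histories. Here is a miniature pure-Python sketch of that family of algorithm — not Mahout's Scala DSL, and the data is invented for illustration:

```python
# Minimal sketch of item-based (cooccurrence) recommendation, the same
# family of algorithm the Mahout post rewrites for Spark. Pure Python,
# with made-up data; not Mahout's Scala DSL.
from collections import defaultdict

interactions = {          # user -> items they interacted with
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c"},
}

# Count how often each pair of items cooccurs in a user's history.
cooccur = defaultdict(lambda: defaultdict(int))
for items in interactions.values():
    for i in items:
        for j in items:
            if i != j:
                cooccur[i][j] += 1

def recommend(user, top=1):
    """Score unseen items by cooccurrence with the user's items."""
    seen = interactions[user]
    scores = defaultdict(int)
    for i in seen:
        for j, n in cooccur[i].items():
            if j not in seen:
                scores[j] += n
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(recommend("u2"))  # ['c'] -- 'c' cooccurs with both of u2's items
```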

Profiling distributed systems can be a complicated task. It’s particularly hard for MapReduce jobs where there is often a mix of user-code, library code (e.g. Hive, Cascading), and framework code. This post describes how Factual uses Riemann to profile Hadoop jobs. It describes the system’s profiling strategy and how results are collected at a central location for analysis. The post also describes several performance issues that the system helped to uncover and resolve.
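The core strategy — periodically sample thread stacks and aggregate the counts — can be shown with a toy in-process profiler. This is a sketch of the general technique only; Factual's system samples JVM tasks and ships the samples to Riemann for central analysis:

```python
# Toy stack-sampling profiler: periodically sample a thread's stack and
# count which function is on top. A sketch of the technique only -- the
# real system profiles JVM tasks and sends samples to Riemann.
import sys
import threading
import time
from collections import Counter

samples = Counter()
stop = threading.Event()

def sampler(target_thread_id, interval=0.001):
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1   # function at top of stack
        time.sleep(interval)

def busy_work():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

t = threading.Thread(target=sampler, args=(threading.get_ident(),))
t.start()
busy_work()
stop.set()
t.join()

print(samples.most_common(1)[0][0])  # the most frequently sampled function
```

Aggregating these per-function counts across tasks is what lets hot spots in user, library, or framework code stand out.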

This post on the Hortonworks blog describes how to use Apache Knox as a secure gateway to HiveServer2. It’s a fairly complicated setup (Hive client -> JDBC over HTTPS -> Knox -> HTTP -> HiveServer2), but it can be used to achieve perimeter security for a Hadoop cluster (Knox can authenticate users). The post shows how to configure Hive with Apache Ambari and the required connection strings for Knox and the Hive client (beeline). There’s also a section on configuring another client, Simba, over ODBC.

This presentation, recently given at the Chicago Hadoop User Group, describes the Drill data model/architecture (namely, schema “on-the-fly”), the Drill execution engine (which does runtime byte-code generation/compilation), and a Drill demo. The video of the presentation is available on Vimeo at the second link below.

This post describes an end-to-end solution for building a recommendation engine using Apache Spark’s MLlib. The system uses MLlib’s alternating least squares algorithm to build up predictions for each user of the website, which are stored in MongoDB. It features an application built with the Play framework to serve recommendations. The code for the project is on GitHub.
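Alternating least squares works by fixing the item factors while solving a small least-squares problem for each user's factors, then swapping roles. The miniature pure-Python version below (rank fixed at 2 so the normal equations reduce to a 2x2 solve) illustrates the algorithm that MLlib implements at scale — it is a sketch with invented data, not MLlib's API:

```python
# Miniature alternating-least-squares factorization in pure Python,
# illustrating the algorithm MLlib implements at scale. Rank K = 2 so
# the normal equations are a hand-rolled 2x2 solve.
import random

random.seed(0)
K, LAM = 2, 0.1

ratings = {0: {0: 5, 1: 4}, 1: {0: 4, 2: 1}, 2: {1: 5, 2: 2}}  # user -> {item: rating}
n_items = 3

U = [[random.random() for _ in range(K)] for _ in range(3)]        # user factors
V = [[random.random() for _ in range(K)] for _ in range(n_items)]  # item factors

def solve_factor(others, their_ratings):
    """Solve (Y^T Y + LAM*I) x = Y^T r for one row, with K == 2."""
    a = b = d = e0 = e1 = 0.0
    for j, r in their_ratings.items():
        y0, y1 = others[j]
        a += y0 * y0; b += y0 * y1; d += y1 * y1
        e0 += y0 * r; e1 += y1 * r
    a += LAM; d += LAM
    det = a * d - b * b
    return [(d * e0 - b * e1) / det, (a * e1 - b * e0) / det]

def sq_error():
    return sum((sum(U[u][k] * V[i][k] for k in range(K)) - r) ** 2
               for u, row in ratings.items() for i, r in row.items())

before = sq_error()
for _ in range(10):
    for u, row in ratings.items():               # fix V, solve each user
        U[u] = solve_factor(V, row)
    by_item = {}
    for u, row in ratings.items():
        for i, r in row.items():
            by_item.setdefault(i, {})[u] = r
    for i, row in by_item.items():               # fix U, solve each item
        V[i] = solve_factor(U, row)
after = sq_error()

print(after < before)  # True: alternating solves reduce the error
```

The predicted rating for any (user, item) pair is then just the dot product of the two factor rows, which is what a serving layer (MongoDB plus Play, in the post's architecture) would look up.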

Apache Spark Streaming and Apache Storm are often mentioned as tools solving similar problems. But this presentation makes the observation/point that Spark Streaming is a (micro-)batch processing framework while Storm is a stream processing framework. Trident, the abstraction atop Storm, is more comparable to Spark Streaming. The rest of the presentation focuses on comparing Trident and Spark Streaming, including considerations for fault tolerance and reliability.
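The micro-batch vs. per-event distinction can be made concrete with a toy example: both styles compute the same result, but they differ in how many times the processing function is invoked (and therefore in latency and per-call overhead). This is an illustration of the two models, not Storm's or Spark Streaming's actual APIs:

```python
# Toy contrast between per-event processing (Storm-style) and micro-batch
# processing (Spark Streaming-style). Both compute the same running sum;
# they differ in how many times the processing function is invoked.

events = list(range(10))
calls = {"per_event": 0, "micro_batch": 0}

def process_one(total, e):
    calls["per_event"] += 1
    return total + e

def process_batch(total, batch):
    calls["micro_batch"] += 1
    return total + sum(batch)

# Stream processing: one invocation per event.
stream_total = 0
for e in events:
    stream_total = process_one(stream_total, e)

# Micro-batching: buffer events into small batches, one invocation each.
BATCH = 5
batch_total = 0
for start in range(0, len(events), BATCH):
    batch_total = process_batch(batch_total, events[start:start + BATCH])

print(stream_total == batch_total, calls)
```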

The Tachyon project is trying to solve a similar problem to the HDFS file caching solution described in an earlier post. It takes a different approach, though, by implementing an in-memory FileSystem that also supports writing through to persistent storage on HDFS (or S3 or anything implementing the FileSystem API). This post has several more details about the project, which is currently in an early release.
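The write-through idea — serve reads from a memory tier when possible, while every write also lands in the persistent backing store — can be sketched with a small Python class. The names here are hypothetical illustrations, not Tachyon's API, and a dict stands in for HDFS/S3:

```python
# Sketch of a write-through in-memory file layer over a slower backing
# store, illustrating Tachyon's idea (names are hypothetical, not
# Tachyon's API). The backing dict stands in for HDFS or S3.

class WriteThroughFS:
    def __init__(self, backing):
        self.memory = {}        # hot, in-memory copies
        self.backing = backing  # persistent store (stand-in for HDFS)

    def write(self, path, data):
        self.memory[path] = data
        self.backing[path] = data   # write-through: durable immediately

    def read(self, path):
        if path in self.memory:     # served from memory when possible
            return self.memory[path]
        data = self.backing[path]   # fall back to persistent storage
        self.memory[path] = data    # and repopulate the memory tier
        return data

hdfs = {}                  # pretend persistent store
fs = WriteThroughFS(hdfs)
fs.write("/logs/a", b"hello")
fs.memory.clear()          # simulate eviction / memory loss
print(fs.read("/logs/a"))  # recovered from the backing store
```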


The Qubole blog has a post summarizing a number of recent announcements in the Hadoop ecosystem. It focuses on the business and enterprise side of the Hadoop news in more depth than this newsletter typically does.

Hortonworks announced that the code of the Hadoop security offering from XA Secure (which Hortonworks recently acquired) was submitted to the Apache incubator as the Argus podling. The post describes the project charter and invites developers to help build a community around the project.

ScaleOut hServer is a drop-in replacement for the Hadoop MapReduce engine that executes on data stored in-memory. ScaleOut announced this week that they’ve attained Hortonworks Certification.

A lot of marketing and news coverage of Hadoop surrounds tech companies in the Bay Area and New York. This article takes a look at other areas where Hadoop and big data are having a major impact: the agriculture, insurance, and automotive industries.

Splice Machine, makers of an RDBMS backed by Apache HBase and Apache Derby, recently announced an $18M round of funding. This article has an interview with their CEO during which he explains more about their business plan and target customers. Rather than competing with existing Hadoop vendors, they’re hoping to grab users of Oracle, IBM, or other enterprise RDBMS products.

Hadoop is a relatively young software project, and it’s lacking a number of important features. This article discusses some of those key features (e.g. security and ease of operation) and points out that folks are using Hadoop anyway. The conclusion seems to be that Hadoop is often used as a supplement to existing systems, so folks are willing to use it even given its warts.


Apache Hadoop 2.5.0 was released. The new version includes updates to HDFS (extended file attributes, an improved web UI) and improvements for YARN (better REST API support and security for the application timeline server). The release also contains a large number of improvements (including to documentation) and bug fixes.

MapR has announced support for new versions of AsyncHBase, HBase, Hive, Flume, and Oozie for their distribution. Flume is seeing the largest update, going from Flume 1.4 to 1.5 (which includes a disk-spillable channel and more).

Apache Sqoop 1.4.5 was released. The new version adds support for Apache Accumulo and a new high-performance Oracle connector. There are also a large number of bug fixes and improvements (covering HBase, Avro, Amazon S3, and MySQL support).

Mortar (full disclosure: they help with this newsletter and syndicate Hadoop Weekly) has open-sourced its StoreFunc for DynamoDB. The so-called DynamoDBStorage UDF allows for efficiently writing data to DynamoDB as part of a Pig job. Its write throughput and retry behavior are customizable.
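Any writer targeting DynamoDB has to cope with throttling when it exceeds provisioned throughput, which is why retry behavior matters. Here is a generic retry-with-exponential-backoff sketch of the kind such a writer needs — this is an illustration of the pattern, not Mortar's actual DynamoDBStorage code:

```python
# Generic retry-with-exponential-backoff sketch of the kind a DynamoDB
# writer needs when requests are throttled. An illustration of the
# pattern, not Mortar's actual DynamoDBStorage implementation.
import time

class Throttled(Exception):
    pass

def with_retries(fn, max_attempts=5, base_delay=0.01):
    """Call fn, backing off exponentially after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky_write():
    attempts.append(1)
    if len(attempts) < 3:
        raise Throttled()   # first two attempts are throttled
    return "ok"

print(with_retries(flaky_write), len(attempts))  # ok 3
```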


Curated by Mortar Data



Escape From Hadoop: Spark One-Liners for C* Ops (Milpitas) - Tuesday, August 19

OC Big Data Monthly Meetup #4 (Irvine) - Wednesday, August 20

Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, August 20

Network Design Challenges for Hadoop Environments (San Francisco) - Wednesday, August 20


Boise BI User Group Summer Session (Boise) - Thursday, August 21


Hadoop Lunch at Adobe - Competition Rules/Details (Lehi) - Thursday, August 21


Genomic Sequencing & Hadoop (Scottsdale) - Tuesday, August 19

A Detailed Look at Big R: R + IBM InfoSphere BigInsights (Scottsdale) - Wednesday, August 20


Getting Jiggy with Change Data Capture and Slowly Changing Dimensions (Boulder) - Wednesday, August 20


Apache Drill: Building Highly Flexible, High Performance Query Engines (Omaha) - Thursday, August 21


Apache Samza: LinkedIn's Real-Time Stream Processing Framework (Austin) - Wednesday, August 20

3rd Thursday Huddle! (Dallas) - Thursday, August 21


Hybrid BI Solutions with Hadoop and Microsoft Toolsets (Oak Brook) - Thursday, August 21

What's New with Apache Spark? An Evening with Paco Nathan (Chicago) - Thursday, August 21


Building a Fully Functional Hadoop Cluster in 1 Hour for Less Than $1 (Richmond) - Tuesday, August 19

North Carolina

Triad Hadoop Users Group (Winston Salem) - Thursday, August 21


HUG Pittsburgh August Meeting (Pittsburgh) - Wednesday, August 20

SQL on Hadoop (Philadelphia) - Wednesday, August 20


Real-World Hadoop Applications, Built in Bucharest (Bucharest) - Thursday, August 21


Hadoop Meetup (Bangalore) - Saturday, August 23