Data Eng Weekly

Hadoop Weekly Issue #68

04 May 2014

There are several articles this week covering deploying Hadoop, including two on integrating Hadoop and Docker. Given how hard it can be to test out Hadoop (let alone deploy to production), it’s always promising to see new tools and systems being used. Videos from Hadoop Summit Amsterdam were posted, and there are several new releases including a Tech Preview of Spark on HDP, and a new version of Impala. Enjoy all of the content to consume and new software to try out!


The Pivotal blog has a post on running the Pivotal HD distribution inside of Docker. By utilizing pre-packaged Docker images, it's very simple to get an environment up and running. The tutorial includes setting up MapReduce as well as HAWQ, the SQL-on-Hadoop system from Pivotal. There are Docker containers for some other distributions, so it should be possible to adapt this tutorial to other environments.

Another post covers getting a Hadoop cluster going quickly, this time using Puppet to provision virtual machines running in VirtualBox via Vagrant. Specifically, it bootstraps 3 VMs with Apache Ambari, at which point you can use the management software to install and configure the Hadoop daemons. If you want to try out Ambari, this is a good way to do so quickly.

Rather than running Hadoop in Docker, this post discusses upcoming support for running Docker containers inside of YARN. Docker supports pre-baked images that can contain libraries and binaries not found on the host, making it possible to run jobs with vastly different sets of dependencies on the same compute node (akin to virtualization, but with much less overhead). The Register has more details on the integration, including interviews with Altiscale CEO Raymie Stata and Hortonworks’ Arun Murthy.

The Sqrrl blog has a post on recent news related to big data security. It covers HDFS ACLs, Apache Knox, MongoDB 2.6, and Cloudera Search. The post wraps up with details about the security features of Sqrrl Enterprise.

The Cloudera blog has a post on the recently announced Python client for Impala, impyla. It contains a walkthrough of the API, including the preview APIs for integrating with scikit-learn and shipping Python UDFs.
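For readers who have not used a Python database client before, here is a minimal sketch of the query pattern. It assumes impyla exposes a DB API 2.0 style `connect` function in `impala.dbapi` and that the Impala daemon listens on port 21050; the host, port, table, and helper names are illustrative assumptions, not taken from the Cloudera post. The demo runs against an in-memory SQLite database as a stand-in, so it works without a cluster.

```python
import sqlite3


def fetch_rows(conn, sql):
    """Run one query on any DB API 2.0 connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()


def connect_impala(host, port=21050):
    """Open an impyla connection (hypothetical helper).

    Assumes impyla is installed and an Impala daemon is reachable;
    the default port here is an assumption to check against your cluster.
    """
    # Imported lazily so the SQLite demo below runs without impyla.
    from impala.dbapi import connect
    return connect(host=host, port=port)


def demo_rows():
    """Exercise fetch_rows against in-memory SQLite as a stand-in for Impala."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
    rows = fetch_rows(conn, "SELECT id, name FROM events ORDER BY id")
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo_rows())
```

Against a real cluster, the same helper would be called as `fetch_rows(connect_impala("impalad.example.com"), "SELECT ...")`, where the hostname is a placeholder.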

Apache Bigtop is a system for building Hadoop ecosystem components into a cohesive unit, and it is used to package most Hadoop distributions. This post walks through how Bigtop builds RPM packages for each of the components.

A guest post on the Cloudera blog by WibiData engineer Jonathan Natkins describes how to integrate a custom service into Cloudera Manager. The integration relies on a new feature of Cloudera Manager 5 called custom service descriptors. If you’re using Hadoop ecosystem components not supported by Cloudera with CDH, this offers an opportunity to manage them alongside the Hadoop services.

The DataStax blog has an interesting article explaining how they provision and test Cassandra across multiple data centers and 1000 nodes in the cloud.

The Hortonworks blog is doing a series on resilience/high-availability for the YARN ResourceManager (RM). The first phase of this work, a mechanism for persisting the state of the RM to a data store, is implemented (with HDFS and ZooKeeper as the supported stores). Clients must use a new RMProxy library to survive an RM restart.
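As a rough illustration of what enabling RM state persistence involves, the sketch below generates a `yarn-site.xml` fragment. The two property names and the state-store class names are recalled from the Apache YARN ResourceManager restart documentation and are not taken from the Hortonworks post, so verify them against your Hadoop version. The function names are hypothetical, and store-specific connection settings (the ZooKeeper quorum or the HDFS path) are deliberately omitted.

```python
import xml.etree.ElementTree as ET

# State-store implementations, as recalled from the YARN docs (assumption: verify per version).
STORE_CLASSES = {
    "zookeeper": "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "hdfs": "org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore",
}


def rm_recovery_properties(store="zookeeper"):
    """Return the core properties that turn on RM state persistence."""
    if store not in STORE_CLASSES:
        raise ValueError("unknown store: %s" % store)
    return {
        "yarn.resourcemanager.recovery.enabled": "true",
        "yarn.resourcemanager.store.class": STORE_CLASSES[store],
    }


def render_yarn_site(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_yarn_site(rm_recovery_properties("zookeeper")))
```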

Mortar Data has a post about integrating MongoDB and Hadoop. The post includes links to their documentation, which describes several strategies for accessing MongoDB data in Hadoop, and there is a video from their CEO describing how to build a recommendation engine with Hadoop and MongoDB.


Videos from Hadoop Summit in Amsterdam in early April have been posted online. The talks cover five tracks, and slides for many of the talks are posted, too.

In a post entitled “Spark on fire,” the DBMS2 blog describes recent Spark news and how companies are deploying Spark. The post notes that Spark 1.0 is expected to be released later this month, and discusses SparkSQL and applications of Spark for machine learning.

Another post on the DBMS2 blog covers Cloudera’s SQL-on-Hadoop positioning. Cloudera supports both Hive and Impala, and it’s not always clear which system should be used for which type of processing (at least in the longer term). It’ll also be interesting to see how Shark and SparkSQL fit into Cloudera’s strategy.

Cloudera and MongoDB have expanded their partnership to include co-marketing and co-selling of each other's software. There are also plans to support live-snapshotting of MongoDB data to a Hadoop cluster for analysis.

Pepperdata, makers of Hadoop cluster supervisor and analysis software, have announced a Series A round of financing totaling $5M. They will use the money to grow their team and further product development.

In the third of three posts this week, the DBMS2 blog enumerates the details (and adds some speculation) on the recent Intel investment in Cloudera. It includes some of the short and medium-term goals of the relationship and specifics on the financial transaction.

ComputerWeekly has an article that explores whether Hadoop should complement or replace a data warehouse. It paints a picture of Hortonworks being in the “complement” camp while Cloudera is in the (eventually) “replace” camp. It also includes quotes from Teradata's CTO, who doesn’t think that replacing an EDW with Hadoop makes financial sense.

InformationWeek has a story on Datameer’s software, which takes a different approach than other systems. Instead of relying on a SQL-on-Hadoop system to answer queries to power a BI tool, it offers a spreadsheet and visualization tool that operates directly on data in HDFS or another data store.


Hortonworks has announced a Tech Preview of Apache Spark for HDP 2.1. The preview is based on Apache Spark 0.9.1, and Hortonworks has published RPM and deb packages for installing the software.

Cloudera announced the 1.3.1 release of Impala. The new version includes improvements to memory handling and additional SQL functions.

Apache Tajo 0.8.0 was released. Tajo is a low-latency, distributed SQL-on-Hadoop system that also supports additional platforms and data stores. The new release includes a number of new SQL features, improved performance and scalability, added support for new storage systems and formats (including Amazon S3 and Parquet), and much more. The Apache blog has full coverage of the new features.

A new version of Apache Kafka was released. This is a bug fix release containing 13 fixes, including a fix for a deadlock.

Radoop 2.0 was released this week by the company of the same name. Radoop integrates predictive analytics tools from RapidMiner with Hadoop.


Curated by Mortar Data



HBaseConHackathon (San Francisco) - Tuesday, May 6, 2014


An Introduction to Apache HBase, MapR Tables, and Security (Phoenix) - Wednesday, May 7, 2014


Revenue Management and Hadoop, 'Data Hubs' & the Data Center Transformation (Boulder) - Thursday, May 8, 2014


Advanced Hadoop Based Machine Learning (Austin) - Wednesday, May 7, 2014


Teradata & The Ohio State University to Present (Dublin) - Tuesday, May 6, 2014

District of Columbia

Big Data Week 2014 Meetup (Washington) - Monday, May 5, 2014


2nd Annual Big Data Breakfast (Columbia) - Tuesday, May 6, 2014

New York

Hadoop Developer Day (New York) - Tuesday, May 6, 2014

Bridging the gap, OLTP and Real-Time Analytics in a Big Data World (New York) - Tuesday, May 6, 2014

Apache Spark - Easier and Faster Big Data + Collaborative Filtering (New York) - Wednesday, May 7, 2014

Intermediate Workshop II: Writing MapReduce Applications (New York) - Friday, May 9, 2014


BigDataCloud Mini Conference 2014 (Bangalore) - Tuesday, May 6, 2014

Introduction to Hadoop (Mumbai) - Wednesday, May 7, 2014

Hadoop by example (Hyderabad) - Saturday, May 10, 2014

Bangalore Baby Hadoop Meetup (Bangalore) - Saturday, May 10, 2014


Special Event: Future of Data - Doug Cutting, Founder of Hadoop (Sydney) - Tuesday, May 6, 2014


SQL for Hadoop (Ontario) - Wednesday, May 7, 2014