Data Eng Weekly

Hadoop Weekly Issue #108

15 February 2015

One of the themes in this week's newsletter is the changing role of core Hadoop--from marrying Mesos and YARN in project Myriad to the growth of cloud deployments to folks using Spark without Hadoop. Many folks have predicted that 2015 will be the year for Hadoop in the cloud, and it'll be interesting to see what announcements there are along those lines at Strata+Hadoop World in San Jose this week.


Cloudera Manager 5.3 contains a new upgrade wizard that assists in performing rolling upgrades between minor releases. This post describes how to use the upgrade feature and the types of checks it performs as part of the operation.

Pachyderm is an ambitious project to create a "modern Hadoop" built on Docker and CoreOS. This post espouses the benefits of the modern approach, such as built-in job pipeline support and being language agnostic thanks to its use of an HTTP API for implementing tasks. The post also talks about the Pachyderm File System (which is copy-on-write) and cluster management via Fleet and Etcd.

O'Reilly has a post about Project Myriad, a framework that combines Mesos and YARN. MapR, eBay, and Mesosphere are all working on Myriad, which uses Mesos to launch YARN NodeManagers to elastically control the size of a YARN cluster. The post has a good introduction to the key differences between Mesos and YARN and discusses the importance of security in making the system enterprise ready.

This tutorial describes how to use Oozie and Sqoop to import data from MySQL to HBase. The number of projects involved in the task creates some complexity, so the post includes a troubleshooting section describing several potential problems and their solutions.
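For a rough idea of the Sqoop half of such a job, here is a minimal sketch (not taken from the tutorial) that assembles a `sqoop import` command line targeting HBase. The flags are standard Sqoop 1 import options; the JDBC URL, user, password file, table names, column family, and row key are all hypothetical placeholders.

```python
import shlex


def build_sqoop_hbase_import(jdbc_url, username, password_file,
                             mysql_table, hbase_table, column_family, row_key):
    """Return the argv list for a Sqoop 1 import from MySQL into HBase.

    All argument values are supplied by the caller; nothing here is
    specific to the tutorial's cluster or schema.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the MySQL database
        "--username", username,
        "--password-file", password_file,   # file holding the DB password
        "--table", mysql_table,             # source table in MySQL
        "--hbase-table", hbase_table,       # destination table in HBase
        "--column-family", column_family,   # HBase column family to write into
        "--hbase-row-key", row_key,         # source column used as the row key
        "--hbase-create-table",             # create the HBase table if missing
        "--num-mappers", "1",               # single mapper keeps the sketch simple
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    argv = build_sqoop_hbase_import(
        jdbc_url="jdbc:mysql://db.example.com/shop",
        username="etl",
        password_file="/user/etl/.mysql-password",
        mysql_table="orders",
        hbase_table="orders",
        column_family="d",
        row_key="order_id",
    )
    print(shlex.join(argv))
```

When the same import runs from an Oozie Sqoop action, the command is typically given without the leading `sqoop`, so `argv[1:]` would be the part to embed in the workflow definition.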

This post walks through the steps necessary to install Hue 3.7.1 on HDP 2.2 (version 2.6.1 is bundled with HDP) on Ubuntu 12.04. There are a number of screenshots and thorough instructions, which include a custom build of Hue and manual configuration.

The Cloudera blog has the first part of a series on the recovery mechanisms built into the HDFS write pipeline. It discusses the write path, leases, lease recovery, and block recovery. The write process involves the client, NameNode, and DataNode, and the post describes the role (and possible states) of each of these before, during, and after a recovery. It's a very technical post but is full of useful information for anyone working with HDFS.

Databricks has a post describing several of the main additions and improvements to Spark in 2014 (including Java 8 support, high availability, Spark SQL, and API stability) as well as plans for the future (including a merge of SparkR into Spark, a pluggable data source API, and improvements to Databricks Cloud).

Last October, AWS announced the AWS Directory Service, which provides Active Directory in the AWS cloud. Several tools in the Hadoop ecosystem support LDAP authentication (AWS Directory Service is compatible with LDAP). This post focuses on configuring an Amazon EMR cluster using Hue with LDAP authentication. The post describes how to configure AWS Directory Service and how to launch an EMR cluster with Hue's LDAP configuration.

The Cloudera blog has a post about Couchdoop, the Couchbase connector for Apache Hadoop. The post shows how to use Couchdoop as a command-line program to import/export data as well as how to use the input and output formats from Java.


Hadoop Summit Europe is April 15-16 in Brussels, Belgium. The agenda for the conference is now available. Sessions span six tracks and include talks by vendors, folks from industry, committers, and more.

The complete "Learning Spark" book is now available in ebook form, and it will be available in print later this week. A post on the Databricks blog has more details on the content and includes a discount code.

The Register has an interview with Cloudera CSO Mike Olson. The post discusses Cloudera's IPO plans (they have plenty of cash, so don't expect anything too soon), their partnership with Intel, recent acquisitions, and more.

MapR's Ted Dunning and Ellen Friedman have been busy writing a string of short books about large-scale analysis. This post has a Q&A with the authors, in which they discuss their books "Time Series Databases," "A New Look at Anomaly Detection," and "Innovations in Recommendation."

Datadog announced this week that they've acquired Hadoop-as-a-service vendor Mortar. Mortar will be shutting down their service in the coming months, but they have prepared a guide to running one's Mortar code without the service. (Mortar has syndicated this newsletter, and they have curated the events section for almost 100 issues. Datadog will continue to contribute the event content.)

Cloudera and Cask, makers of the Cask Data Application Platform (CDAP), have announced a partnership to collaborate on product development. In a blog post about the announcement, Cloudera describes how CDAP helps to make Hadoop simpler to work with by abstracting away the underlying complexity and exposing a unified API. If you want to try it out, Cloudera has also posted instructions describing how to deploy CDAP with Cloudera Manager and integrate it with Cloudera Impala.

Hadoop-as-a-Service vendor Qubole has posted a look at their growth during 2014. Their system is processing over 100 petabytes/month (up from 34 petabytes in all of 2013).
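Since the two figures use different time windows (per month versus a full year), a quick back-of-the-envelope calculation helps put them on the same footing. This sketch assumes the 2013 volume was spread evenly across twelve months and treats "over 100 petabytes/month" as exactly 100, so the result is a rough lower bound rather than a number from Qubole's post.

```python
PB_PER_MONTH_2014 = 100.0   # "over 100 petabytes/month", taken as a floor
PB_TOTAL_2013 = 34.0        # petabytes processed in all of 2013


def monthly_average(total_pb, months=12):
    """Average petabytes per month, assuming an even spread over the period."""
    return total_pb / months


def growth_factor(current_monthly_pb, previous_total_pb, months=12):
    """Ratio of the current monthly volume to the previous period's monthly average."""
    return current_monthly_pb / monthly_average(previous_total_pb, months)


if __name__ == "__main__":
    avg_2013 = monthly_average(PB_TOTAL_2013)
    factor = growth_factor(PB_PER_MONTH_2014, PB_TOTAL_2013)
    print(f"2013 average: {avg_2013:.2f} PB/month")
    print(f"Growth in monthly volume: roughly {factor:.0f}x")
```

Under those assumptions, the 2013 average works out to a little under 3 PB/month, so the current monthly volume is on the order of 35 times higher.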

The MapR blog has a post about the Unique Identification Authority of India, which uses MapR's distribution for the biometric database. The post describes the scale, security features, enrollment process, and more.

Hortonworks and Hitachi Data Systems announced a partnership this week. Hitachi will resell Hortonworks' platform, and the companies will partner to build a reference configuration that marries Hortonworks' distribution with Hitachi's hardware.

Datanami has an article that reflects on the growing number of Apache Spark integrations that bypass Hadoop. For example, DataStax integrates Spark with Cassandra, Databricks Cloud doesn't use Hadoop, and Spark can integrate with existing RDBMSes (there's a look at the integration with MemSQL). The post also notes that running Spark standalone was the most popular response in a recent Typesafe survey.

Hortonworks has announced a trio of new certification exams: HDP Certified Developer, HDP Certified Java Developer, and HDP Certified Administrator. The Developer exam is a hands-on, performance-based exam (as opposed to multiple choice).

RelayHealth, a company that performs claims processing, is working to deploy Spark Streaming. This post describes several applications for Spark Streaming in the healthcare space and the hurdles that RelayHealth has had to clear to deploy it.


On the heels of the Apache Kafka 0.8.2 release, Sematext has announced support for monitoring of Kafka 0.8.2 in their SPM Performance Monitoring product.

Etsy has open-sourced Sahale, a visualization and analysis tool for Cascading workflows. The tool helps developers improve jobs by exposing job metrics, identifying individual MapReduce jobs in a workflow, accessing Hadoop logs, and more. The tool uses MySQL for storage, is a Node.js web application, and uses a Scala library to instrument a Cascading workflow upon launch. The introductory blog post has many more details and screenshots.

Version 0.18.0 of the Kite SDK was released. The new version contains a tool for importing tar archive files, support for CSV headers, and upgrades of several dependencies.

Syncsort announced a new version of DMX-h, their tool for interfacing Hadoop with mainframes and other legacy big data sources. The new version supports a pluggable execution layer, can export data directly to Avro and Parquet files, and much more.

Luigi v1.0.24 was released this week. It includes a number of fixes and improvements, including a new API for setting parameters, improvements to FTP support, improvements to Redshift support, and much more.


Curated by Datadog



HBase User Group: Strata + Hadoop World Meetup (Santa Clara) - Tuesday, February 17

Big Data Science @Strata Conference (San Jose) - Tuesday, February 17

DataFrames for Large-Scale Data Science (San Jose) - Tuesday, February 17

February SF Hadoop Users Meetup (San Francisco) - Wednesday, February 18

Elasticsearch SV: Strata + Hadoop World Meetup (San Jose) - Wednesday, February 18

February Hive User Group Meetup (Palo Alto) - Thursday, February 19

Analyzing Hadoop Metrics for Security Using Hive (Fremont) - Friday, February 20


Backup and Disaster Recovery in Hadoop (Scottsdale) - Wednesday, February 18


MapR Presenting: Drill Demo! (Salt Lake City) - Wednesday, February 18


Stream Analytics at Trueffect (Fort Collins) - Thursday, February 19


HBase/NoSQL Design Patterns (Houston) - Monday, February 16

Real-time Big Data Analytics with Spark and Solr (Austin) - Wednesday, February 18


Building a Recommendation Engine with Spring & Hadoop (Chicago) - Tuesday, February 17


Tuning Java for Big Data (Independence) - Wednesday, February 18

Initial Kickoff Meeting of Cincinnati Spark Meetup (West Chester) - Wednesday, February 18

Data Integration in Hadoop with SSIS (Mason) - Wednesday, February 18


Spark (Richmond) - Wednesday, February 18

Comparing Splice Machine Hadoop RDBMS with Other SQL-on-Hadoop Tools (Vienna) - Wednesday, February 18

Securing Your Big Data in Hadoop: Strategies & Best Practices (Herndon) - Thursday, February 19


First Meeting! Hadoop Engine Survey, Dev/Test/Prod Strategies (Baltimore) - Thursday, February 19

North Carolina

PySpark, Presented by Tim Hopper (Durham) - Thursday, February 19

Datastax Presenting on Cassandra (Winston Salem) - Thursday, February 19

New York

Get the Most Out of Spark on YARN (New York) - Thursday, February 19

New Hampshire

Apache Spark Streaming with Apache Flume (Manchester) - Tuesday, February 17

ENGLAND

A Deep Dive into Apache Cassandra & an Introduction to Apache Spark/Cassandra (Reading) - Thursday, February 19


Parallel in R (Stockholm) - Monday, February 16


Application Architecture with Hadoop (Tel Aviv-Yafo) - Wednesday, February 18


Introduction to Apache Spark: A Practical Approach (Bangalore) - Saturday, February 21