Data Eng Weekly

Hadoop Weekly Issue #100

21 December 2014

This week’s newsletter marks the milestone of issue 100, which goes out to over 4,000 recipients. The ecosystem has changed a lot in the last 100 issues, which you can see from this week’s coverage of Storm, Spark, Slider, and Kafka. Thanks to everyone who has contributed great content to this newsletter in the past. Please keep it coming!


This post describes how to send metrics from Storm to StatsD & Graphite. The tutorial describes how to integrate the necessary libraries into a build, how to use the Storm metrics API, and how to access the graphs once data has made it to Graphite.
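For readers unfamiliar with StatsD, the following is a minimal Python sketch of the plain-text line format that a StatsD daemon accepts over UDP. It is not the code from the tutorial, which works through the Storm metrics API on the JVM. The function names and the metric name in the usage note are hypothetical, and the host and port (the conventional StatsD default of 8125) are assumptions about a local setup.

```python
import socket

# Common StatsD metric types: counter, gauge, timer (milliseconds).
VALID_TYPES = ("c", "g", "ms")


def format_statsd(name, value, metric_type="c"):
    """Build one StatsD line of the form <name>:<value>|<type>."""
    if metric_type not in VALID_TYPES:
        raise ValueError("unsupported metric type: %r" % (metric_type,))
    return "%s:%s|%s" % (name, value, metric_type)


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send a single metric line as one UDP datagram (fire and forget).

    host and port are assumptions; point them at your own StatsD daemon.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling `send_statsd(format_statsd("storm.example.tuples", 5))` would emit a counter increment of 5 under a made-up metric name. StatsD aggregates such lines and periodically flushes them to Graphite, which is where the graphs mentioned in the tutorial come from.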

The AWS big data blog has a post on building a recommendation system using Mortar (disclosure: Mortar syndicates Hadoop Weekly and curates the events section). The tutorial walks through using Mortar’s system locally as well as in a distributed setting on Amazon EMR.

The Hortonworks blog has a recap of a recent webinar discussing new HDFS features such as heterogeneous storage, encryption, and security enhancements. They’ve posted a Q&A from the webinar, which has a lot of good information about how these features work (or will work) in practice.

The Hue blog has a post on setting up Hue for HDP 2.2. It describes the custom configuration and gotchas related to the integration.

Hive-on-Spark has recently made a lot of progress, and there’s a new way to try out the integration. The team from MapR, Intel, IBM, and Cloudera has created a pre-configured Amazon AMI for trying it out.

The High Scalability blog has a post on how Bloomberg is using HBase to serve time-series data. The post describes how they’re replacing a legacy system with HBase, which provides 1000x faster writes and 3x faster reads. With that said, they’re eagerly anticipating the upcoming timeline-consistent standby region servers, which should help improve read latency in the face of failure in order to meet SLAs.

Cloudera Enterprise is available on the Microsoft Azure Marketplace for deployment in the cloud. This tutorial describes how to launch a CDH cluster in Azure, which brings up a cluster with Cloudera Manager for configuration.

This presentation provides an overview of using Spark together with Cassandra. The first part gives an introduction to Spark, the second looks at Spark and Cassandra (including how CQL integrates), and the last part looks at integrating Spark Streaming with Cassandra.


MapR and SAS announced an alliance this week. The companies will collaborate on Hadoop integration, customer support, and more.

The Pivotal blog has a post that helps shed some more light on the term ‘data lake.’ They walk through 10 ways that people use a data lake, which uses Hadoop for storage and integrates an MPP database and open-source tools.

HBaseCon 2015 will take place on May 7th in San Francisco. The Call for Papers is now open, and early bird registration runs through Feb 1st (for $375).

Hortonworks has created new certifications called HDP Operations Ready, HDP Security Ready, and HDP Governance Ready. As part of the announcement, Teradata, Cisco, VMware, and Syncsort have all been certified as HDP Ops Ready.

The MapR blog has a post about the growing community support for Spark. It highlights how many vendors are collaborating on Spark, and that there has already been significant adoption in community projects. Examples include Apache Crunch, the Kite SDK, Pig on Spark, and a preview of Hive on Spark.

Computer Business Review has 10 Hadoop predictions for 2015. They cover both the business side of Hadoop (growth rate, consolidation) and technical considerations like Spark vs. Hadoop and SQL for Hadoop.


Apache Storm 0.9.3 was recently released. Highlights of the release include an improved Kafka integration (for writing data to Kafka), new support for writing data to HDFS, HBase integration, reduced dependency conflicts (via shading), and node.js support.

Scaldual is a new framework for working with Lingual using Scalding. Lingual is a SQL interface extension to Cascading, and the project provides APIs for translating between Lingual and Scalding.

spark-kafka is a new integration between Spark and Kafka. While Spark has built-in support for Spark Streaming on Kafka, this one uses Kafka as a source for RDDs.

Apache Slider is a framework for deploying distributed applications in YARN, with built-in support for Apache HBase and Apache Accumulo. Version 0.60.0 was recently released with support for Kerberos clusters, support for Hadoop on Windows, improved failure handling, integration with Apache Ambari, and more. The Hortonworks blog has more info on the release and upcoming plans for Slider.

Cloudera has announced a new Cloudera Labs project, SparkOnHBase. The integration, for Scala and Java, supports creating an RDD/DStream from a Scan and performing delete/put operations based on the contents of an RDD/DStream. The introductory post provides code examples and describes optimizations built into SparkOnHBase.

Minotaur is a new open-source framework for managing Apache Kafka, Apache Mesos, and CDH clusters in AWS using Cloud Formation. An introductory post describes how the system works and provides examples of usage.

Apache Spark 1.2.0 was released this week. The release notes have a good overview of the major changes and improvements. Among the new features are Scala 2.11 support, a new Python streaming API, a write ahead log for streaming high-availability, and several improvements to MLlib, Spark SQL, and GraphX. Also, to improve performance, bulk transfers now use Netty, and there is a new shuffle mechanism.


Curated by Mortar Data



ICT Applications, Cancer Genomics and Data Sciences (Fremont) - Sunday, December 28


Big Data & eCommerce (Paris) - Tuesday, December 23


Spark Meetup (Hangzhou) - Sunday, December 28