Data Eng Weekly

Hadoop Weekly Issue #97

23 November 2014

Lots of news out of Europe this week as both StrataConf Barcelona and ApacheCon EU took place. In addition to several interesting presentations from those conferences, this week’s issue contains several articles on Spark and YARN. Also, Stripe has open-sourced four Hadoop tools, and there were releases from open-source projects Apache Pig and Apache Chukwa as well as vendor releases by Splice Machine and HP’s Vertica. One quick editorial note—with the short week in the US for Thanksgiving, I’m going to skip next week’s issue and resume with issue #98 on December 7th.


This tutorial describes the steps needed to run an HBase cluster with Microsoft Azure’s HDInsight, and how to build a simple application to interact with it. It details how to set up a new Maven application, configure hbase-site.xml, and build an executable jar for running a test against HBase.

This post looks at a number of key configuration parameters relating to YARN memory usage (a common area to tweak). It covers configuration for MapReduce, the YARN scheduler, and YARN application managers.
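As a rough illustration of why these settings interact, here is a minimal sketch of the container arithmetic. The property names in the comments are standard Hadoop settings, but the values are hypothetical, and the rounding rule (requests rounded up to a multiple of the scheduler's minimum allocation) is a simplification that may not match every scheduler or version. The function names are made up for this example.

```python
import math


def normalize_request(requested_mb, min_alloc_mb, max_alloc_mb):
    """Round a memory request up to a multiple of the minimum allocation.

    Assumption of this sketch: a request above the maximum is an error.
    """
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds the scheduler's maximum allocation")
    multiples = max(1, math.ceil(requested_mb / min_alloc_mb))
    return multiples * min_alloc_mb


def containers_per_node(node_mb, requested_mb, min_alloc_mb, max_alloc_mb):
    """Count how many containers of the normalized size fit on one node."""
    granted_mb = normalize_request(requested_mb, min_alloc_mb, max_alloc_mb)
    return node_mb // granted_mb


# Hypothetical values, for illustration only.
node_mb = 24576    # yarn.nodemanager.resource.memory-mb
min_alloc = 1024   # yarn.scheduler.minimum-allocation-mb
max_alloc = 8192   # yarn.scheduler.maximum-allocation-mb
map_mb = 1536      # mapreduce.map.memory.mb

granted = normalize_request(map_mb, min_alloc, max_alloc)
per_node = containers_per_node(node_mb, map_mb, min_alloc, max_alloc)
print(f"granted {granted} MB per map container, {per_node} containers per node")
```

With these numbers a 1536 MB request is granted 2048 MB, so a quarter of each container's memory goes unused unless the request or the minimum allocation is adjusted.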

This post is a good overview of Spark and how it compares to MapReduce. It introduces RDDs, describes the computation model (and how it differs from MapReduce’s), and compares Spark to Cascading/Scalding. The post also explores Spark’s reputation as being in-memory focused and what that means for performance.

Hadoop’s default deployment doesn’t include any sort of authentication - the system trusts that the client is who it says it is. To enforce authentication, Hadoop has support for Kerberos. This post gives a quick intro to the key concepts of Kerberos and how they integrate with Hadoop.

This presentation describes HBase’s upcoming support for richer encodings and data types. It looks at the new OrderedBytes API, the DataType API (for structs, unions, etc), and what’s upcoming. There are also examples of using the new APIs.
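For readers unfamiliar with why an order-preserving encoding matters for a store that sorts keys as raw bytes, here is a minimal sketch of the general idea for signed 32-bit integers. It is not HBase's OrderedBytes wire format, and the function names are hypothetical; it only shows that shifting values into the unsigned range and writing them big-endian makes byte order agree with numeric order.

```python
import struct

OFFSET = 2 ** 31  # shifts the signed 32-bit range onto 0 .. 2**32 - 1


def encode_int32(value):
    """Encode a signed 32-bit int so that byte-wise order matches numeric order."""
    if not -OFFSET <= value < OFFSET:
        raise ValueError("value does not fit in a signed 32-bit integer")
    # Big-endian unsigned: the most significant byte is compared first.
    return struct.pack(">I", value + OFFSET)


def decode_int32(data):
    """Invert encode_int32."""
    return struct.unpack(">I", data)[0] - OFFSET


values = [-5, 3, -2 ** 31, 0, 2 ** 31 - 1, -1]

# Sorting the encoded bytes gives the same order as sorting the numbers.
by_bytes = [decode_int32(b) for b in sorted(encode_int32(v) for v in values)]

# A plain two's-complement encoding does not: negatives sort after positives.
naive = sorted(struct.pack(">i", v) for v in values)
naive_order = [struct.unpack(">i", b)[0] for b in naive]

print(by_bytes)
print(naive_order)
```

The same concern applies to other types (floats, variable-length strings, composite keys), which is the kind of problem an ordered encoding API sets out to address.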

Twitter has posted about their system for building an index of the entire corpus of tweets. The system relies heavily on Hadoop, both for storage and for aggregating and preparing the data (via Pig jobs).

Apache Samza is a stream processing framework built on YARN. Samza was originally open-sourced by LinkedIn, and they have written about operating Samza (and Apache Kafka) at scale. The post describes their deployment system as well as their metric collection and alerting setup.

This presentation explores the state of Hadoop and RDF. The two projects providing the most complete support are the Apache Jena project, which has experimental modules for RDF in Hadoop, and the Intel Graph Builder, which supports generic graph data via Pig UDFs.

SAMOA: Scalable Advanced Massive Online Analysis is a new system aiming to be the “Mahout of Streaming.” Open-sourced by Yahoo, the project includes support for several backends including Apache Storm, Apache S4, and Apache Samza. The system implements a number of machine learning algorithms, including Distributed Stream Clustering and Vertical Hoeffding Tree Classifier.

A new (alpha) feature of the upcoming Hadoop 2.6 release will be the ability to run Docker containers as part of YARN applications. SequenceIQ has an overview of this feature, and how it can also be used when YARN is already running inside of Docker.

The Altiscale blog has an article about the ubiquity of the Hive metastore. It notes how several new SQL-on-Hadoop systems support it (Impala, Presto, Spark SQL, Drill) and that they do this in order to be compatible with Hive. Both of these turn out to be good things for users—if your data is in HDFS and the Hive metastore, it’s easy to use in many other systems.

The SequenceIQ blog has a post on building a hybrid Hadoop deployment in the cloud by using a permanent cluster and adding ephemeral clusters as needed. They show how to build clusters with their Cloudbreak tool (cloud-agnostic provisioning built on Docker and Ambari) and how to use their other tool, Periscope, to autoscale the ephemeral clusters.


The Call For Abstracts for Hadoop Summit Europe 2015 closes on December 5th. The conference takes place in April in Brussels, Belgium.

MapR, whose distribution is available as an option for Amazon Elastic MapReduce (EMR), has a post with some news on the integration. First, MapR is now supported via EMR on the latest instance families. Second, hourly pricing for EMR+MapR will be lower for both the M5 and M7 enterprise versions (M3 continues to be free).

MapR and Teradata announced an expanded partnership in which Teradata will provide MapR’s distribution as part of the Teradata Unified Data Architecture.

GigaOm has an article on eHarmony’s infrastructure ambitions around Hadoop and OpenStack. As part of the transition, they’re moving from several instances of a Hadoop appliance to a single YARN cluster as well as bringing up new technologies like Spark and Storm.

Datanami has an article on the rise of Spark. They note that, for the first time, Apache Spark has surpassed Apache Hadoop on Google Trends. The post looks back at the highlights of Spark this year, including the growing number of contributors and the recent sort workload results.

Qubole and Microsoft Azure have announced a new strategic relationship in which Qubole’s Big Data-as-a-Service platform is available on Azure.


Splice Machine has announced general availability of their RDBMS built on Hadoop. Version 1.0 includes support for SQL:2003 (including analytics functions) and native backup and recovery, and it integrates with HCatalog. Splice Machine is positioned as a cheaper alternative to scaling out a traditional RDBMS.

Apache Pig 0.14.0 was released this week. Major highlights of the release include support for Apache Tez as a backend, OrcStorage, and loader predicate push down. The release supports Hadoop 0.23.x, 1.x, and 2.x.

Stripe has open-sourced four new projects for Hadoop. The projects are: Timberlake, a dashboard for YARN and MRv2; Brushfire, a system for distributed learning of ensemble tree models inspired by Google’s PLANET; Sequins, a static database backed by SequenceFiles; and Herringbone, a tool for working with Parquet files and Impala/Hive.

HP has announced a new version of their Vertica columnar MPP analytics engine that runs on Hadoop. It supports many major distributions, including MapR, Cloudera, Hortonworks, and Apache Hadoop.

Apache Chukwa version 0.6.0 was released. Chukwa is a system for distributed monitoring and analysis of data in log files. The new release adds support for HBase, deprecates the Chukwa collector, and resolves a number of bugs.


Curated by Mortar Data



Stream Processing on Hadoop (Saint Paul) - Monday, November 24


All Models Are Wrong, Some Are Useful (Plano) - Monday, November 24


Options for Streaming Analytics on Azure and Azure Batch (London) - Tuesday, November 25


Hadoop Meetup on Cascading/Tez with Concurrent and Hortonworks (Paris) - Tuesday, November 25


November Meetup (Mannheim) - Monday, November 24

Apache Spark (Hamburg) - Wednesday, November 26


What's New with Apache Spark? An Evening with Paco Nathan (Amsterdam) - Monday, November 24


Offline and Real-time Click Stream Processing (Stockholm) - Wednesday, November 26


Apache Spark: Easier and Faster Big Data (Sydney) - Thursday, November 27


Spark Singapore First Meetup! (Singapore) - Wednesday, November 26