Data Eng Weekly

Hadoop Weekly Issue #95

09 November 2014

This week’s issue has great technical content, including articles about data infrastructure from small companies, Buffer and Asana, to a large company, Facebook (and their big data challenges). There’s also coverage of a diverse set of topics related to YARN: Kafka on YARN, a comparison of YARN and Mesos, and the YARN timeline server. In industry news, Databricks’ recent sort benchmarking results have earned a tie for first place in this year’s Daytona GraySort contest.


The Buffer developer blog has a post on how they’ve evolved their analytics data infrastructure from just Mongo and Amazon SQS to also include Hadoop and Redshift. They use Mortar’s Hadoop-as-a-Service to run Pig scripts which load data from Mongo to S3 to Redshift. Luigi, the open-source Hadoop workflow engine from Spotify, is used for orchestration.

Facebook recently posted about several data problems that the company is facing. The look at big data challenges gives you a flavor for Facebook’s data sizes/volumes and internal systems (several powered by Hadoop). Some of the problems are faced by many folks working with big data infrastructure (e.g. how to sample data and which types of compression to use), and some are unique to large-scale companies (e.g. distributing a data warehouse across data centers).

The Cloudera blog has a post on using Spark Streaming for near-real-time session analysis. The post includes an example job which feeds data into HBase to power BI tools via the Hive adapter. The code for this system is available on GitHub, and the post takes a detailed look at what the major parts of the example Spark Streaming job are doing.

This post looks at the relationship between YARN and Mesos. There’s a fairly direct mapping between major components (e.g. YARN ResourceManager ~ mesos-master with meta-scheduler), but resource allocation is different in the two systems (Mesos is push-based, YARN is pull-based).
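To make the push/pull distinction concrete, here is a toy Python sketch. It is not the Mesos or YARN API: `PushMaster`, `PullManager`, `Framework`, and `Node` are hypothetical names, and the only thing modeled is who initiates the exchange. In the push model the master offers free capacity and the framework decides what to accept; in the pull model the application states what it needs and the manager decides where to place it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int


@dataclass
class Framework:
    """Hypothetical framework that accepts offers until its need is met."""
    cpus_needed: int

    def on_offer(self, node_name, cpus_offered):
        take = min(self.cpus_needed, cpus_offered)
        self.cpus_needed -= take
        return take


@dataclass
class PushMaster:
    """Toy push model: the master offers resources, the framework chooses."""
    nodes: list

    def offer_round(self, framework):
        launched = []
        for node in self.nodes:
            # The master initiates by offering this node's free capacity.
            accepted = min(framework.on_offer(node.name, node.free_cpus), node.free_cpus)
            if accepted > 0:
                node.free_cpus -= accepted
                launched.append((node.name, accepted))
        return launched


@dataclass
class PullManager:
    """Toy pull model: the application asks, the manager chooses placement."""
    nodes: list

    def request(self, cpus):
        granted = []
        for node in self.nodes:
            if cpus == 0:
                break
            # The application initiated; the manager decides where to grant.
            take = min(cpus, node.free_cpus)
            if take > 0:
                node.free_cpus -= take
                cpus -= take
                granted.append((node.name, take))
        return granted


push = PushMaster([Node("a", 2), Node("b", 4)])
push_result = push.offer_round(Framework(cpus_needed=3))

pull = PullManager([Node("a", 2), Node("b", 4)])
pull_result = pull.request(3)
```

Both calls end up with the same placement here; the difference is where the placement decision lives, which is the design choice the comparison post is about.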

Hortonworks has posted a video, slides, and a Q&A from a recent webinar on the new features and improvements in Hive as part of HDP 2.2. This version brings a first set of deliverables, including support for insert/update/delete and the cost-based optimizer.

This post shows how to deploy the YARN Timeline Server using Apache Ambari blueprints. The timeline server is still a work in progress, but you can get an idea of what types of information it currently supports with the screenshots linked to in the post.

DataTorrent has blogged about a new project to bring Apache Kafka to YARN. The so-called KOYA (Kafka on YARN) project plans to leverage YARN for Kafka broker management, automatic broker recovery, and more. Planned features include a fully-HA application master, sticky allocation of containers (so that a restart can access local data), a web interface for Kafka, and more. The post invites folks in the community to help build KOYA.

O’Reilly Radar has a post on schemas for data. It discusses why it’s tempting to use formats with implicit schemas (e.g. JSON, CSV), the benefits of schema, and why Apache Avro is a good solution. There’s a bit of detail on Avro and its file format, which stores the schema with the data.
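The idea of storing the schema with the data can be sketched in a few lines of Python. This is a simplified stand-in and not Avro's actual binary container format: the helpers `write_container` and `read_container` are hypothetical, and the file is just a JSON schema header followed by rows of bare values. The schema dictionary follows the JSON shape Avro uses to declare a record.

```python
import io
import json


def write_container(stream, schema, records):
    """Write the schema once as a header line, then one row per record."""
    field_names = [f["name"] for f in schema["fields"]]
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        missing = [name for name in field_names if name not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        # Rows carry values only; the field names live in the header schema.
        stream.write(json.dumps([record[name] for name in field_names]) + "\n")


def read_container(stream):
    """Recover the schema from the header and use it to decode each row."""
    schema = json.loads(stream.readline())
    field_names = [f["name"] for f in schema["fields"]]
    records = [
        dict(zip(field_names, json.loads(line)))
        for line in stream
        if line.strip()
    ]
    return schema, records


schema = {
    "type": "record",
    "name": "PageView",
    "fields": [{"name": "user", "type": "string"}, {"name": "ms", "type": "long"}],
}
buffer = io.StringIO()
write_container(buffer, schema, [{"user": "ann", "ms": 120}, {"user": "bob", "ms": 95}])
buffer.seek(0)
decoded_schema, decoded = read_container(buffer)
```

A reader needs nothing but the file itself to interpret the rows, which is the property that implicit-schema formats like CSV lack.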

The Cloudera blog has a post on the role of HBase in the Hadoop ecosystem. It discusses when it’s more appropriate to use Cloudera Impala (or any MPP engine atop HDFS) vs. HBase. Often folks end up duplicating the data between systems, which leads to overhead and questions about the source of truth.

Mortar Data has posted a video (and slides) of a presentation by Mayur Rustagi of Sigmoid Analytics on the Pig-on-Spark initiative. The presentation is from the NYC Pig User Group meetup that took place during Strata + Hadoop World.

Asana has written about the evolution of their data infrastructure and the tools that they’re using. Like Buffer, Asana is loading data into Redshift and is using Luigi for managing dependencies. They are also using Elastic MapReduce. The post walks through their philosophy for building data infrastructure: mainly, don’t over-engineer things from the beginning.

The Cloudera blog has a post about integrating Flume with Kafka. On the Kafka -> Flume side, the integration allows you to deploy Kafka and serialize data to HDFS, HBase, or any other Flume sink without writing any custom code. The integration also supports Flume -> Kafka, in which case a local agent can buffer data. The post also describes upcoming work on a Kafka Channel for Flume.

Amazon recently announced a new Linux AMI, version 2014.09. While it’s not yet the default AMI for Elastic MapReduce, it offers a lot of compelling features for building a Hadoop (or other big data) cluster in AWS. Those features come via the 3.14.19 Linux kernel, which includes improvements for memory management (zram, zcache, zswap), TCP (fast open enabled by default), and btrfs. This post discusses how those improvements might enhance performance of different systems in the Hadoop ecosystem.


GridGain, makers of an in-memory "data fabric," have submitted their code to the Apache Incubator. The new project is known as Apache Ignite (incubating). In the announcement, GridGain touts it as a mature in-memory computing platform that can easily integrate with Hadoop.

In a follow-up to the earlier post on sorting 100TB and 1PB with Apache Spark, Databricks announced that their entry to the 2014 Daytona GraySort contest has tied for first place.

MapR and MongoDB have announced that the MongoDB Connector for Hadoop is certified for the MapR distribution.

Datanami has a report on the state of security for Hadoop. While a number of new projects have cropped up to add authorization, authentication, and encryption to the ecosystem, these are still pretty immature. Commercial add-ons are looking to fill this security gap. Datanami speaks with folks from Dataguise and Zettaset about the state of commercial support.

A trio of LinkedIn veterans who have worked on Apache Kafka and other data infrastructure projects have started a new company called Confluent. They will be focussing on Kafka and realtime data and have publicly committed to continue to work on Kafka (and potentially other tools, too) in open-source. There are more details about the new company in a post on LinkedIn.


Salesforce has introduced the Data Pipelines pilot for running Apache Pig queries against Salesforce data using the Salesforce platform. This post is a brief introduction and tutorial to the system.

Scoobi, the Scala API for MapReduce, has released version 0.9.0. The release includes support for Scala 2.11, improvements to serialization (WireFormats), fixes for EMR/S3, and more.

Plunger is a new open-source tool for unit testing Cascading pipelines. The GitHub project readme has several code examples of the API. The framework provides a number of utilities for testing (such as pretty printing data and testing serializers).

Amazon has announced support for HUE as part of Elastic MapReduce. It includes first-class support for data stored in S3.


Curated by Mortar Data

WEBCAST Spark + Cassandra: Technical Integration (O’Reilly Media Webcast) - Wednesday, November 12



Diving into Spark Internals + Kafka and Akka (San Jose) - Monday, November 10

Cascading: A Java Developer’s Companion to the Hadoop World (San Francisco) - Tuesday, November 11

November SF Hadoop Users Meetup (San Francisco) - Thursday, November 13

#SDBigData Monthly Meetup (San Diego) - Wednesday, November 12

Twofer: Mac Moore of GridGain & Dale Kim of MapR (Santa Monica) - Wednesday, November 12


Trafodion: Transactional SQL-on-HBase, by Rohit Jain (Houston) - Monday, November 10


Lighting a Spark under Cassandra and Elasticsearch (Boulder) - Tuesday, November 11


Securing Hadoop: What Are Your Options? (Chicago) - Wednesday, November 12


Michigan Hadoop User Group Initial Meetup (Southfield) - Monday, November 10


The Scoop about Hadoop. What Is It? How to Begin? (Harrisburg) - Tuesday, November 11


Join Us for the Kick-Off Meeting at Society of Work (Chattanooga) - Thursday, November 13


Hadoop: A Look under the Hood (West Hartford) - Tuesday, November 11


Big Data: Unconference (Toronto) - Friday, November 14


How Secure Is Your Entire Hadoop Cluster? (Manchester) - Tuesday, November 11

Hadoop, R, Spark, and the Reverend Bayes (London) - Tuesday, November 11

5th Spark London Meetup (London) - Tuesday, November 11


PySpark: Real-time Large-scale Data Processing with Python and Spark (Berlin) - Tuesday, November 11


How Apache Spark Fits in the Big Data Landscape (Stockholm) - Thursday, November 13


BigData and Analytics: Why to Learn Hadoop (Hyderabad) - Wednesday, November 12

What Is Big Data? What Is Data Science? What Is Hadoop? (Hyderabad) - Saturday, November 15

Our First Meetup (Pune) - Saturday, November 15

Apache Spark and the Power of In-memory Computation (Bangalore) - Saturday, November 15