Data Eng Weekly

Hadoop Weekly Issue #103

11 January 2015

The first full week of 2015 has been fruitful for Hadoop-related content (especially technical). There are a number of great posts, including several from the Data Day Texas conference as well as coverage of Drill, Spark, Storm, and more.


Apache Lens is a relatively new incubator project (originally from InMobi) for providing a unified interface for analytics on data stored in different systems (HDFS, HBase, RDBMS, S3, etc). This post describes the evolution of the data warehouse at InMobi which led to Lens (formerly Grill) and gives an overview of the Lens architecture.

Apache Drill aims to be an interactive query engine for a wide range of data sources. This post looks at all the different data types and ways that you can query data using Drill. For instance, data can be stored in JSON files, Hive, HBase, and more. Drill exposes interfaces for querying this data using BI tools, an interactive prompt, and via an HTTP REST API.

The Databricks blog has a new post on the ML pipeline API, which was introduced in Spark 1.2 (and is considered experimental). The API aims to help automate tasks that are often done manually as part of building production machine learning pipelines. For example, the API supports feature transformations: appending new columns to an existing dataset.
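The core idea of the pipeline API (stages that append derived columns to a dataset, chained in order) can be sketched without Spark. The classes below are illustrative stand-ins modeled on the pattern, not the actual Spark ML API; here a dataset is just a list of dicts and a tokenizer stage appends a new column of words.

```python
# Minimal sketch of the pipeline pattern: each stage transforms a
# dataset by appending a new column, and a Pipeline chains stages.
# Illustrative only -- not the Spark ML classes themselves.

class Tokenizer:
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, dataset):
        # Append a new column holding the split words of the input column.
        return [dict(row, **{self.output_col: row[self.input_col].split()})
                for row in dataset]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, dataset):
        # Apply each stage in order, feeding each one's output forward.
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset

data = [{"text": "hadoop weekly issue"}]
result = Pipeline([Tokenizer("text", "words")]).transform(data)
print(result[0]["words"])  # ['hadoop', 'weekly', 'issue']
```

The appeal of the real API is the same as in this toy: feature transformations compose, so a production pipeline is one declared object rather than hand-written glue code between steps.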

The MapR blog has a post comparing and contrasting two resource managers for distributed systems: Apache Mesos and Apache Hadoop YARN. The post is available in both video and transcript form. One of its conclusions is that the two systems can be complementary, and it points users to the Myriad framework for scaling YARN clusters on Mesos.

Spotify has written a highly technical and detailed article about their Apache Kafka and Storm setup, which is used for recommendations, ad targeting, and more. The post describes their software testing strategy, metrics, alerting, hardware (they process 3 billion events per day across 6 nodes), performance tuning, and more.

This post describes the new DockerContainerExecutor for YARN that was introduced in Apache Hadoop 2.6. This feature allows Docker containers to run as YARN containers, which means one could package all system-level dependencies of a YARN job into a Docker container for deployment. Not only does the post describe how to use this feature, but it takes it one step further by describing how to use the DockerContainerExecutor when Hadoop itself is running inside of Docker containers.

Cloudera has declared HDFS’ transparent encryption as production-ready as part of the CDH 5.3 release. This post discusses the design and features of HDFS encryption, provides some basic examples for using it, and talks about the performance impact.

This post gives an overview of Apache Samza at LinkedIn, where the project was originally built. The post describes the architecture of Samza, including its state storage system and fault tolerance properties. There’s also a case study describing how LinkedIn uses Samza to do call graph assembly for service monitoring.

Hadoop Streaming is a system for writing MapReduce jobs in non-JVM languages. This post describes how to use node.js for MapReduce on Amazon’s Elastic MapReduce. The tutorial details how to bootstrap an EMR cluster with node.js installed, write a simple MapReduce job, and deploy the job using the EMR command-line tools.
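The streaming protocol itself is language-agnostic: a mapper or reducer is just a program that reads records on stdin and emits tab-separated key/value pairs on stdout. The post uses node.js; an equivalent word-count mapper and reducer in Python shows the same contract (the function names and CLI argument here are this sketch's own, not anything Hadoop prescribes).

```python
# Hadoop Streaming runs ordinary executables: records arrive on stdin,
# and tab-separated key/value pairs go out on stdout.
import sys
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical keys arrive adjacent to one another.
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked as e.g. `wordcount.py map` or `wordcount.py reduce`.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

On EMR, the same script would be passed as the `-mapper` and `-reducer` arguments of a streaming step, exactly as the tutorial does with its node.js scripts.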

Spark SQL supports reading data from Hive tables like several SQL-on-Hadoop systems. In addition to that, it can read data stored in Parquet, JSON, and CSV even if the data isn’t part of a Hive table. This feature is not something found in most systems—although Apache Drill can do it, too. This post gives a quick intro to the Spark SQL Data Sources API and how to use it via Spark SQL.
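What makes querying non-Hive data possible is that formats like JSON are self-describing, so a schema can be inferred from the records themselves rather than looked up in a metastore. A rough pure-Python illustration of that idea (this is a sketch of the concept, not the Spark implementation):

```python
# Infer a schema from self-describing JSON-lines records by unioning
# the fields seen across records and noting each field's type --
# roughly what an engine must do when there is no Hive table to consult.
import json

def infer_schema(lines):
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, type(value).__name__)
    return schema

records = [
    '{"name": "drill", "stars": 120}',
    '{"name": "spark", "stars": 4500}',
]
print(infer_schema(records))  # {'name': 'str', 'stars': 'int'}

# With a schema in hand, SQL-style predicates can run directly:
rows = [json.loads(record) for record in records]
print([row["name"] for row in rows if row["stars"] > 1000])  # ['spark']
```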

TPC Express Benchmark HS is a new TPC benchmark for big data systems. This week, Cisco published benchmark results for a 16-node cluster running MapR’s distribution. These are the first results for the new benchmark, so we’ll have to wait for some additional vendors to publish results to see how they stack up.

With the rise of new DSLs and computing frameworks, most folks aren't writing raw MapReduce jobs using the Java API anymore. This post serves as a good reminder of all the progress that’s been made over the past few years. Specifically, it looks at Scalding and Spark, which offer a rich API for writing big data processing jobs.
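The conciseness gap is easy to see: a word count that takes a Mapper class, a Reducer class, and a driver in the raw Java API collapses to a short chain of transformations in a Scalding- or Spark-style API. A toy dataset wrapper in Python (purely illustrative, not Spark's RDD) shows the chained style:

```python
# A toy dataset with Spark/Scalding-style chained transformations.
class Dataset:
    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # One input item may produce many output items.
        return Dataset(x for item in self.items for x in fn(item))

    def map(self, fn):
        return Dataset(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        # Combine all values sharing a key with the given function.
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return Dataset(acc.items())

lines = Dataset(["hadoop weekly", "hadoop news"])
counts = (lines.flat_map(str.split)
               .map(lambda word: (word, 1))
               .reduce_by_key(lambda a, b: a + b))
print(dict(counts.items))  # {'hadoop': 2, 'weekly': 1, 'news': 1}
```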

Apache Flume is a popular solution for delivering data to HDFS from an application server tier. This post provides a detailed overview of Flume’s macro architecture, the Flume agent architecture, and how to use Flume for the common use-case of log aggregation.
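Flume's agent architecture boils down to a source putting events onto a channel (a buffer that decouples producer from consumer) and a sink draining the channel to a destination. This miniature version mirrors that source → channel → sink flow; the class names are this sketch's own, and a real HDFS sink would write files rather than collect into a list.

```python
# A toy Flume agent: source -> channel -> sink. Illustrative only.
from collections import deque

class Channel:
    """Buffers events so the source and sink can run at different rates."""
    def __init__(self):
        self.queue = deque()
    def put(self, event):
        self.queue.append(event)
    def take(self):
        return self.queue.popleft() if self.queue else None

class LogSource:
    """Reads log lines and puts them on the channel as events."""
    def __init__(self, channel):
        self.channel = channel
    def ingest(self, lines):
        for line in lines:
            self.channel.put({"body": line})

class CollectingSink:
    """Drains the channel; a real HDFS sink would write files instead."""
    def __init__(self, channel):
        self.channel = channel
        self.delivered = []
    def drain(self):
        while (event := self.channel.take()) is not None:
            self.delivered.append(event["body"])

channel = Channel()
LogSource(channel).ingest(["GET /index 200", "GET /about 404"])
sink = CollectingSink(channel)
sink.drain()
print(sink.delivered)  # ['GET /index 200', 'GET /about 404']
```

In the log-aggregation use case, many application servers each run an agent like this, with downstream agents fanning events in toward HDFS.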

This presentation describes the rise of Python for data science, some of the big data tools that exist today for Python, and provides some suggestions for improving Python’s big data support in the future.

The Kite SDK provides an API for describing, storing, and accessing data in Hadoop. This presentation details what exactly that means—describing the key abstractions, providing examples, and describing the architecture and tools.

As a rather new system, the tools for debugging a Spark workflow are still fairly immature. In addition, the simple interface that Spark’s APIs provide hides a lot of complexity that can be necessary to understand in order to debug a problem. This presentation looks at several common Spark job failures, and it explains the underlying system mechanics to help understand the root cause.


The O’Reilly Data Show Podcast recently interviewed UC Berkeley Professor and Databricks CEO Ion Stoica about the origins of Mesos, Spark, and Tachyon. This post has selected transcripts of the interview, in which you hear about early work on Mesos and Spark. Ion credits much of the projects’ success to the students who worked on them.

The call for speakers for HBaseCon 2015 is open until February 6th. Early bird registration is also open, and the conference takes place on May 7th in San Francisco.


Apache Falcon, the data processing and lineage system, recently released version 0.6.0. The Hortonworks blog has an overview of new features of the release, including authorization with ACLs for entities, enhancements to lineage metadata, and archiving to a cloud system such as S3 or Azure.

The folks at SequenceIQ have published a new Docker image supporting Spark 1.2.0. The post has a quick walkthrough of building the Docker image, running a container from the image, and launching a Spark job.

Pinpoint is a new open-source Application Performance Management tool based on the Google Dapper architecture. Pinpoint uses HBase for data storage.

ASAP is a new stream processing framework. Unlike other stream processing systems, ASAP focuses on ad hoc querying. ASAP uses Apache Kafka for interconnecting pipelines.

RecordBreaker is an open-source tool from Cloudera that’s been around for a while but that I’ve only just learned about. Its purpose is to extract structured Avro records from text-formatted files.


Curated by Mortar Data



What's Coming for Spark in 2015 (San Francisco) - Tuesday, January 13

Machine Learning for Real-time Bidding on Spark (Santa Monica) - Wednesday, January 14

HBase Meetup @AppDynamics (San Francisco) - Thursday, January 15

HBase+Phoenix Developer Meetup (San Francisco) - Thursday, January 15

Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale (Los Angeles) - Thursday, January 15


Using Spark to Increase Efficiency in Mobile Marketing at Tune (Seattle) - Wednesday, January 14


Application of Hadoop On-Demand (Tempe) - Wednesday, January 14


Hadoop Lunch at Adobe (Lehi) - Thursday, January 15


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, January 12


Network Design Considerations and Challenges for Hadoop Big Data Environments (Reston) - Wednesday, January 14

Discuss Migrating Oracle Databases and Apps to Splice Machine Hadoop RDBMS (Chantilly) - Thursday, January 15


HBase Intro and Hands On… Session 1 (Vancouver) - Thursday, January 15


Primera Reunión de Apache Spark (Mexico City) - Thursday, January 15


Hadoop at and (London) - Tuesday, January 13


Conoce Spark Streaming (Madrid) - Thursday, January 15


Big Data and Real-time Analytics with Spark (Bangalore) - Friday, January 16