Data Eng Weekly

Hadoop Weekly Issue #119

03 May 2015

There were a couple of noteworthy releases this week: Pivotal HAWQ now supports HDP, and Yahoo open-sourced a new distributed key-value store (which makes heavy use of Kafka). The technical content this week is equally exciting: using Spark to analyze accelerometer data, a new initiative to speed up Spark (Project Tungsten), and a deep dive into restart recovery for YARN's NodeManager. Finally, the Apache Parquet project has graduated from the Apache Incubator. Congrats to everyone involved!


This presentation (and video) provides an introduction to Apache Cassandra and the Cassandra-Spark integration for batch and stream processing (via Spark Streaming).

The Los Angeles HUG recently hosted a presentation on debugging Spark jobs. The talk discusses how to diagnose a Spark error, describes Spark's architecture, describes Spark on YARN, and enumerates several common errors.

This post describes how to analyze accelerometer data with Apache Spark's MLlib. With the goal of classifying user activity as walking, sitting, jogging, ascending/descending stairs, or standing, the author describes the data preparation steps, the features extracted from the raw dataset, and the use of MLlib's decision trees to classify the data. The post contains several inline code examples, and the full source is available on GitHub.

This presentation from the recent Lustre User Group conference describes deploying Hadoop atop the Lustre file system, which is a common scenario in HPC setups. The presenters compare Lustre and HDFS, describe the hardware setup, describe the Hadoop Adapter for Lustre, describe how the LustreFileSystem works, provide some experimental results, and detail lessons learned and best practices.

The Morning Paper had a two-part look at "Musketeer: all for one, one for all in data processing systems." Part 1 looks at "What's the best data processing system?" (spoiler: there's no single best platform), and part 2 looks at Musketeer, which is an extensible system for choosing the best backend engine for a particular query.

This post summarizes some of the differences between Apache Ignite (incubating) and the Tachyon project, both of which provide an in-memory file system for big data systems like Spark and MapReduce. There seems to be some tension around this comparison, but I found the content interesting (and would be interested in hearing the Tachyon Project's perspective as well).

Project Tungsten is a new initiative to speed up Spark via improvements to memory and CPU efficiency. The focus areas include: Memory Management and Binary Processing (sun.misc.Unsafe, a new HashTable implementation), Cache-aware Computation (cache locality for sorting, joining, etc.), and Code Generation (improving serialization, vectorization).

This post describes the (approximate) equivalent Spark APIs for several APIs in MapReduce: combiners, counters, partitioners, and serializers. It's a good overview that contains several useful pointers (such as countByValueApprox() and Spark's KryoSerializer) for anyone working with Spark.
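As a rough illustration of the combiner idea mentioned above, here is a plain-Python sketch (no Spark required) of pre-aggregating within each partition before merging across partitions. This is the general pattern behind MapReduce combiners and Spark's reduceByKey; the partition contents and the helper names `combine_locally` and `merge` are made up for the example and do not come from the linked post.

```python
from collections import Counter


def combine_locally(partition):
    """Pre-aggregate (word, count) pairs within one partition."""
    local = Counter()
    for word, count in partition:
        local[word] += count
    return local


def merge(partials):
    """Merge the per-partition partial results into final totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("spark", 1), ("yarn", 1), ("spark", 1)],
    [("spark", 1), ("kafka", 1)],
]

partials = [combine_locally(p) for p in partitions]
totals = merge(partials)

# Number of records that would cross the network after local combining.
records_shuffled = sum(len(p) for p in partials)
```

Local combining shrinks the five input records to four before the merge step. In PySpark, the same aggregation would be written roughly as `rdd.reduceByKey(lambda a, b: a + b)`.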

YARN has added the ability for NodeManagers (NMs) to recover their state after a restart. To do this, the NMs use LevelDB on the local file system to save state, including information about the containers they have spawned. On restart, the NM reads the state back from LevelDB and begins watching any existing containers. The Hortonworks blog has many more details on this feature, including how to enable it and some of the implementation details around security tokens.
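The save-then-recover cycle described above can be pictured with a small toy model. This is not YARN code: a JSON file stands in for LevelDB, and the `StateStore` and `NodeManager` classes, the container ID, and the pid are all hypothetical. It only shows the shape of the idea: record each container in a local store at launch, then read the store back when a new instance starts.

```python
import json
import os
import tempfile


class StateStore:
    """JSON file standing in for the NodeManager's LevelDB store (assumption)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        with open(self.path, "w") as f:
            json.dump(state, f)


class NodeManager:
    def __init__(self, store):
        self.store = store
        # On start, read back whatever a previous instance recorded.
        self.containers = store.load()

    def launch(self, container_id, pid):
        # Record the container before returning so a restart can find it.
        self.containers[container_id] = {"pid": pid}
        self.store.save(self.containers)


state_dir = tempfile.mkdtemp()
store = StateStore(os.path.join(state_dir, "nm-state.json"))

nm = NodeManager(store)
nm.launch("container_01", pid=4242)

# Simulate a restart: a new NodeManager instance over the same state file.
restarted = NodeManager(store)
recovered_ids = sorted(restarted.containers)
```

After the simulated restart, `recovered_ids` holds the container that the first instance launched, which is the point at which a real NM would resume watching it.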


This Friday, the Apache Hadoop community is hosting a "bug bash" to help identify and register important patches. A post to common-dev has instructions for anyone who is interested in contributing.

Apache Parquet, the columnar storage format, has graduated from the Apache Incubator. Parquet is a popular format that is supported by many processing engines in the Hadoop ecosystem, is compatible with many serialization frameworks (Apache Avro, Apache Thrift, Protobuf, and others), and is in use at a number of internet companies (e.g. Netflix, Twitter, and Stripe).

SDTimes has an in-depth article on the big data ecosystem. It includes interviews with representatives from Concurrent, Cloudera, Hortonworks, NuoDB, and more. Topics covered include: building a Hadoop-based data warehouse, the role of SQL in big data, and the origin of data science for big data.

The Cloudera blog has a post about the importance of HBase within the Hadoop ecosystem. The post includes details from several companies, including Cask, Celer Technologies, Pepperdata, and Splice Machine, about how they are using HBase and plan to use it in the future.


Pivotal has announced that the latest release of Pivotal HAWQ, the SQL-on-Hadoop engine, is available to run on the Hortonworks Data Platform.

Pistachio is a distributed key-value store that was open-sourced this week by Yahoo. It powers user profile storage, which handles ~2 million reads/sec and ~0.5 million writes/sec. Pistachio is replicated (using ZooKeeper to elect a master among m replicas), partitioned, and strongly consistent. All writes go through a master and are persisted to Kafka before being consumed by all replicas to update local storage (which can be in-memory or backed by Kyoto Cabinet or RocksDB).
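The write path described above can be pictured with a toy model. The sketch below is not Pistachio's API: a plain Python list stands in for a Kafka partition, and the `Log`, `Replica`, and `Master` classes are hypothetical names chosen for illustration. It only shows the ordering: a write is appended to the log first, and each replica then applies log entries in offset order to its local store.

```python
class Log:
    """Append-only log standing in for one Kafka partition (assumption)."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self.entries[offset:]


class Replica:
    """Holds a local key-value store and its position in the log."""

    def __init__(self, log):
        self.log = log
        self.store = {}  # stands in for in-memory or on-disk local storage
        self.next_offset = 0

    def catch_up(self):
        # Apply every entry not yet consumed, in log order.
        for key, value in self.log.read_from(self.next_offset):
            self.store[key] = value
            self.next_offset += 1


class Master:
    """All writes go through the master, which appends them to the log."""

    def __init__(self, log):
        self.log = log

    def write(self, key, value):
        return self.log.append(key, value)


log = Log()
master = Master(log)
replicas = [Replica(log) for _ in range(3)]

master.write("user:1", {"country": "US"})
master.write("user:1", {"country": "CA"})  # later write to the same key wins

for replica in replicas:
    replica.catch_up()
```

Because every replica in this model applies the same entries in the same order, all three end up with identical stores, and a replica that falls behind only needs to resume from its last consumed offset.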


Curated by Datadog



Introduction to Apache Ignite (Playa Vista) - Tuesday, May 5

Samza Meetup (Mountain View) - Tuesday, May 5


We Are Going to Introduce the Spark Platform (Addison) - Monday, May 4


Docker Containers in Yarn: Make Your Complex Jobs Play Nice (Champaign) - Monday, May 4


Apache HBase Overview & Bonus Preso (Farmington Hills) - Monday, May 4


Hadoop File System and MapReduce (Duluth) - Tuesday, May 5

District of Columbia

Apache Spark Hands-on Workshop (Washington) - Wednesday, May 6


Getting Started with Spark: How Smart Is Your App? (Cambridge) - Thursday, May 7


Big Data and Apache Hadoop: Just the Basics (Toronto) - Thursday, May 7


Second Apache Spark Meetup: Hands-on (Mexico City) - Saturday, May 9


Hadoop Users Group May 2015 Meetup (London) - Tuesday, May 5

Spark London Meetup @ Strata + Hadoop World (London) - Tuesday, May 5

Big Data London Meetup at Strata + Hadoop World (London) - Wednesday, May 6

Strata + Elastic Meetup (London) - Wednesday, May 6


A Philosophy of Building Data Pipelines (Stockholm) - Monday, May 4


Big Data Analysis with Spark & Cassandra (Koln) - Wednesday, May 6


Use Spark to Solve a Real-World Problem (Herzeliyya) - Monday, May 4


Reliable Real-Time Streaming and Analytics with Kafka + Storm (Hyderabad) - Wednesday, May 6

Real-Time Stream Processing Using Apache Storm (Hyderabad) - Saturday, May 9