Data Eng Weekly

Hadoop Weekly Issue #138

20 September 2015

Given that I skipped publication last week, there is quite a bit of content to cover. Technical articles cover HBase, Apex, Impala, Spark, YARN, Flink, and more. In news, there are two Hadoop-related podcast episodes. Finally, the Spark 1.5 release was 11 days ago, and Pinterest open-sourced a new key-value store project, Terrapin. With this breadth of content, there should be something for everyone.


The Apache blog has a post about HBase at Bloomberg. The post describes HBase, the types of problems (e.g. serving time series data) that Bloomberg is solving with it, several performance tweaks and improvements that they've made, and more.

Apache Apex is a new incubator project (as of August) for stream and batch processing of big data. Apex is based on the DataTorrent RTS engine. The DataTorrent blog has an introduction to Apex that covers its architecture and goals. A second post has a tutorial that shows how to build an application with Apex and Malhar, the built-in operator library.

A new paper describes a mechanism for optimizing energy consumption and performance in HDFS by adding support for hybrid storage (SSD and HDD). With their modified Hadoop build, they show up to 20% energy savings and also see improvements in the speed of MapReduce jobs (e.g. by storing temporary shuffle data on SSDs).

Cloudera has published a new benchmark analysis of Impala, which focuses on how the system performs under heavy multi-user load. They show near linear speedup as the size of the cluster grows and graceful degradation as the number of concurrent users increases. The post also describes Impala's Admission Control, which keeps latency low by limiting the number of concurrent queries running on a cluster.
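As a rough illustration of the admission control idea (cap the number of running queries, queue a few more, and turn away the rest), here is a minimal Python sketch. It is not Impala's implementation: the class name, the `max_running` and `max_queued` limits, and the FIFO queueing policy are all assumptions made for the example.

```python
from collections import deque


class AdmissionController:
    """Toy admission control: cap running queries, queue some, reject overflow.

    Hypothetical sketch only; names and policy are not Impala's.
    """

    def __init__(self, max_running, max_queued):
        self.max_running = max_running
        self.max_queued = max_queued
        self.running = set()
        self.queued = deque()

    def submit(self, query_id):
        """Return 'running', 'queued', or 'rejected' for a new query."""
        if len(self.running) < self.max_running:
            self.running.add(query_id)
            return "running"
        if len(self.queued) < self.max_queued:
            self.queued.append(query_id)
            return "queued"
        return "rejected"

    def finish(self, query_id):
        """Mark a query done and admit the oldest queued query, if any."""
        self.running.discard(query_id)
        if self.queued and len(self.running) < self.max_running:
            admitted = self.queued.popleft()
            self.running.add(admitted)
            return admitted
        return None


controller = AdmissionController(max_running=2, max_queued=1)
results = [controller.submit(q) for q in ["q1", "q2", "q3", "q4"]]
print(results)                  # ['running', 'running', 'queued', 'rejected']
print(controller.finish("q1"))  # q3
```

Because only a bounded number of queries ever execute at once, each admitted query sees a predictable share of the cluster, which is the intuition behind keeping latency low under heavy multi-user load.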

The Platform has an interview with Airbnb's VP of Engineering about their software infrastructure. Airbnb runs in AWS, and there is an interesting section in the interview about their Hadoop and data warehouse deployment. Topics covered include the Airflow workflow engine, how they use Kafka to keep multiple Hadoop clusters in sync, and their separation of Hadoop clusters for ad hoc and business critical workloads.

Spark 1.5, which was released last week (more below), includes several new features for Spark's R bindings. These include improved AWS integration, a new Spark-driven glm method, additional data types, and support for regular expressions.

The MapR blog has an overview of configuring Spark for YARN. It covers the configuration parameters for the Application Master and Spark containers in both yarn-client and yarn-cluster mode.
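One piece of arithmetic that often comes up when sizing Spark containers on YARN is that the container request is the JVM heap plus an off-heap overhead allowance. The sketch below assumes an overhead of 10% of the heap with a 384 MB floor; those defaults and the function name are assumptions for illustration and are not taken from the MapR post, so check the Spark documentation for the version in use.

```python
def yarn_container_mb(heap_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate the memory requested from YARN for one Spark container.

    heap_mb is the JVM heap (e.g. the executor memory setting).
    overhead_fraction and min_overhead_mb are assumed defaults.
    """
    overhead_mb = max(min_overhead_mb, int(heap_mb * overhead_fraction))
    return heap_mb + overhead_mb


print(yarn_container_mb(1024))  # 1408 (the 384 MB floor applies)
print(yarn_container_mb(4096))  # 4505 (10% of 4096, truncated to 409)
```

YARN may additionally round each request up to its own allocation increment, which this sketch does not model.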

If you're new to Kafka, this post illustrates the key concepts of Kafka using familiar Unix commands and pipes.
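For readers who prefer code to shell pipelines, the central idea (an append-only log that several consumers read at their own pace by remembering an offset) can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's API and not taken from the linked post; `ToyLog` and the consumer names are hypothetical.

```python
class ToyLog:
    """Toy append-only log; a record's offset is its position in the list."""

    def __init__(self):
        self.records = []

    def append(self, record):
        """Append a record and return the offset it was written at."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self.records[offset:offset + max_records]


log = ToyLog()
for line in ["login alice", "login bob", "logout alice"]:
    log.append(line)

# Each consumer tracks its own offset, so reads never remove data from the log.
offsets = {"fast": 0, "slow": 0}

fast_batch = log.read(offsets["fast"])
offsets["fast"] += len(fast_batch)

slow_batch = log.read(offsets["slow"], max_records=1)
offsets["slow"] += len(slow_batch)

print(offsets)  # {'fast': 3, 'slow': 1}
```

The Unix analogy is roughly appending lines to a file while several readers each follow it from their own remembered position.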

The Flink blog describes how they've optimized performance while adding support for off-heap memory. The post details the motivation and benefits of off-heap memory, the basics of Flink's implementation, and several approaches they used to optimize the implementation.

The Databricks blog has a post on several new features in Spark 1.5: built-in functions (aggregates, collection, date/time, math, and more), time interval literals (e.g. INTERVAL 3 YEAR 3 HOUR), and an experimental user-defined aggregate function interface.

Twitter has written about DistributedLog, their internal replicated log service. DistributedLog is built on Apache BookKeeper, and it has many similarities to Apache Kafka. The post describes these similarities and some of the differences, such as how BookKeeper uses a Memtable for newly-added records. The post details the architecture of DistributedLog including how it complements Manhattan to implement compare-and-set.

Imgur has recently switched from MySQL to HBase for their notifications feature. A blog post describes some of the advantages of the new HBase system, such as support for sparse columns, atomic increments, fast table scans, and linear scalability.


The O'Reilly Data Show Podcast recently interviewed Mike Cafarella, the co-founder of Hadoop and Nutch. In addition to the full audio, the O'Reilly Radar blog has several excerpts from the conversation. Topics covered include the early days of Hadoop, Hadoop's maturation, and Cafarella's current research areas.

The Udemy Industry Insights podcast recently spoke with Ken Krugler, President of Scale Unlimited. The conversation was about big data and Hadoop, and many of the highlights of the interview were extracted to create an "All About Hadoop" infographic.

Cloudera's One Platform Initiative reiterates the company's commitment to Apache Spark. Their goal is to continue to mature Spark, particularly in the areas of security, scale, management and streaming. Security and scale are areas where Spark falls short of MapReduce and are seen as barriers for replacing MapReduce with Spark as the main engine for Hadoop.

"Hadoop and Kerberos: The Madness Beyond the Gate" is now available as a GitBook. The book is currently in pre-release and includes 15 chapters.

The DBMS2 blog has an update on DataStax and Cassandra. It covers the main use cases that DataStax customers are powering with Cassandra, some of the tools that folks are using with Cassandra (e.g. Spark, Storm, Kafka, and Solr), specifics of the Cassandra 2.2 release, and more.


Pydoop is a Python library for Hadoop MapReduce and HDFS. Version 1.1.0 adds support for HDP 2.2, performance improvements for Avro integration, and more.

Amazon Web Services announced that Amazon Redshift, the hosted data warehousing system, has added support for user defined functions written in Python. This tutorial describes how to write and use a UDF.
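To give a sense of what such a UDF looks like, the sketch below shows a small scalar function in plain Python alongside the kind of `CREATE FUNCTION` statement that would wrap the same body for Redshift. The function name `f_greater` and its logic are illustrative choices rather than something taken from the announcement, and the DDL is held in a string here so the example runs without a Redshift connection.

```python
def f_greater(a, b):
    """Return the larger of two numbers (the body a UDF might contain)."""
    if a > b:
        return a
    return b


# The same logic wrapped as a Redshift scalar UDF. The function body sits
# between the $$ markers; this string is for illustration and is not executed.
CREATE_UDF_SQL = """
CREATE FUNCTION f_greater (a float, b float)
  RETURNS float
STABLE
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;
"""

print(f_greater(3.5, 2.0))  # 3.5
```

Once created, a function like this would be called from SQL like any built-in scalar function, for example in a `SELECT` list.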

Spark 1.5.0, which incorporates over 1400 patches, was released last week. As part of the release, the backend for DataFrames/SQL has been updated to enable code generation, improve join and sort execution, implement native memory management, and more. In addition, Spark's Machine Learning APIs have been updated with new feature transformers and algorithms, Spark Streaming has a new backpressure implementation, and the Direct Kafka API graduated (from being experimental). There are a ton of other features in this release, which are highlighted in the release notes (which also describe some known issues).

Apache Curator, the client library for Apache ZooKeeper, released version 2.9.0. The new version fixes several bugs, adds support for container nodes, and more.

Pinterest has open-sourced Terrapin, their read-only key-value service for batch data. Terrapin uses HFiles, Apache Helix, and ZooKeeper, and provides a Java client library. After a year of usage at Pinterest, Terrapin serves 180TB of data across 100 filesets.


Curated by Datadog



Deep Dive: Spark SQL + DataFrames Cassandra Connector Directly from DataStax (Santa Clara) - Monday, September 21

Reactive Streams (San Francisco) - Thursday, September 24


Create Powerful Parallel Processing Solutions with Databricks and AWS Kinesis (Seattle) - Thursday, September 24


Big Data Aggregation from the Ground Up + An Introduction to Apache Streams (Austin) - Monday, September 21


Real-Time Stream Processing with Apache Flink (Chicago) - Tuesday, September 22


Unifying Big Data Batch and Real-Time Streaming with Apache Flink (Milwaukee) - Wednesday, September 23


Spark Double Header (Arlington) - Tuesday, September 22


Spark GraphX and Streaming (Laurel) - Tuesday, September 22


Intro to Scala & Spark (Toronto) - Monday, September 21


Big Data Day (Hamburg) - Tuesday, September 22


Big Data Meetup (Bangalore) - Monday, September 21

Apache Spark Introduction and RDD Basics and Deep Dive (Bangalore) - Saturday, September 26