Data Eng Weekly

Hadoop Weekly Issue #144

08 November 2015

After skipping last week, this issue has a lot of content. Notably, there have been a bunch of releases over the past two weeks—Hadoop, Tajo, Phoenix, Slider, Apex, and Storm. In news, Hortonworks announced quarterly results, and there's a new free eBook "Hadoop with Python." Technical content includes tutorials (Apex and Kudu+Impala) and internals (Kafka and Phoenix).


The DataTorrent blog has a tutorial for writing an Apache Apex application in Scala. The tutorial shows how to set up a Maven project, write a LineReader, Parser, and Application, and run the application with dtcli.

The Confluent blog has a post describing how Kafka implements "request purgatory," which tracks requests that haven't yet succeeded or encountered an error. The original implementation uses Java's DelayQueue, which shares performance characteristics with a priority queue. The new design uses Hierarchical Timing Wheels, which offer faster, tunable performance characteristics. The post describes the implementation in detail and gives an overview of performance benchmarks comparing the old and the new.
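To give a feel for the data structure, here is a minimal Python sketch of a hierarchical timing wheel. It is an illustration only, not Kafka's implementation: the class name, method names, and parameters (TimingWheel, add, advance, tick_ms, wheel_size) are all assumptions made up for this example. Each wheel holds a fixed number of buckets; timeouts too far in the future go to a lazily created coarser wheel and are re-inserted at finer resolution as the clock advances. Tasks fire at tick resolution.

```python
class TimingWheel:
    """One level of a hierarchical timing wheel (illustrative sketch only)."""

    def __init__(self, tick_ms, wheel_size, start_ms=0):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.interval_ms = tick_ms * wheel_size      # time span covered by this wheel
        self.current_ms = start_ms - (start_ms % tick_ms)  # clock, aligned to a tick
        self.buckets = [[] for _ in range(wheel_size)]
        self.overflow = None                         # coarser wheel, created on demand

    def add(self, task, expiry_ms):
        """Store the task; return False if it has already expired."""
        if expiry_ms < self.current_ms + self.tick_ms:
            return False
        if expiry_ms < self.current_ms + self.interval_ms:
            # Insertion is O(1): pick the bucket by simple arithmetic.
            slot = (expiry_ms // self.tick_ms) % self.wheel_size
            self.buckets[slot].append((expiry_ms, task))
        else:
            # Too far away for this wheel: hand it to a wheel whose tick
            # equals this wheel's whole interval.
            if self.overflow is None:
                self.overflow = TimingWheel(
                    self.interval_ms, self.wheel_size, self.current_ms)
            self.overflow.add(task, expiry_ms)
        return True

    def advance(self, now_ms):
        """Move the clock to now_ms; return the (expiry_ms, task) pairs that fired."""
        fired = []
        while self.current_ms + self.tick_ms <= now_ms:
            self.current_ms += self.tick_ms
            if self.overflow is not None:
                # Tasks leaving the coarser wheel are re-inserted here,
                # where they land in a finer-grained bucket.
                for expiry_ms, task in self.overflow.advance(self.current_ms):
                    if not self.add(task, expiry_ms):
                        fired.append((expiry_ms, task))
            slot = (self.current_ms // self.tick_ms) % self.wheel_size
            bucket, self.buckets[slot] = self.buckets[slot], []
            fired.extend(bucket)
        return fired


wheel = TimingWheel(tick_ms=1, wheel_size=4)
wheel.add("a", 2)          # fits in the first wheel
wheel.add("b", 7)          # goes to the overflow wheel
print(wheel.advance(2))    # [(2, 'a')]
print(wheel.advance(7))    # [(7, 'b')]
```

This sketch steps the clock one tick at a time, which keeps the code short. A production implementation would avoid visiting empty ticks; the post covers how Kafka handles that.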

Hortonworks has a post describing the components and features of Spark that they've worked on in the past year, and where they're concentrating effort for the future. Past work includes ORC support, an Ambari stack definition for Spark, machine learning library improvements, and documentation updates. Future work includes maturing Apache Zeppelin, an entity disambiguation library, a new Spark + HBase integration, the ability to persist RDDs to HDFS's memory tier, and making Spark streaming more robust.

The recently released Apache Phoenix 4.6 includes support for declaring ROW_TIMESTAMP as part of a table's primary key. By doing so, the value is stored using HBase's native row timestamp, which provides performance gains. In particular, when scanning regions with HFiles that haven't been compacted, the ROW_TIMESTAMP information can be used to skip entire files. This is especially handy when reading recently-written data. The introductory blog post describes the optimization in more detail and shows example query response times with and without the feature enabled.

Kudu, the new storage engine from Cloudera, integrates with Impala for SQL access. This post describes how to set up Impala with Kudu (this currently requires a custom build of Impala), how to tell Impala about data stored in Kudu, how to perform various SQL operations (both read and write/update queries), and more.

This post describes the types of RDD persistence available in Spark. The default is memory-only, which is performant but can lead to OutOfMemoryErrors. The post has a brief overview of the performance characteristics and trade-offs of several other options.

This tutorial describes how to use Apache Ambari to install and configure the Tachyon FileSystem, which is a memory-centric distributed storage system. The post also has a brief example of using TachyonFS from Spark.

Depending on data sizes and distributions, an inner join in MapReduce can be performed efficiently in a few different ways. This post describes, at a high level, several of the strategies for implementing an inner join with MapReduce. For each (e.g. reduce-side, map-side), the post describes some of the relevant Hadoop APIs.
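The two strategies named above can be sketched in plain Python, with no Hadoop involved. This is a rough illustration of the ideas rather than the post's code: the function names and the in-memory "shuffle" are assumptions for the example. The reduce-side version tags each record with its source and pairs records per key after grouping; the map-side version assumes one input is small enough to hold in memory as a lookup table.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Tag records by source, group by key (the 'shuffle'), pair up per key."""
    # Map phase: emit (key, (source_tag, value)) for both inputs.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # Shuffle phase: the framework would group values by key across the cluster.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # Reduce phase: per key, emit every left/right combination.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, l, r) for l, r in product(lefts, rights))
    return joined


def map_side_join(large, small):
    """Replicated join: each mapper holds the small input in a hash table."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # No shuffle is needed; each large-side record is joined where it is read.
    return [(key, lv, sv) for key, lv in large for sv in lookup.get(key, [])]


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "mug"), (4, "hat")]
print(reduce_side_join(users, orders))
print(map_side_join(orders, users))
```

Keys 2 and 4 appear on only one side, so neither function emits a row for them, which is what makes these inner joins.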

Myriad is a system for running YARN atop a Mesos cluster. This post looks at how to use Docker's overlay network plugin to isolate YARN clusters (with the ResourceManager and NodeManager running inside of Docker). All clusters share a common distributed file system, which can be accessed via another network bridge. The post has many more details, plus code (including Dockerfiles and scripts), for implementing the solution.


Hortonworks announced quarterly results this week. They reported a loss of $0.74/share (adjusted) on $33.1 million in revenue, both of which beat the average analyst estimate (of those surveyed by Zacks Investment Research).

Cask Data, makers of the Cask Data Application Platform for building Apache Hadoop solutions, announced a $20 million Series B round of financing.

The Databricks blog has a recap of last week's Spark Summit EU. The post highlights and links to the slides for several of the talks from the sessions and keynotes.

"Hadoop with Python" is a new, free eBook from O'Reilly. It covers the Snakebite Python library, the mrjob MapReduce framework, writing Pig UDFs in Python, PySpark, and the Luigi Python workflow scheduler.

MapR announced their best-ever quarter of bookings, in which they saw 160% year-over-year increases in bookings and 200% growth in deal size.


Apache Phoenix 4.6, the SQL framework for HBase 0.98, 1.0, and 1.1, was released. The new release includes support for HBase native timestamps, a correlation variable, an alpha-version of a web-app for viewing trace information, and more.

Apache Tajo, the SQL-on-Hadoop data warehousing system, released version 0.11.0. The new release adds support for nested record types, ORC files, Python UDF/UDAF, tablespaces, and multi-queries. The release also includes improved performance for the JDBC drivers, joins, and more.

Apache Hadoop 2.6.2 was released last week. It includes a number of fixes to YARN and MapReduce, which have been backported from the 2.7 and 2.8 lines.

Spark TFOCS (Templates for First-Order Conic Solvers) is a "general purpose optimization package for constructing and solving mathematical objective functions." The introductory post has examples of using TFOCS for solving LASSO linear regression and linear programming problems.

Version 2.9.1 of Apache Curator, the Java library for Apache ZooKeeper, was released. The version includes several bug fixes and a new recipe for group membership.

Apache Slider 0.81.1-incubating was released. Slider is a framework and application for deploying existing distributed systems on YARN. The new release fixes several bugs and contains a few new features/improvements.

Apache Apex has released its first version, 3.2.0-incubating, since joining the Apache incubator. Apex is a data processing system for streaming and batch, and the new release contains many patches atop the 3.1.0 release.

Apache Storm 0.10.0 has been released. In beta since June, this major new version adds support for secure multi-tenant deployments, Flux (a new framework for defining Storm topologies), an improved logging framework, streaming ingest to Hive, and more.

A maintenance release of the previous major version of Storm was also released. Version 0.9.6 resolves 10 issues.


Curated by Datadog



Apache Kafka and the Rise of the Stream Data Platform (San Francisco) - Tuesday, November 10

Evening with Google Cloud, Distributed DataFrame, and Apache Flink (Mountain View) - Wednesday, November 11

Open Data Platform Initiative Is Now Open for Business: Here's What It Means (Palo Alto) - Thursday, November 12

Best Practices with Airflow: An Open Source Platform for Workflows & Schedules (San Francisco) - Thursday, November 12

Deep Dive on Spark Project Tungsten: Largest Performance Optimizations to Date (San Francisco) - Thursday, November 12


Spark MLlib: From Integration to Production (Seattle) - Wednesday, November 11


Building a Hadoop Data Application (Denver) - Thursday, November 12


Spark Smorgasbord (Mason) - Wednesday, November 11

North Carolina

Conquer Big Data Challenges in Streaming, Security and Data Flow in IoT! (Charlotte) - Tuesday, November 10

District of Columbia

Next Generation Accumulo: Iterator Tutorial and Spark (Washington) - Tuesday, November 10

New York

Real Time Big Data Processing on AWS (New York) - Tuesday, November 10

Rhode Island

Roundtable: AWS Lambda and Kinesis, Experiences and Best Practices (Providence) - Tuesday, November 10


Managing Data in Mesos: Examining Storage Options + How to Build a Data Pipeline (London) - Wednesday, November 11


Cassandra/Kafka & Zeppelin (Paris) - Tuesday, November 10


Stream & Batch Processing with Apache Flink and Event-Time Windowing (Munich) - Wednesday, November 11

Big Data Analytics with Cassandra & Spark (Karlsruhe) - Thursday, November 12

First Spark-Munich Meetup @ "Big Data Munich" (Munich) - Thursday, November 12


We're Talking Azure HDInsight! (Istanbul) - Saturday, November 14


Spark & Dataframes for Hundreds of Multi-Tenant Customers & Billions of Events (Tel-Aviv) - Tuesday, November 10


November Meetup: Hadoop Security (Singapore) - Friday, November 13