Data Eng Weekly

Hadoop Weekly Issue #125

14 June 2015

Hadoop Summit was this week in San Jose, and there were many new releases and partnerships announced. This week's newsletter contains many of these as well as slides from several presentations at the conference (please send along other ones I missed!). Among the highlights: Spark 1.4, MapR 5.0, and HDP 2.3 were released, Teradata is adding support for Presto, and there are details on how Hammerlab is integrating bioinformatics software with Hadoop.


A future version of Apache HBase will have improved support for storing cells larger than 10K, so-called Moderate Objects (MOBs). This support is already part of CDH 5.4, and this post describes the architecture of the solution. In short, when flushed, MOBs are written to a special HFile called a MOB file, which is part of a special offline region. More details of the read and write path implementation can be found in the post.
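To make the idea concrete, here is a minimal sketch of threshold-based routing at flush time. It is not HBase code: the function names, the in-memory dictionaries, and the reference marker are all hypothetical, and the 10K cutoff simply mirrors the figure mentioned above.

```python
# Hypothetical sketch of size-based routing at flush time; not HBase code.
MOB_THRESHOLD_BYTES = 10 * 1024  # assumed cutoff, mirroring the "10K" figure


def flush(cells, threshold=MOB_THRESHOLD_BYTES):
    """Split flushed cells into a regular store file and a MOB file.

    Values above the threshold go to the MOB file. The regular file keeps
    only a small reference (an assumption of this sketch) so that a read
    can locate the large value.
    """
    store_file, mob_file = {}, {}
    for key, value in cells.items():
        if len(value) > threshold:
            mob_file[key] = value
            store_file[key] = ("MOB_REF", key)
        else:
            store_file[key] = ("VALUE", value)
    return store_file, mob_file


def read(key, store_file, mob_file):
    """Resolve a key, following the reference into the MOB file if needed."""
    kind, payload = store_file[key]
    return mob_file[payload] if kind == "MOB_REF" else payload
```

The point of the split is that small cells and large cells end up in separate files, so the large values can be handled separately from ordinary store files.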

Apache Spark 1.4, which was released this week (more below), includes alpha-level support for accessing Spark from R. Using the SparkR integration, it's possible to get a Spark-backed DataFrame with an API similar to dplyr.

HDP 2.3 (more details below) supports transparent data encryption at rest. For this, it uses Apache Ranger as a Key Management Server, which provides an API for accessing encryption keys as well as a web UI for managing keys and more.

This post describes how Spark Streaming achieves fault tolerance with source data from Kafka. Early versions of the integration make use of a write-ahead log to recover data during a failure, and there's a newer direct API which provides exactly-once semantics. The post also describes some of the expected failure scenarios and how the implementation achieves robustness.
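One common way to get an exactly-once effect when the consumer tracks its own offsets is to commit each batch's result together with the offset it reached. The toy simulation below illustrates that idea only; it is not Spark or Kafka code, and the `Sink` class, `process` function, and `fail_at_batch` parameter are hypothetical names made up for the sketch.

```python
class Sink:
    """Toy output store that records results and the next offset together."""

    def __init__(self):
        self.total = 0
        self.next_offset = 0

    def commit(self, batch_sum, until_offset):
        # In a real store, these two updates would be one atomic transaction.
        self.total += batch_sum
        self.next_offset = until_offset


def process(log, sink, batch_size=2, fail_at_batch=None):
    """Consume `log` from the sink's last committed offset, batch by batch."""
    batch_no = 0
    while sink.next_offset < len(log):
        start = sink.next_offset
        end = min(start + batch_size, len(log))
        batch_sum = sum(log[start:end])
        if batch_no == fail_at_batch:
            raise RuntimeError("simulated crash before commit")
        sink.commit(batch_sum, end)
        batch_no += 1
```

Because a restart resumes from the last committed offset, a batch that crashed before its commit is simply recomputed, and no record is counted twice.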

Members of the Core Data Libraries team at Twitter presented on their experiences with optimizing Hadoop performance. Topics covered include using Xprof for profiling MapReduce jobs, optimizing intermediate compression, using Scala macros to implement raw comparators, and column projection and filtering with Parquet.
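For readers unfamiliar with raw comparators: the idea is to order records by comparing their serialized bytes directly, skipping deserialization during the sort. The sketch below shows the general technique in Python under an assumed key layout (two big-endian unsigned integers); it is not Twitter's Scala macro implementation, and all names are illustrative.

```python
import struct

KEY_FORMAT = ">IQ"  # assumed layout: big-endian uint32 id, uint64 timestamp


def serialize(user_id, timestamp):
    """Encode a key as fixed-width big-endian bytes."""
    return struct.pack(KEY_FORMAT, user_id, timestamp)


def raw_compare(a, b):
    """Compare two serialized keys without decoding them.

    Byte strings compare lexicographically, which matches numeric order
    for fixed-width, big-endian, unsigned fields.
    """
    return (a > b) - (a < b)


def deserialized_compare(a, b):
    """Reference comparator that decodes both keys first."""
    ka, kb = struct.unpack(KEY_FORMAT, a), struct.unpack(KEY_FORMAT, b)
    return (ka > kb) - (ka < kb)
```

The two comparators agree only because of the chosen encoding; signed or variable-length fields would need a more careful byte-level comparison.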

Stream processing has recently become a big topic, but it's often most useful when used in conjunction with a batch framework. This presentation describes, at a high level, a few different frameworks for achieving this. They include Summingbird, Spark, and Flink.

Twitter has done a lot of work to speed up the launch of MapReduce jobs and reduce their startup bandwidth. The MapReduce Distributed Cache builds on the YARN local cache to expose an API for caching job resources like jars and files. With these changes, jobs typically submit less than 2MB of data to HDFS during job startup and nodes in the cluster download less than 840KB. This presentation describes the design and implementation of this feature.

The Cloudera blog has a guest post that describes the experiences at LinkedIn that led to Kafka, the high-level concepts behind Kafka, and the success of Kafka as the core of the data platform at LinkedIn.

This presentation describes a number of pitfalls and related improvements/best practices when it comes to administering a Hadoop cluster. Topics covered include configuration (instance configuration and JVM settings), issues related to metadata files, HDFS ACLs, HDFS Snapshots, and DataNode volume failures.

This post describes how the folks at Hammerlab integrate bioinformatics tools with Hadoop. Since the tools expect a POSIX-compatible filesystem, they've configured NFS access to HDFS. The post describes the workflow engine that they use and one of the gotchas of the NFS setup.

This presentation gives an overview of building an end-to-end implementation of the lambda architecture (speed + batch layers) with Spark Streaming, Kafka, Cassandra, and Akka (for data ingestion). It contains brief introductions to Kafka and Cassandra, snippets of code, and examples of how to hook all the pieces together.
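As a rough illustration of how the speed and batch layers fit together, the sketch below counts page views in plain Python. It makes no use of Spark, Kafka, Cassandra, or Akka; the event shape, the `cutoff` timestamp, and the function names are assumptions made up for the example.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: periodically recomputed counts for events before the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] < cutoff)


def speed_view(events, cutoff):
    """Speed layer: incremental counts for events the batch run has not covered."""
    return Counter(e["page"] for e in events if e["ts"] >= cutoff)


def query(page, batch, speed):
    """Serving layer: merge both views to answer with fresh, complete counts."""
    return batch[page] + speed[page]
```

In this sketch the cutoff is what keeps the two views from double counting: each event lands in exactly one of them.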

Netflix presented at Hadoop Summit on their data pipeline which sends data to S3 to be queried by Presto. They cite the speed, scalability (it's running on >200 nodes), ANSI SQL support, and AWS support as reasons that they use Presto. Their presentation also covers some of Netflix's contributions and several upcoming features.

Huawei is an early adopter of Spark. This post describes some of the problems they're solving with Spark (and the massive data volumes that they see when capturing raw data), summarizes some of their open-source contributions, and enumerates some features they're working on for future releases.


The DBMS2 blog reports that Teradata is planning on supporting Presto (the open-source engine from Facebook for SQL on Hadoop and more). Several engineers from the Hadapt acquisition will work on Presto, and Teradata will sell support subscriptions. The post has more details on Presto and the short-term roadmap.

MapR and Microsoft have announced a partnership to bring MapR's distribution (including MapR-DB) to the Microsoft Azure cloud. Look for the offering sometime this summer.

KDnuggets has an interview with Beth Smith of IBM about the evolving role of analytics and analytics software. Among the topics covered are IBM's Hadoop distribution and contributions, IBM's membership in the Open Data Platform, and IBM's commitment to Spark (they recently opened a Spark Technology Center in San Francisco).


Hortonworks announced HDP 2.3 this week, which contains a number of new features to improve operations, development, and security & data governance. Many of these improvements are made possible by a new version of Ambari, which includes support for smart configuration, a UI to manage the capacity scheduler configuration, a SQL editor, an Apache Pig Latin editor, and a new UI for Apache Falcon. For enterprise support customers, Hortonworks has introduced a new SmartSense program for proactive support.

DataTorrent has announced that they're open-sourcing DataTorrent RTS, which is a unified stream and batch processing system, as Project Apex. Apex is meant to compete with Spark and Storm while striving to offer more enterprise features and better performance.

A new version of BlueData's EPIC software for managing Hadoop clusters was released this week. It adds a number of new features, including integration with Apache Ambari, support for CDH 5.3 and HDP 2.2, support for Apache Spark 1.3.1, support for Kerberos, and more. The new version also supports running Hadoop or Spark within Docker containers.

MapR 5.0 was released this week with several new features. These include a new data replication framework to replicate data from MapR-DB to Elasticsearch, the addition of Apache Hadoop 2.7, support for Apache Drill 1.x, and new auditing capabilities that produce JSON files and integrate with Apache Drill.

For folks working with Apache Kafka and using Puppet for configuration, there's a new Puppet module for managing the Confluent Platform Schema Registry.

Airflow is a new open-source workflow system from Airbnb. While the system is not Hadoop-specific, it includes support for Hive, Presto, HDFS, and other pieces of Hadoop infrastructure. Airflow is broken into several components including a web UI, a metadata repository, and a command-line interface.

Version 0.3.2 of Hivemall, the machine learning UDF library for Hive, includes support for anomaly detection using Local Outlier Factor and support for polynomial features when performing non-linear regression.
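For context, polynomial feature expansion adds products of the input features so that a linear model can fit non-linear relationships. The sketch below shows the general technique in plain Python; it is not Hivemall's UDF, and the function name, the `degree` parameter, and the `a*b` naming of generated features are assumptions of the example.

```python
from itertools import combinations_with_replacement


def polynomial_features(features, degree=2):
    """Expand {name: value} into all products of up to `degree` features."""
    expanded = {}
    names = sorted(features)
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            value = 1.0
            for name in combo:
                value *= features[name]
            # Generated feature names join the factors, e.g. "a*b".
            expanded["*".join(combo)] = value
    return expanded
```

With two inputs `a` and `b` and degree 2, the expansion contains `a`, `b`, `a*a`, `a*b`, and `b*b`.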

Cascading 3.0 was released this week with support for Apache Tez as a backend. Cascading supports all the major Hadoop distributions (CDH, MapR, HDP) as well as several Hadoop service providers (EMR, Qubole, Altiscale).

LinkedIn has open-sourced Pinot, their realtime distributed OLAP database that integrates with Kafka and Hadoop. Pinot is used to power many analytics products at LinkedIn, where it provides less than 100ms of end-to-end latency.

Apache Spark 1.4.0 was released with the new SparkR bindings, a new visualization feature, Python 3 support, and many improvements to Spark SQL, MLlib, and Spark Streaming.

PyKafka (formerly known as Samsa) has released a 1.0 version. The new version supports Kafka 0.8.2 and aims to have similar features to the JVM Kafka client. It doesn't yet have asynchronous producer support.

Version 0.9.4 of kafka-python, another popular Python Kafka client, was released. This version focusses on stability, bug fixes, documentation improvements, and cleanups.

Cloudera Enterprise 5.3.4 includes fixes for rolling upgrades, the fair scheduler, Impala, and HiveServer2. It also contains several fixes to Cloudera Manager.

Sparkit-learn is a new project to bring together PySpark and scikit-learn. The project readme has some example usages, such as building ArrayRDDs and DictRDDs and building distributed classifiers.


Curated by Datadog

UNITED STATES


Spark Summit 2015 Live Streaming (San Francisco) - Monday, June 15

Breakthrough OLAP Performance on Cassandra and Spark (Santa Clara) - Monday, June 15

Spark 1.4 Deep Dive & Spark Committers Q&A (San Francisco) - Monday, June 15

Elastic Meetup at MapR (San Jose) - Tuesday, June 16

Apache Flink: Unifying Batch and Streaming Modern Data Analysis (Redwood City) - Wednesday, June 17

Scaling with Couchbase, Kafka and Apache Spark (Pasadena) - Wednesday, June 17


Working with Spark's GraphX Libraries (Mason) - Tuesday, June 16

UNITED KINGDOM

June 2015 Hadoop User Group Meetup (London) - Tuesday, June 16

Lighting the Spark! (London) - Wednesday, June 17


Apache Flink Hands-On (Stockholm) - Wednesday, June 17


Introduction to Practical Big Data with Apache Spark (Barcelona) - Monday, June 15


First Spark Belgium Meetup (Mechelen) - Wednesday, June 17


To Hadoop or Not to Hadoop (Kongens Lyngby) - Tuesday, June 16


Introduction to Apache Flink Workshop (Berlin) - Wednesday, June 17


Mysteries of the Universe, Spark and DataFrames (Krakow) - Thursday, June 18

RUSSIA

Apache Spark (Moscow) - Thursday, June 18


Spark Streaming with Kafka (Bangalore) - Saturday, June 20

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit