Data Eng Weekly

Hadoop Weekly Issue #132

03 August 2015

This week's issue has a lot of great content covering topics like Bigtop, HBase, MapReduce, Kafka, and Spark. And in case you were afraid we were running out of new ecosystem projects, SAMOA and Zeppelin both shipped their first releases since joining the Apache incubator. In addition, Phoenix, NiFi, and Amazon EMR each had new releases this week.


The Insight Data Engineering blog published an interactive map of the various tools and systems that make up the data engineering ecosystem. It's a handy way to get an overview of how the various pieces fit together. It doesn't cover every single project, but it's easy to get the idea.

Apache Bigtop is an integration and testing framework for building Hadoop ecosystem projects and publishing a bundle of artifacts. Having a complete development environment for all of these projects can be tricky, and this post discusses a neat solution to that problem. Using Docker, the tutorial describes how to build RPMs and customize versions of various packages with Bigtop.

Actian DataFlow is an alternative compute framework with some similarities to Tez (execution model) and Spark (programming interface). This post argues that while it's mature and powerful, it likely won't succeed as part of the Hadoop ecosystem because it is proprietary software (an argument that applies more generally).

Since HBase 0.98.4, the HBase Thrift server has supported pass-through authentication. HBase 1.0 made this more powerful by supporting dynamic Thrift client users. This post describes the integration and points to various code snippets for anyone trying to take advantage of the authentication features.

This post describes the steps involved in MapReduce job submission in MRv1. The post also has some handy instructions for tweaking log levels in case you need to selectively enable debug logging to isolate an issue.

Apache Zeppelin (incubating) provides a web-based interface, similar to IPython notebooks, for Spark and Hive (and much more). The Cloudera blog has a post describing how to configure Zeppelin to run with CDH on CentOS (although most of the instructions are distribution agnostic). The post includes some example commands and screenshots from the Zeppelin UI.

Yahoo describes how they landed on Druid as a database for their interactive data applications (after evaluating many others such as Hive, RDBMS, Spark, and Impala). Key features of Druid include multi-tenancy, high availability, lock-free ingestion, extensibility (easy to integrate with proprietary systems for ingestion), and native support for custom algorithms.

The Databricks blog has details on the ML Pipelines API in Spark 1.4. Changes include a stable API, a dozen new feature transformers, and better API support from Python.

In the third post in a series, the Hortonworks blog describes two new features in HDP 2.3 and several general improvements. The new features are provisioning with Cloudbreak (for clusters on Azure, EC2, and Google Cloud) and SmartSense for proactive support. The general improvements cover several changes to YARN, HDFS, Pig, Sqoop, and Oozie.

Spark Streaming, unlike many other stream processing frameworks, uses microbatches to break up work. This post describes some of the advantages of this approach, such as dynamic load balancing, faster failure recovery, and unified library support (since a streaming microbatch is also an RDD).
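
To make the microbatch idea concrete, here is a minimal sketch in plain Python. It does not use the Spark API: the `micro_batches` and `word_count` helpers and the sample events are hypothetical, chosen only to show that once a stream is cut into small time-bounded batches, ordinary batch code can process each one unchanged.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload) -- illustrative shape


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-width time windows.

    Windows that received no events are skipped in this sketch.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        buckets[int(ts // interval)].append(payload)
    return [buckets[key] for key in sorted(buckets)]


def word_count(batch: List[str]) -> Dict[str, int]:
    """Ordinary batch logic: knows nothing about streaming."""
    counts: Dict[str, int] = defaultdict(int)
    for line in batch:
        for word in line.split():
            counts[word] += 1
    return dict(counts)


# Hypothetical input: four events spread over three one-second windows.
events = [(0.2, "a b"), (0.9, "a"), (1.1, "b b"), (2.5, "c")]
results = [word_count(batch) for batch in micro_batches(events, interval=1.0)]
```

Because each batch is an independent unit of work, a scheduler is free to place it on any worker or rerun it after a failure, which is roughly where the load-balancing and recovery advantages mentioned above come from.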

This post describes how Kafka implements compression, and how some improvements to the implementation made it about one-third faster. The speedups are thanks to smarter memory management: a custom BufferingOutputStream eliminates multiple extraneous copies, which improves performance and reduces GC overhead.
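
The general technique behind this kind of buffer can be sketched as follows. This is not Kafka's implementation; `ChunkedBuffer` and its methods are hypothetical names, and Python will not show the GC effect. The sketch assumes the usual approach of storing output in a chain of fixed-size chunks, so the buffer never reallocates and recopies what was already written, and it can hand data to the destination chunk by chunk instead of first building one contiguous array.

```python
import io
from typing import BinaryIO, List


class ChunkedBuffer:
    """Append-only byte buffer backed by fixed-size chunks (illustrative only)."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: List[bytearray] = [bytearray(chunk_size)]
        self.pos = 0  # write offset inside the last chunk

    def write(self, data: bytes) -> int:
        view = memoryview(data)
        written = 0
        while written < len(view):
            if self.pos == self.chunk_size:
                # Last chunk is full: add a new one instead of resizing.
                self.chunks.append(bytearray(self.chunk_size))
                self.pos = 0
            n = min(self.chunk_size - self.pos, len(view) - written)
            self.chunks[-1][self.pos:self.pos + n] = view[written:written + n]
            self.pos += n
            written += n
        return written

    def __len__(self) -> int:
        return (len(self.chunks) - 1) * self.chunk_size + self.pos

    def write_to(self, out: BinaryIO) -> None:
        """Stream the contents to `out` without building one big array."""
        for chunk in self.chunks[:-1]:
            out.write(chunk)
        out.write(memoryview(self.chunks[-1])[:self.pos])


# Usage: a tiny chunk size makes the chunking visible.
buf = ChunkedBuffer(chunk_size=4)
buf.write(b"hello ")
buf.write(b"world")
sink = io.BytesIO()
buf.write_to(sink)
```

By contrast, a single growable array has to copy its existing contents each time it resizes, and typically once more when the finished bytes are extracted.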

The Cloudera blog has a post describing an architectural model that combines stream processing, batch processing, and a real-time service to implement fraud detection. At the core of the system is Kafka, but it also uses HDFS and Flume.

The MapR blog has a post describing how to write, package, and deploy a custom UDF for Drill. For performance, Drill uses code generation, so the post includes tips for writing a good UDF and covers some non-standard packaging concerns.


Since many of the early adopters of Hadoop have been internet companies, it's interesting to hear perspectives from other industries. This post describes how the geological and geophysical fields are tackling big data problems. Not surprisingly, data cleansing is a major theme in other industries too.

Given that many consumers are moving more and more to mobile, it's not surprising that Qubole is seeing big growth from its mobile-centric clients (Flipboard, Pinterest, MyFitnessPal among them). There are more details on Qubole's growth and their platform in a recent press release.

The Qubole blog has a look at several of the main use cases for Apache Spark. It also touches on Spark's main weakness—multi-tenancy (since it can be memory hungry).


Apache NiFi 0.2.1 was released this week. It contains a bug fix for account creation and updates the references to the incubator since NiFi recently graduated to be a top-level project.

Apache Phoenix, the SQL-on-HBase engine, released version 4.5. The Apache blog has a summary of new features and improvements in the 4.4 and 4.5 releases. These include Spark integration, support for SELECT without FROM, client-side metrics per statement, and many new built-in functions.

Apache SAMOA (incubating) is a pluggable streaming machine learning library with backends for Storm, S4, and Samza. Version 0.3.0-incubating, released this week, is the first release within the incubator.

Amazon EMR release 4.0.0 includes updated versions of Hadoop, Hive, Pig, and Spark. The packaging is also now based on Apache Bigtop, which will make filesystem layout more familiar to users of other distributions.

Apache Zeppelin 0.5.0-incubating was released. This is the first release as part of the Apache incubator, and it includes support for Spark, Flink, Hive, and Tajo.


Curated by Datadog



Hands-on Introduction to Hadoop, HDFS, Hive, Ambari & Pig (Santa Clara) - Tuesday, August 4

HadoopSF August 2015 Meetup (San Francisco) - Wednesday, August 5


Couchbase, Kafka, Spark, Hadoop: Polyglot Persistence and the Big Data Pipeline (Tempe) - Wednesday, August 5


Apache Flink Crash Course (Chicago) - Tuesday, August 4


Apache Spark: What Is All the Hype About? (St Petersburg) - Thursday, August 6


Data Ingest and Processing: Spotlight on Streaming (Toronto) - Thursday, August 6


Distributed Stream Processing (London) - Wednesday, August 5


Data Infrastructure and Data Science Meetup (Ankara) - Wednesday, August 5


Apache Spark: Spark Streaming, DataFrames, Zeppelin and More (Tel Aviv-Yafo) - Monday, August 3


Tokyo Spark Meetup (Tokyo) - Thursday, August 6


Spark 1.4 Announcement + Spark Streaming Use Case + Tableau Spark Driver (Sydney) - Monday, August 3