Data Eng Weekly

Hadoop Weekly Issue #118

26 April 2015

This week's issue features great technical content covering a breadth of ecosystem topics, from Accumulo to HBase to Impala to integrating RDBMSes with Kafka to YARN (and more). News coverage has discussions of two contentious topics: the legitimacy of Spark and the positioning of the Open Data Platform. Finally, it was a busy week of releases: Apache Hadoop 2.7.0, CDH 5.4, and WANdisco Fusion (a new product) were all announced this week.


When it comes to Hadoop, a simple integration can become tricky due to compatibility, classpath woes, and more. This post describes how to write data to HBase from Scalding, and it details several issues encountered while testing the integration (and how to fix them).

The 1.7 release of Apache Accumulo will include a new cross-cluster replication feature for multi-datacenter deployments. A post on the Apache blog describes the architecture, how it's implemented on leader and follower clusters, and how circular replication is possible.

This presentation from the recent Hadoop Summit Europe gives an update on the state of YARN and future plans. There's a discussion of key features of the Apache Hadoop 2.6 release: rolling upgrades, long-running services, scheduling (including node labels), and the Application History and Timeline Service. For future plans, the presentation discusses improving the scaling and features of the Timeline Service (including storing data in HBase), new scheduling features, better support for containerized applications, and support for disk and network resource types.

This post shows how to integrate Elixir, which is a dynamic, functional language for the Erlang VM, with Apache Kafka. The examples leverage the kafka_ex client library.

The morning paper covered "Making Sense of Performance in Data Analytics Frameworks" this week. The paper's authors set out to verify and quantify the effects of network I/O, disk I/O, and straggler tasks on Spark jobs run against the BDBench workload, the TPC-DS workload, and a production workflow from a Databricks Spark cluster. Their findings show that many jobs have become CPU-bound rather than I/O-bound. This is because (among other reasons) data is often compressed to trade CPU time for I/O, and Java requires deserialization from byte buffers to Java objects.

The Cask blog has a post providing a brief introduction to the Capacity Scheduler in YARN. It describes how to configure the Capacity Scheduler for two equally-weighted queues, and how to use queues as part of the Cask Data Application Platform.
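As a rough illustration of the two-queue setup described above, the following Python sketch generates the relevant capacity-scheduler.xml properties. The property names follow the standard yarn.scheduler.capacity.root.* scheme; the queue names "etl" and "adhoc" and both helper functions are hypothetical and are not taken from the Cask post.

```python
import xml.etree.ElementTree as ET


def capacity_scheduler_properties(queues):
    """Build Capacity Scheduler properties for top-level queues under root.

    `queues` maps a queue name to its capacity as a percentage of the cluster.
    """
    total = sum(queues.values())
    if abs(total - 100) > 1e-9:
        # Sibling queue capacities are expected to add up to 100 percent.
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, capacity in queues.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(capacity)
    return props


def to_configuration_xml(props):
    """Render the properties in the Hadoop <configuration> XML layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical queue names, each given half of the cluster.
props = capacity_scheduler_properties({"etl": 50, "adhoc": 50})
xml_text = to_configuration_xml(props)
print(xml_text)
```

The generated XML would go into capacity-scheduler.xml; a real deployment typically sets further per-queue properties (for example maximum capacity and access control lists) that this sketch leaves out.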

The Databricks blog has a post showing how to use pyspark to analyze Apache web server access logs. Example queries show how to compute the average content size and the frequency of each response code. The post also shows a feature of the Databricks cloud, display(), to graph a DataFrame inline in the notebook.
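To give a feel for the two aggregations mentioned above, here is a minimal sketch in plain Python rather than pyspark, so it runs without a cluster. The regular expression assumes the Apache Common Log Format, and the sample log lines are made up for illustration; none of this code comes from the Databricks post.

```python
import re
from collections import Counter

# Apache Common Log Format: host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
)


def parse_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group(7)
    return {
        "host": match.group(1),
        "status": int(match.group(6)),
        # Apache logs "-" when no body was sent; treat that as zero bytes.
        "content_size": 0 if size == "-" else int(size),
    }


# Made-up sample data, for illustration only.
SAMPLE_LOGS = [
    '127.0.0.1 - - [26/Apr/2015:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 1000',
    '127.0.0.1 - - [26/Apr/2015:10:00:01 -0700] "GET /missing HTTP/1.1" 404 200',
    '10.0.0.2 - - [26/Apr/2015:10:00:02 -0700] "GET /about.html HTTP/1.1" 200 3000',
    '10.0.0.2 - - [26/Apr/2015:10:00:03 -0700] "HEAD /index.html HTTP/1.1" 304 -',
]

records = [r for r in map(parse_line, SAMPLE_LOGS) if r is not None]

# Average content size across all parsed requests.
average_size = sum(r["content_size"] for r in records) / len(records)

# Frequency of each HTTP response code.
status_counts = Counter(r["status"] for r in records)

print(average_size)
print(status_counts.most_common())
```

In pyspark the same parse function would typically be mapped over an RDD or DataFrame of log lines, with the average and the per-status counts computed by the cluster instead of in local memory.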

INDREX is a system for extracting relations from text data to provide querying capabilities via Cloudera Impala. A post on the Cloudera blog describes the system, the querying semantics, and some experimental results. When using Impala 1.2.3 instead of Pig 0.12, the system is almost two orders of magnitude faster at text mining query workloads.

Bottled Water is a new tool for capturing the state of a Postgres database as a stream of Avro records in Kafka. It consists of a Postgres extension implementing the logical decoding output plugin API and a client for forwarding data to Kafka. The post describes the architecture of the system, details several decisions (such as using Avro and Kafka), and describes several use cases. The code for the project is available on GitHub.

A post on the Apache blog describes join support in Apache Phoenix (the SQL-for-HBase engine), which includes enough support to run many of the TPC queries. Examples of supported joins include derived tables, correlated sub-queries, semi/anti joins, and union all.

If you're not using Postgres, then you can't use Bottled Water (see above) to move data from an RDBMS to Kafka. This post has an overview of the popular solutions for moving data (not necessarily in a streaming fashion a la Bottled Water) from other RDBMSes to Kafka.


PepperData, makers of software for optimizing Hadoop clusters, announced that they've secured $15 million in strategic and venture financing.

MapR launched free, on-demand training for folks working with Hadoop earlier this year. This week, they announced the availability of a new course, "Apache Drill Essentials." The course is aimed at developers and business analysts working with SQL.

This post examines criticism of Spark and Databricks, which it groups into three different categories. These are Hadoop Purism (Spark competes with Hadoop), Backseat Driving [of Databricks], and FUD (Spark is not enterprise-ready). The arguments seem to have some merit, and the conclusion of "Spark is Too Big to Fail" certainly does, too.

The Open Data Platform (ODP) is in the news again this week because MapR and Cloudera have both published blog posts about why they're not joining. Datanami has coverage of opinions on both sides of the argument, with WANdisco describing why they joined the ODP.

This week's O'Reilly Data Show podcast features an interview with Michael Stack about Apache HBase. There's a partial transcript of the interview on the website. One of the interesting take-aways is that HBase is now seeing contributions from folks who worked on BigTable (which is the inspiration for HBase) at Google.

Cloudera has announced a new training course for Cloudera Search. It's a three-day training focussing on ingestion, indexing, and querying for developers and analysts.

HBaseCon is in just under two weeks in San Francisco. The Cloudera blog has a sneak peek at the ecosystem track, which has talks covering topics like Apache Phoenix, integrating HBase and Hive, and Apache Kylin (an OLAP engine atop HBase).


Apache Sentry 1.5.0-incubating was recently released. New features include column-level access control, high-availability, and more granular privileges (e.g. for CREATE, DROP, INDEX, LOCK).

WANdisco has announced WANdisco Fusion (WDF). WDF acts as a proxy in front of a Hadoop cluster to provide active-active replication. Unlike the previous version, which ran alongside the NameNode, WDF supports file systems other than HDFS (including EMC Isilon, MapR, and Amazon S3). Datanami has a lot more details on the new product.

Apache Hadoop 2.7.0 was released this week. The new version is not yet considered production-ready, but it has a number of important changes and new features. These include dropping JDK 6 support, support for Windows Azure Storage, file truncation in HDFS, support for variable-length blocks in HDFS, pluggable YARN authorization, a speedup in the FileOutputCommitter for large jobs, and a new nntop tool providing top-like information about the NameNode.

Cloudera has released Cloudera Enterprise 5.4. It includes new versions of several components, including Apache Spark 1.3, Impala 2.2, and Apache HBase 1.0. The main improvements in the release are in security (e.g. SSL and Kerberos support for the Apache Flume Thrift source/sink, cluster-wide redaction of sensitive data in logs), performance (e.g. a beta of Hive-on-Spark, MultiWAL for HBase RegionServers), data management and governance (mostly in Cloudera Navigator), and more. In addition, Impala 2.2 has some long-sought features like read-only support for data stored in S3 and support for nested data types in Parquet files.

Hue 3.8 has been released with a number of enhancements and new features. These include a new Spark REST Job Server, a new Oozie Editor, performance improvements, improvements to search (2D maps, import/export dashboard, and more), security improvements, and a Spark Notebook (beta). The Hue team has written another post with details on the new Spark Notebook application.

KafkaTool is a new cross-platform UI for Kafka. It provides the ability to view Kafka clusters, contents of partitions and messages, offsets of Kafka consumers, and more.


Curated by Datadog



Big Data Monitoring (Mountain View) - Wednesday, April 29

Managing Resources Seamlessly with YARN, Mesos & Myriad (San Ramon) - Wednesday, April 29

Spark Introduction (Glendale) - Wednesday, April 29

Scaling with Couchbase, Kafka, and Apache Spark (Culver City) - Thursday, April 30

HBase Meetup at MachineZone (Palo Alto) - Thursday, April 30


Simplify Your Architecture: Say No to Lambda, Presented by VoltDB (Portland) - Tuesday, April 28

North Carolina

Modern Data Integration: Paradigm Shift (Charlotte) - Wednesday, April 29


Long-Lived Spark Applications (Vienna) - Monday, April 27


Special Presentation Night: Spark Under the Hood (Cambridge) - Tuesday, April 28


Elastic Analytics with Spark, Mesos and Docker by Brenden Matthews (London) - Thursday, April 30


Apache Spark: Show and Tell #1 (Stockholm) - Tuesday, April 28


Hadoop & Security (Paris) - Wednesday, April 29


Rstats + Azure HDInsight + Azure ML (Copenhagen) - Monday, April 27


Hot Topics in Dataflow, Flink's Runtime, & Community Update (Berlin) - Wednesday, April 29


Hadoop + SAS (Warsaw) - Wednesday, April 29

Apache Storm: Real-Time Stream Processing (Gdansk) - Thursday, April 30


Real-Time Streaming with Apache Spark Streaming and Apache Storm (Zagreb) - Monday, April 27


NoSQL Cutting through the Hype & Pervasive Analytics: Building a Data Strategy (Singapore) - Monday, April 27


Hive on Spark + Hands-on Tutorial (Sydney) - Thursday, April 30
