Data Eng Weekly

Hadoop Weekly Issue #187

18 September 2016

There's a lot of variety in this week's issue—Kafka, NiFi, Spark, HDFS, Impala and more are all covered in technical articles. And there are a number of releases including a new distributed messaging system, Pulsar, which was recently open-sourced by Yahoo.


This post describes how to monitor Apache Kafka with Grafana and InfluxDB. jmxtrans is used to deliver metrics from Kafka to Influx.

The Hortonworks blog has an introduction to the new zero-master clustering that is part of Apache NiFi 1.0.0. Built on Apache ZooKeeper, the clustering mechanism creates a Cluster Coordinator from a pool of servers rather than explicitly designating a master.

The Cloudera blog has a post that demonstrates how to use Apache Spark for text analysis. It looks at letter frequencies in an encoded message and at mechanisms for decoding it.
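For readers unfamiliar with frequency analysis, here is a minimal sketch of the letter-counting step. It is plain Python rather than Spark, and it is not the code from the Cloudera post; the function name `letter_frequencies` is made up for illustration.

```python
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each ASCII letter in text.

    Hypothetical helper for illustration only; case is ignored and
    non-letter characters are skipped.
    """
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.items()}


if __name__ == "__main__":
    freqs = letter_frequencies("Hello, World")
    # Print letters from most to least frequent.
    for letter, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
        print(letter, round(share, 2))
```

On a cluster, the same count would typically be expressed as a flatMap over characters followed by a reduceByKey, but the idea is the same.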

The Altiscale blog describes the architecture behind the new and updated HDFS web UI. In addition to a modern look and feel, the browser-based application is powered by REST APIs rather than JSP.

This tutorial describes how to use MySQL triggers with Apache NiFi to capture inserts and updates to rows in a MySQL table. With NiFi, that information is sent to Apache Hive for analysis.

The AWS big data blog has a post describing key concepts in Kinesis SQL analytics—particularly windowing and time semantics. The post also includes a tutorial for running a Kinesis analytics application that outputs results to Redshift.
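As a rough illustration of what windowing means, the sketch below groups timestamped events into fixed, non-overlapping (tumbling) windows and counts them. It is plain Python, not Kinesis SQL and not taken from the AWS post; the function name and the `(timestamp, payload)` event shape are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per tumbling window.

    events: iterable of (timestamp_in_seconds, payload) pairs (assumed shape).
    Returns a dict mapping each window's start time to its event count.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Each event belongs to exactly one window: the one starting at
        # the largest multiple of window_seconds not exceeding its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [(0, "a"), (5, "b"), (10, "c"), (19, "d"), (20, "e")]
    print(tumbling_window_counts(sample, 10))
```

Which timestamp is used (when the event occurred versus when it was processed) changes which window an event lands in, which is why time semantics matter alongside windowing.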

Cloudera has posted an article analyzing the speed and cost of Apache Impala (incubating) queries over data stored on EBS volumes vs. S3. In short, EBS is more performant (2.4x in Cloudera's testing) and cheaper (although not by much), but S3 remains a good option for ad hoc analysis using transient clusters.

Apache Phoenix, the SQL engine for Apache HBase, includes a number of new features in version 4.8.0. Released just over a month ago, the new version adds support for namespace mapping and a new integration with Apache Hive.

Confluent's Log Compaction series has news on upcoming Apache Kafka releases (including details on a new time-based release plan). In addition, there are links to several recent presentations and articles on Kafka.


Hivemall, a machine learning library for Apache Hive, has entered the Apache Incubator.

"Fast Data Architectures for Stream Processing Applications" is a free (behind an email-wall) O'Reilly report on the state of stream processing.


Pulsar is a recently open-sourced distributed messaging system from Yahoo. The post looks at its architecture and performance goals/characteristics. Pulsar is widely used at Yahoo: over 100 billion messages/day are published to its brokers across clusters in over 10 data centers.

Altiscale has announced support for Apache Spark 2.0.

pig-eclipse, the Eclipse plugin for Apache Pig, has announced a 1.1.2 release, which adds support for Apache Pig 0.15 and 0.16, information when hovering over Pig keywords, and a bug fix for auto-complete.

Hortonworks DataFlow 2.0 is now generally available. The post highlights a number of improvements in the release, which is based on Apache NiFi 1.0.

Apache HBase 1.2.3 was released. The maintenance release resolves 49 issues.

Apache Storm 0.10.2 was released to address several issues in the 0.10.x line.


Curated by Datadog



#OCBigData Meetup #19 (Irvine) - Wednesday, September 21

Talk Night at Carbon Five LA (Santa Monica) - Wednesday, September 21

Airflow Meetup (San Francisco) - Wednesday, September 21

Data Science at Scale with HAWQ and MADlib and Hadoop (San Francisco) - Wednesday, September 21

Uber Engineering Tech Talk Series (San Francisco) - Thursday, September 22

Data Science at Scale with HAWQ and MADlib and Hadoop (Santa Clara) - Thursday, September 22


IoT with Spark & Scala Spark Datasets (Seattle) - Thursday, September 22


Integrating Real-time Data Streams with Spark & Kafka (Plano) - Tuesday, September 20


Apache Kudu and Watson Analytics (Green Bay) - Tuesday, September 20


Apache Drill: Just Query It (Columbus) - Thursday, September 22


Deep Dive with Spark Contributor Chris Fregly (Atlanta) - Thursday, September 22


Kick Off the Fall with DC Spark! (McLean) - Thursday, September 22

New York

Open House: Big Data Processing with Apache Spark (New York) - Monday, September 19

Twitter Heron in Practice (New York) - Thursday, September 22


Streaming in the Time of Spark 2.0 (Montreal) - Wednesday, September 21


Big Data Architecture Patterns on Google Cloud Platform (Reading) - Wednesday, September 21

Fast Data with Apache Ignite & Apache Spark (London) - Thursday, September 22


How-to Spark (Hamburg) - Wednesday, September 21

September Meet-up (Karlsruhe) - Thursday, September 22


First Data: The Big Data Journey of the World’s Biggest Payment Processor (Prague) - Wednesday, September 21


Apache Spark Streaming with Apache NiFi and Apache Kafka (Warsaw) - Wednesday, September 21


Interactive Spark Scheduling with Azkaban (Bangalore) - Saturday, September 24


Shanghai Big Data Streaming (Shanghai) - Saturday, September 24


Spark Meetup September (Sydney) - Wednesday, September 21