Data Eng Weekly

Hadoop Weekly Issue #180

24 July 2016

As stream processing continues to be a hot topic, Kafka is showing some maturity—there are two articles this week on Kafka security. In addition to stream processing, there's a good mix of content with articles on core Hadoop, Hive, and data infrastructure automation.


Datadog has a four-part blog post series on monitoring Hadoop. The first three parts are Datadog-agnostic and describe Hadoop architecture, important metrics for HDFS, MapReduce & YARN, and strategies for collecting metrics. This will likely prove to be a valuable guide for building out Hadoop monitoring and alerting infrastructure.

Heroku has written about their move to an asynchronous, Kafka-based integration pattern. They've built an HTTP proxy for Kafka, which, in addition to HTTP POST for publishing, supports consuming via websockets. The post has many more details about the rollout of this infrastructure component at Heroku.

In the latest in a series on the Altiscale blog about debugging Hadoop NodeGroup performance issues, this post gets to the bottom of the two problems previously discovered.

The Hortonworks blog has an overview of the various types of disaster recovery and backup support in HBase. Recently, the community has been working on incremental backup tools. The article describes several different backup targets—intra-cluster, inter-cluster, and S3/other long-term storage as well as the commands needed to perform incremental backups and restore from a backup. In terms of restoration strategies, there are several approaches (each with its own trade-offs).

WePay has another post about their BigQuery-powered data platform. This time they look at loading data into Google Cloud Storage and BigQuery from production MySQL databases as well as real-time writes using the streaming API. The post discusses several nuances of the process—handling of mutable data, data quality checks, permissions, service accounts, and automation.

The Confluent blog has an article that describes security features in Apache Kafka, with a focus on the Kafka Streams use case. Features include encryption-in-transit (both for the client-server and server-server communication) and client authentication/authorization. These settings are disabled by default, and the post has an example of configuring them for a Kafka Streams application.
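As a rough illustration of what enabling encryption-in-transit involves on the client side, here is a minimal sketch that assembles the standard Kafka client TLS settings and renders them as a properties file. The property keys are regular Kafka client configuration names; the helper functions, file paths, and passwords are hypothetical placeholders, and this is not the configuration from the Confluent post.

```python
def tls_client_settings(truststore, truststore_password,
                        keystore=None, keystore_password=None,
                        key_password=None):
    """Build Kafka client settings for TLS (hypothetical helper).

    The truststore lets the client verify the brokers. The keystore is
    only needed when brokers also require client certificates.
    """
    settings = {
        "security.protocol": "SSL",
        "ssl.truststore.location": truststore,
        "ssl.truststore.password": truststore_password,
    }
    if keystore is not None:
        settings.update({
            "ssl.keystore.location": keystore,
            "ssl.keystore.password": keystore_password,
            "ssl.key.password": key_password,
        })
    return settings


def to_properties(settings):
    """Render settings as key=value lines, sorted for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    # Placeholder path and password, for illustration only.
    print(to_properties(tls_client_settings(
        "/etc/kafka/client.truststore.jks", "changeit")))
```

The same keys can be passed to a Kafka Streams application through its configuration properties, since it uses the regular Kafka clients underneath.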

Many data processing environments start out as a set of cron jobs, but that's usually not a good long-term solution. This post describes the major problems with cron, and suggests some alternative systems that aim to solve these and other problems with job scheduling.
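One well-known limitation of cron is that it has no notion of dependencies between jobs, so ordering is usually approximated with staggered start times. The sketch below shows the core idea that dependency-aware schedulers build on: declare what each job depends on and derive a valid run order. The job names are hypothetical, and this only illustrates the ordering step (using `graphlib` from the Python 3.9+ standard library), not any particular scheduler from the post.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
JOBS = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dependencies):
    """Return job names ordered so every job follows its prerequisites.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in run_order(JOBS):
        print("would run:", job)
```

A real scheduler adds retries, alerting, and backfills on top of this ordering, which is where most of the operational value comes from.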

Hortonworks has written about the recently released Apache Hive 2.1. This is the first version with Hive's Live Long and Process (LLAP) support. In addition to LLAP, Hive 2.1 has smarter map joins, better vectorization, and a better cost-based optimizer. The post includes some benchmarks and related configuration tweaks needed to get the best performance for Hive.

In another Kafka security post, the IBM Hadoop Dev blog describes how to enable and configure Kerberos. In addition to configuration settings, there are examples of several admin functions (such as adding ACLs for a new user account).

MapR has a whiteboard walkthrough on Apache Flink's savepoints for stream processing. Savepoints address operational needs that commonly come up with stream processing frameworks, such as reprocessing and no-downtime upgrades. As usual, there's both a video and transcript of the presentation.

This presentation from Data Day Seattle gives an overview of Apache Airflow (incubating). After motivating Airflow and introducing its major features, the presentation describes use cases at Agari: 1) Message Scoring, which involves Spark, Amazon S3, and managing importers via AWS Auto Scaling Groups, and 2) Model Building, which is performed with Amazon EMR. The presentation also looks at SLAs for correctness and timeliness with Airflow.


Kafka Summit was a few months ago now, but this is a great summary of the conference themes, lessons learned, stream processing presentations, and more.

The MapR blog has a post that revisits, as we're half-way into 2016, some big data predictions made at the start of the year. Many of the predictions have come true, and I think it's interesting to see what missed (healthcare) and what wasn't anticipated (containerization).

Syncsort has another in its series of expert interviews, this time with Dr. Ellen Friedman, who is the author of "Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams." The three-part interview covers Hadoop in industry, big data stream processing, and more.

InfoQ has posted videos from QCon New York 2016. There are a number of relevant presentations, including those covering streaming data at Spotify and stream processing with Apache Kafka.

Splice Machine has open sourced their RDBMS built on Hadoop, HBase, and Spark. As part of the announcement, they've also provided the ability to launch Splice Machine in an AWS-powered sandbox.

Altiscale has announced that the Altiscale Data Cloud is now compliant with the ODPi Runtime Specification.


Apache Chukwa was one of the first log aggregation and analysis frameworks. Development stalled for some years, but the project has now seen two releases in the past 8 months. The 0.8.0 release has a new file format (based on Parquet), an improved HBase schema, and a number of bug fixes and improvements.

Apache HBase 1.2.2 was released this week. The maintenance release resolves a number of bugs.

Cloudera has announced Cloudera Enterprise 5.8. This release brings Cloudera Navigator Optimizer to general availability. It also features new versions of Impala and Hue. The Cloudera blog has a post on the release and the new optimizer.


Curated by Datadog



Big Data Application Meetup (Palo Alto) - Wednesday, July 27

Apache Ignite In-Memory Data Fabric for .NET (Mountain View) - Wednesday, July 27

Hands-On with Twitter Heron (San Francisco) - Saturday, July 30


Walking a Fine Line: Using Apache Spark and Cassandra (Portland) - Wednesday, July 27


Seattle Scalability Meetup (Seattle) - Wednesday, July 27

Spark at Zillow & Realtime Analytics: Spark, NiFi, Kafka, Cassandra, ES, Docker (Seattle) - Thursday, July 28


Databricks Community Edition: Spark 2.0 (Lehi) - Thursday, July 28


Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, July 25

North Carolina

July CHUG: Leveraging Mainframe Data in Hadoop (Charlotte) - Wednesday, July 27

New York

Apache Phoenix NYC Meetup (New York) - Monday, July 25

Combining Spark and Open Source Elements (New York) - Tuesday, July 26


Google Cloud Dataflow via Scio & Google Bigtable Learnings (Somerville) - Tuesday, July 26

Apache Phoenix and HBase: Past, Present, and Future of SQL Over HBase (Bedford) - Tuesday, July 26


Toronto Apache Spark #11 (Toronto) - Wednesday, July 27

Apache NiFi Presentation by Joe Witt (Toronto) - Thursday, July 28


Hive on ACID and V2 by Alberto Romero (Barcelona) - Thursday, July 28


Best of Criteo Labs Tech Talks (Paris) - Thursday, July 28


Building a Fully Automated Fast Data Platform (Munich) - Thursday, July 28


Apache Kafka Workshop (Cluj-Napoca) - Wednesday, July 27


Spark 2.0 101 + Spark on Knime (Sydney) - Wednesday, July 27