Data Eng Weekly

Hadoop Weekly Issue #198

02 January 2017

Lots of great content from the last two weeks, including a few year in review posts and new releases of Apache Spark, Apache Kafka, and Apache Flink. In technical posts, there is a bunch of variety, from HDFS to SparkR, including a great post on running Spark on YARN in production.


StreamSets is a tool in a similar space to Apache NiFi (and Apache Flume for some use cases) that has been gaining steam in the past year or so. This post shows how to build a custom origin for StreamSets Data Collector. The process is relatively straightforward, which makes it a powerful system for hooking into custom data sources.

Hortonworks has written about an initiative to bring fine-grained access control (i.e. column-level) to Apache Spark by having it read data through the Live Long and Prosper (LLAP) daemons from Apache Hive. LLAP is integrated with Apache Ranger, which enforces the authorization on read. A post on the Hortonworks blog has more details on how this works and a link to a knowledge base article that details how to enable it for HDP 2.5.3.

The Cloudera blog has a post about how the HDFS DataNode detects file corruption and disk failure by performing periodic scans. These include block scans (to detect corruption), directory scans (to reconcile in-memory block metadata with the blocks actually on disk), and disk checks (to verify permissions and other basic sanity checks).
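The core idea behind a block scan is simple: recompute a checksum over the block's bytes and compare it to the stored value. Below is a toy illustration of that check in Python; HDFS actually stores per-chunk CRC32C checksums in separate `.meta` files, so this is a simplification of the concept, not the DataNode's implementation.

```python
import zlib

def scan_block(block_data: bytes, stored_checksum: int) -> bool:
    """Toy block scan: recompute a checksum over the block and compare
    it to the stored value; a mismatch signals corruption. HDFS uses
    per-chunk CRC32C checksums rather than one CRC over the block."""
    return zlib.crc32(block_data) == stored_checksum

# A healthy block passes; a bit-flipped copy is flagged as corrupt.
block = b"example block contents"
checksum = zlib.crc32(block)
corrupted = b"exbmple block contents"
```

On a real DataNode this kind of check runs continuously in the background at a throttled rate, so corruption is found before a client ever reads the bad replica.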

Inovex has shared lessons learned from running Spark streaming on YARN for over a year. Their streaming application loads data from a JMS message queue, joins with lookup data in HBase, and writes out to HBase using bulk puts. Topics covered include configuration settings (for YARN and Spark), handling of backpressure, deployments (using a marker file in HDFS with checkpointing), monitoring (including a set of custom metrics), and logging.
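Two of those patterns can be sketched concretely. The configuration keys below are standard Spark settings for backpressure and long-running YARN applications, but the values are illustrative placeholders, not Inovex's actual tuning; the marker-file check is likewise a hypothetical helper showing the graceful-shutdown pattern, with the HDFS existence check abstracted behind a callable.

```python
# Illustrative settings for a long-running Spark streaming job on YARN.
# The keys are real Spark configuration properties; the values are
# placeholders to be tuned per workload.
streaming_conf = {
    # Let Spark adapt the ingestion rate to the processing speed.
    "spark.streaming.backpressure.enabled": "true",
    # Cap the rate for the first batches after a restart, which may
    # replay checkpointed data.
    "spark.streaming.backpressure.initialRate": "1000",
    # Allow the YARN ApplicationMaster to restart on failure, only
    # counting attempts within a validity window.
    "spark.yarn.maxAppAttempts": "4",
    "spark.yarn.am.attemptFailuresValidityInterval": "1h",
}

def shutdown_requested(marker_path: str, exists) -> bool:
    """Graceful-shutdown pattern: an operator places a marker file in
    HDFS, and the driver polls for it between batches, stopping the
    streaming context gracefully when it appears. `exists` stands in
    for an HDFS existence check (e.g. via an HDFS client library)."""
    return exists(marker_path)
```

When the marker is found, the driver would stop the streaming context gracefully so in-flight batches finish before checkpoint state is left behind.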

Amazon EMR now supports CloudWatch events for various state changes in an EMR cluster. Coupled with SNS, Lambda, SQS, or another integration, this can be used to programmatically respond to failure, auto-scaling events, or more.
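The Lambda side of that integration is a small event handler. The sketch below assumes the documented "EMR Cluster State Change" detail type, with `clusterId` and `state` fields in the event detail; field names should be verified against the events your account actually emits.

```python
# Sketch of a Lambda handler reacting to an EMR cluster state-change
# event delivered via CloudWatch Events. The event shape (detail-type
# "EMR Cluster State Change" with clusterId/state in "detail") is an
# assumption to verify against real events.
def handler(event, context=None):
    detail = event.get("detail", {})
    state = detail.get("state")
    cluster_id = detail.get("clusterId")
    if state == "TERMINATED_WITH_ERRORS":
        # In a real deployment this might publish to SNS, page an
        # on-call rotation, or relaunch the cluster.
        return {"action": "alert", "clusterId": cluster_id}
    return {"action": "ignore", "clusterId": cluster_id}
```

The same pattern applies to instance-group and auto-scaling events, with a different detail type and fields.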

Using an EMR bootstrap action, a new cluster can come up with JupyterHub (Jupyter with multi-user support), which supports R, Python, Spark, and much more. The AWS Big Data blog has a bunch of examples of using Jupyter with several Hadoop ecosystem components (including pyhive for Hive/Presto and SparkSQL).

Apache Apex has several mechanisms to monitor and debug a running streaming program. Among these is the ability to generate a stack trace of the application, which can be triggered from the command-line or the web interface.

The upcoming Hue 3.12 will include improvements to SQL support (formatting, SQL autocomplete, data preview popups), email notifications, and more. Read about the upcoming features in the three following blog posts.

This post, in notebook format, provides background aimed at someone with an R background who is getting started with SparkR. It provides details on the programming model, data model, execution model, and some of the machine learning and other APIs.

As someone who's been through multiple Hadoop upgrades, I can sympathize with this analysis of the effort and risk involved in major system upgrades. It's a good analysis and describes some steps that vendors can take to get customers to upgrade sooner rather than later.


Big Data Tech Warsaw takes place on February 9th. This post has a description of what to expect, including a sample of the 25+ talks.

Hortonworks recently unveiled Hortonworks Data Cloud for AWS. This post addresses some of the top questions about the service, including how it is purchased through the AWS marketplace, how to get data into a cluster, and the types of workloads it's targeting (for now, ephemeral clusters rather than highly available long-running ones).

InfoWorld has an interview with Hortonworks CTO Scott Gnau. Among the topics covered are HDP's release strategy and cadence and the importance of data-in-motion to the industry's future.

There are several year-end posts from this week. Hortonworks and Databricks cover the top posts on their blogs from the year, and Flink has a year in review of community growth and new features as well as a look ahead to what's underway for 2017.


IBM released the December refresh of Big SQL 4.2. It includes a number of fixes to the Hadoop integration.

Databricks has announced that their deep learning and GPU support has hit GA. The announcement includes a tutorial and example notebook for doing image classification using the Amazon P2 instance types on a Databricks cluster.

Cask's CDAP version 4 was released. The announcement includes an overview of the major features, which include an updated UX, an app store, and new platform features and improvements.

A new version of Apache Kafka was released with a number of bug fixes and a few minor improvements.

Apache Flink 1.1.4 was released. It includes a number of bug fixes and improvements (over 75 resolved issues). The release page has more details, including a note to users using RocksDB.

Apache Spark 2.1.0 was released with event time watermarks (for stream processing) and support for Apache Kafka 0.10. There were over 1200 issues resolved, including changes to SparkSQL, MLlib, SparkR, and GraphX.
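An event-time watermark bounds how late data can arrive before Spark drops the aggregation state it would have updated: the watermark trails the maximum event time seen so far by a configured delay. The toy function below illustrates that rule in plain Python; Spark exposes it through `Dataset.withWatermark`, and this sketch is a simplification of the idea, not Spark's API or implementation.

```python
from datetime import datetime, timedelta

def filter_late_events(events, watermark_delay):
    """Toy event-time watermark: the watermark is the maximum event
    time seen so far minus a delay, and events older than the
    watermark are considered too late and dropped."""
    max_seen = max(t for t, _ in events)
    watermark = max_seen - watermark_delay
    return [(t, v) for t, v in events if t >= watermark]

events = [
    (datetime(2017, 1, 2, 12, 0), "a"),
    (datetime(2017, 1, 2, 12, 20), "b"),
    (datetime(2017, 1, 2, 11, 40), "late"),  # > 30 min behind max seen
]
kept = filter_late_events(events, timedelta(minutes=30))
```

With a 30-minute watermark and a maximum seen event time of 12:20, the 11:40 event falls behind the 11:50 watermark and is dropped, while the other two are kept.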

Version 0.11.0 of the Apache Knox gateway was released. This release adds a basic admin UI, metrics collection, and support for new systems.

Apache NiFi announced a new bug fix release, version 1.1.1.

multi-hbase is a prototype client library to support both the 0.94 and 1.2 branches of Apache HBase.

BigDL is a distributed deep learning library open sourced by Intel. By leveraging the Intel Math Kernel Library, it claims massive speedups over other implementations.


Curated by Datadog.



Fault Tolerance in Spark: Lessons Learned from Production (Austin) - Thursday, January 5


Introduction to Apache Kudu (Saint Louis) - Wednesday, January 4


Big Data Meetup 23 (Mississauga) - Sunday, January 8


A Deep Dive Into Structured Streaming/Predictive Analytics with SparkR (Istanbul) - Saturday, January 7


Meet Hadoop Family: Season 1 (Jakarta) - Thursday, January 5