Data Eng Weekly

Hadoop Weekly Issue #218

29 May 2017

Short and sweet issue this week, covering Spark's structured streaming, HDFS's new Maintenance State, data exploration tools at Stitch Fix, new products from Cloudera, MapR, and Databricks, and more.


Spark's structured streaming has a "ProcessingTime" trigger that will attempt to process new data at regular intervals (like cron). For a cluster that is elastic in size, this can save money by only bringing up the necessary resources when the trigger fires. With that said, jobs can still be stateful, and structured streaming has a few other features (such as bookkeeping of failures and table-level atomicity) that make it more attractive than a normal batch operation.
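As a rough illustration of the trigger described above, here is a minimal PySpark sketch. The helper name `start_periodic_query`, the directory arguments, and the one-hour default are all hypothetical; the text source and Parquet sink are chosen only to keep the example short, and it assumes an existing SparkSession is passed in.

```python
def start_periodic_query(spark, source_dir, sink_dir, checkpoint_dir, interval="1 hour"):
    """Start a streaming query that looks for new input once per interval.

    `spark` is an existing SparkSession. The helper name, the directories,
    and the default interval are placeholders for this sketch.
    """
    # The text file source yields a single string column named "value".
    lines = spark.readStream.format("text").load(source_dir)

    return (
        lines.writeStream
        .format("parquet")
        # The checkpoint directory is where the query tracks its progress,
        # so a restarted query resumes instead of reprocessing old input.
        .option("checkpointLocation", checkpoint_dir)
        # Fire a micro-batch on a fixed processing-time interval.
        .trigger(processingTime=interval)
        .start(sink_dir)
    )
```

Calling `start_periodic_query(spark, "/data/in", "/data/out", "/data/checkpoints")` returns the running query handle; the paths here are made up.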

The Cloudera blog has an overview of a new feature in HDFS called the "Maintenance State." Essentially, it provides a mechanism for temporarily removing nodes from the cluster without causing a replication storm (useful, for example, when patching an entire rack at a time). This feature requires a new "maintenance" file in a JSON-like format, because the dfs.hosts file format isn't rich enough. The post has more details on the implementation and how to use it (in CDH 5.11+, at least).
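To give a feel for what such a file might contain, here is a small Python sketch that builds per-host entries. The field names (`hostName`, `adminState`, `maintenanceExpireTimeInMS`) are assumptions based on the HDFS documentation and should be checked against the post and your HDFS version, as should the exact container format of the file. The hostnames and the helper `maintenance_entry` are hypothetical.

```python
import json
import time


def maintenance_entry(host, hours, now_ms=None):
    """Build one entry that keeps `host` in maintenance for `hours` hours.

    Field names are assumed, not taken from the newsletter.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "hostName": host,
        "adminState": "IN_MAINTENANCE",
        # Expiry is an absolute timestamp in milliseconds since the epoch.
        "maintenanceExpireTimeInMS": now_ms + hours * 60 * 60 * 1000,
    }


# Hypothetical hosts in the rack being patched; a fixed start time keeps
# the output reproducible.
rack_hosts = ["dn1.example.com", "dn2.example.com"]
entries = [maintenance_entry(h, hours=2, now_ms=1496016000000) for h in rack_hosts]

for entry in entries:
    print(json.dumps(entry, indent=2))
```

Giving each entry an explicit expiry means a forgotten node does not stay in maintenance indefinitely.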

The Algorithms & Analytics team at Stitch Fix has written about their data exploration tool, Dora. The system is backed by an Elasticsearch cluster, which Spark populates from data in S3.

Hadoop, Spark, and the broader ecosystem offer the ability to process complex data with nested structs, arrays, maps, and more. Support for this complex data is great in a programmatic setting, but it's trickier to use from SQL. This post looks at the TRANSFORM operation and other "Higher Order Functions" that have been added to Spark SQL. This feature is available in the Databricks 3.0 beta, and there's a JIRA ticket open (SPARK-19480) to add it to Spark core.
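The idea behind TRANSFORM can be shown with a plain-Python analogue: apply a function to every element of each row's array, without first exploding the array into separate rows. This is not Spark code; the sample rows, the table name `nested`, and the SQL in the comment are assumptions made for illustration.

```python
# Each row has a scalar key and a nested array, as in a table with
# schema (key INT, values ARRAY<INT>). The rows are made-up sample data.
rows = [
    {"key": 1, "values": [1, 2, 3]},
    {"key": 2, "values": [10, 20]},
    {"key": 3, "values": []},
]


def transform(array, fn):
    """Apply fn to each element of one row's array, keeping order and length."""
    return [fn(x) for x in array]


# Assumed SQL counterpart (verify the syntax against the post):
#   SELECT key, TRANSFORM(values, v -> v + 1) AS values_plus_one FROM nested
result = [
    {"key": r["key"], "values_plus_one": transform(r["values"], lambda v: v + 1)}
    for r in rows
]

for row in result:
    print(row)
```

Because the function runs inside each row, empty arrays stay as empty arrays and no regrouping step is needed afterwards.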

This post provides an overview and comparison of Kafka Connect and StreamSets Data Collector. Both tools are capable of shuffling data between systems, which is the main focus of the comparison.

In another comparison with Kafka, this post provides a high-level overview of the similarities and differences between Kafka and Amazon Kinesis. It primarily looks at the system level (primitives like topics, streams, partitions, and shards) and at getting data into the system.


Cloudera has announced their first hosted service, Cloudera Altus. It's a "Data Engineering service" that takes care of provisioning clusters and running jobs in an existing AWS account. The post has more details. At first glance, it resembles many other Hadoop-as-a-service offerings, so it'll be interesting to see where Cloudera tries to differentiate.

Databricks has announced the Databricks Runtime 3.0 beta. Based on Apache Spark 2.2.0 release candidates, it also includes improvements to S3 throughput, better performance, and support for transactional writes to S3.

The Apache Knox team disclosed CVE-2017-5646: "Apache Knox Impersonation Issue for WebHDFS." Users are encouraged to upgrade to Apache Knox 0.12.0.

ZDNet has coverage of MapR's new deep learning product, Quick Start Solution (QSS).


Apache NiFi 0.7.3 was released with reliability, performance, and other fixes.

Version 0.4.0 of Apache Arrow, the in-memory columnar data layer for a number of Hadoop ecosystem projects, was released. Highlights include a beefed-up JavaScript implementation, Python support on Windows, and more.


Curated by Datadog



Talend Presents Sensors, Spark and Kafka: Applied Machine Learning (Addison) - Tuesday, May 30


Tracking Trains in Real Time Using Stream Processing in Apache Kafka and Storm (Jacksonville) - Tuesday, May 30


Large-Scale Text Processing Pipeline With Spark ML and GraphFrames (Philadelphia) - Thursday, June 1


Toronto Apache Spark #20 (Toronto) - Wednesday, May 31


Using Apache NiFi to Empower Self-Organizing Teams (London) - Wednesday, May 31


Discover Khermes, an Open-Source & Distributed Data Generator for Apache Kafka (Madrid) - Thursday, June 1


Big Data Analytics (Kontich) - Wednesday, May 31


7th Recommender Systems Amsterdam Meetup (Amsterdam) - Tuesday, May 30


Our First Kafka Meetup with 2 Amazing Speakers From Confluent (Walldorf) - Tuesday, May 30


Apache Spark: A Unique Engine for Big Data Processing (Milan) - Thursday, June 1


DataCamp Vienna: Spring Edition (Vienna) - Tuesday, May 30


Building Streaming Data Pipelines (Budapest) - Wednesday, May 31


Workshop on Spark 2.x (Pune) - Saturday, June 3

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit