Data Eng Weekly

Hadoop Weekly Issue #241

19 November 2017

Lots of new releases this week, including Kafka, Hadoop, Hive, and Phoenix. Also, Azure Databricks and Cassandra compatibility for Azure Cosmos DB are both in preview, and there are great technical articles covering Kafka, StreamSets, Redshift, and the Dist-Keras deep learning framework.


The Landoop blog has an overview of their web and API-based tool, Lenses, for exploring data in Kafka. Built around the Lenses SQL Engine, it detects data types from streams and supports real-time views, batch queries, functions, and "time traveling." There are tabular, tree, and raw views, and a Jupyter integration via the API.

The Qubole blog has a tutorial on using the Dist-Keras framework for deep learning as part of a Spark ML pipeline. While a small part of the post is Qubole-specific, most of it applies to anyone looking to use Spark for deep learning.

Pivotal writes about how they migrated data between GemFire clusters using Apache Kafka for replication. There are some interesting details, including how they avoided an infinite loop using a technique similar to reverse path forwarding.
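
As a rough illustration of the loop problem in two-way replication (this is not Pivotal's implementation), here is a minimal Python sketch. It uses plain origin tagging rather than a true reverse path forwarding check, and it assumes each event carries the ID of the cluster where it was first written. The `Event`, `Cluster`, and `replicate` names are hypothetical, and a plain list stands in for a Kafka topic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    key: str
    value: str
    origin: str  # assumption: ID of the cluster where the write first happened


@dataclass
class Cluster:
    name: str
    data: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)  # stands in for a Kafka topic

    def local_write(self, key, value):
        """A write made by a local client: store it and queue it for replication."""
        self.data[key] = value
        self.outbox.append(Event(key, value, origin=self.name))

    def apply_replicated(self, event):
        """Apply an event received from a peer. Returns True if it was applied."""
        if event.origin == self.name:
            # Our own write coming back around: drop it to break the loop.
            return False
        self.data[event.key] = event.value
        # Deliberately not added to the outbox: only locally originated
        # writes are forwarded, so the event cannot bounce back to its source.
        return True


def replicate(source, target):
    """Drain the source's outbox into the target; return how many were applied."""
    applied = 0
    while source.outbox:
        if target.apply_replicated(source.outbox.pop(0)):
            applied += 1
    return applied


if __name__ == "__main__":
    a, b = Cluster("a"), Cluster("b")
    a.local_write("user:1", "alice")
    print(replicate(a, b))  # events applied on b
    print(replicate(b, a))  # events echoed back to a
```

With two clusters, a write on one side is applied once on the other and is never echoed back to where it started.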

This post describes how to use the StreamSets Data Collector to save data to Amazon S3 and then load it into Snowflake.

A secure design for data access requires making tradeoffs. This post describes how to secure Amazon Redshift across multiple accounts; the setup has some apparent inconveniences, but these can be automated. The walk-through covers loading data via Apache Spark on Amazon EMR and shuffling data across accounts (by assuming roles with AWS STS).


Azure Cosmos DB has announced preview support for Cassandra API compatibility. The implementation aims to be wire compatible so that only a change to the connection string is needed.

Azure Databricks was announced this week. Currently in preview, it includes a number of Azure-specific optimizations and features.

Insight has announced that they're expanding their program to a third city. The inaugural class of fellows in Boston will start their program in April. Applications are open now.


Apache Hive 2.3.2 was released this week. It's a bug fix release covering several subcomponents, including the Hive metastore client and Kerberos.

Apache Phoenix 4.13.0 was released. It's compatible with HBase 0.98 and 1.3, and it includes improvements to collection of statistics, a critical bug fix for snapshot creation, and other bug fixes.

Apache ZooKeeper 3.4.11 is a new bug fix release with a number of fixes and improvements.

Confluent has released version 3.3.1 of their Confluent Platform distribution. There are a number of improvements to both the enterprise and open source versions (the latter is now based on Apache Kafka and librdkafka 0.11.1).

MockStreams is a library for unit testing applications built on Apache Kafka and Kafka Streams. Version 1.5.0 was released with compatibility for Apache Kafka 1.0.

Apache Hadoop 2.9.0 was released with a number of major new features and improvements across the Timeline Service, YARN Federation, the YARN Web UI, HDFS, and the CapacityScheduler API.

A bug fix release of Apache Kafka has come out. There are some important fixes, including one for data loss.


Curated by Datadog



Developing Java Streaming Applications (Atlanta) - Tuesday, November 21

IRELAND

Leveraging Apache Kafka for Web Crawling and Data Processing (Cork) - Tuesday, November 21


The First Beam Stockholm (Stockholm) - Wednesday, November 22


IBM BigSQL, 5 Years with Hadoop and HDF (Madrid) - Wednesday, November 22


Apache Kafka Goes 1.0 (Paris) - Tuesday, November 21


Mini-Batch Processing with Spark Streaming (Nieuwegein) - Wednesday, November 22

RUSSIA

What’s New with Hadoop 3.0, Greenplum 5.0, Hive 2? (Moscow) - Wednesday, November 22


Recent Apache Hive Enhancements Powering Enterprise Data Analytics (Bangalore) - Wednesday, November 22

Data @ Scale Using Apache Spark (Bangalore) - Saturday, November 25