Data Eng Weekly

Hadoop Weekly Issue #246

07 January 2018

Lots of releases welcoming the new year, including from Apache Samza, Apache NiFi, and two new Kafka-related projects. In technical posts, there is great coverage of Apache Impala, Apache Spark + Kubernetes, Apache Pulsar (incubating), and the NATS Streaming distributed log implementation.


This quick post describes how to use Apache Zeppelin to query an Apache Cassandra cluster in an AWS VPC.

A few weeks ago, we had a post about how NATS Streaming stores data. This second post describes another key aspect of a distributed log: data replication, which is important for both durability and availability. The post describes Apache Kafka's strategy, which makes use of in-sync replica sets and Apache ZooKeeper, as well as how NATS Streaming implements data replication using the Raft consensus algorithm.
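To make the contrast concrete, here is a minimal Python sketch (not taken from the post) of how the two approaches decide which log position is safe to treat as committed. The function names and sample offsets are hypothetical. It assumes the commonly described rules: Kafka's high watermark is the smallest log end offset among the in-sync replicas, while a Raft leader commits the highest index stored on a majority of nodes. Raft's additional requirement that the entry belong to the leader's current term is left out for brevity.

```python
def isr_high_watermark(log_end_offsets, isr):
    """Kafka-style rule: the minimum log end offset across the in-sync replica set."""
    return min(log_end_offsets[replica] for replica in isr)


def raft_commit_index(match_index):
    """Raft-style rule: the highest index replicated on a majority of nodes.

    `match_index` includes the leader's own last index.
    """
    ordered = sorted(match_index.values(), reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is present on at least `majority` nodes.
    return ordered[majority - 1]


# Hypothetical cluster of three replicas; "c" is lagging behind.
offsets = {"a": 10, "b": 7, "c": 4}

hw_full_isr = isr_high_watermark(offsets, isr={"a", "b", "c"})
hw_shrunk_isr = isr_high_watermark(offsets, isr={"a", "b"})
commit = raft_commit_index(offsets)

print(hw_full_isr, hw_shrunk_isr, commit)
```

In this sketch a slow replica holds back the ISR-based watermark until it is dropped from the in-sync set, whereas the majority rule ignores the slowest minority of nodes by construction.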

Hue's create table wizard generates the metadata for a new table by analyzing file contents (and allowing for further customization).

This presentation has a very good overview of several of the major performance features of Apache Impala, including things like ordering scans, common join patterns, determining ideal join type and order, LLVM codegen for some operations (e.g. hash join, aggregation, sort), and runtime bloom filters.

This post describes how Spark's experimental support for Kubernetes, and its upcoming support for in-cluster client mode, are implemented. Specifically, it covers the Spark Driver, the Executor, the (optional) External Shuffle Service, and the Resource Staging Server.

Lightbend has open-sourced two Scala libraries for Kafka Streams. The first is a new Scala wrapper for the Kafka Streams DSL API, and the second implements an HTTP service for querying local state computed as part of a streams program. The latter includes a metadata service to enable queries across multiple partitions/services.

This post demonstrates Apache Pulsar's (incubating) Kafka compatibility with the use case of loading log data into Elasticsearch. Pulsar's Kafka bindings can be dropped in for Kafka Connect in addition to the base Producer and Consumer APIs.


The Hadoop ecosystem has gotten quite large—and vendor support varies by version. This post on the Gartner blog covers the latest version of several ecosystem projects (e.g. core Hadoop, Hive, Pig, ZooKeeper) across several vendors—AWS, Cloudera, Google, Hortonworks, and MapR.

The Roaring Elephant podcast covers Pulsar, Hadoop and Kerberos, Hadoop on Containers, and what's in the Hadoop 3.0 release.

This analysis of Hadoop 3.0 covers several of its new features: erasure coding, YARN scalability and reliability, and heterogeneous clusters. It also discusses upcoming support for long-running processes, Docker, and Kubernetes.
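For a sense of why erasure coding is a headline feature, the sketch below (not from the linked analysis) compares the raw storage overhead of 3x replication with a layout of 6 data blocks plus 3 parity blocks, the shape of the Reed-Solomon RS-6-3 policy in HDFS. The helper names are hypothetical, and the parity function uses plain XOR, a simplified single-parity stand-in that can rebuild only one lost block. It is not Reed-Solomon.

```python
def storage_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + parity_units) / data_units


replication_3x = storage_overhead(1, 2)  # one original plus two extra copies
rs_6_3 = storage_overhead(6, 3)          # six data blocks plus three parity blocks


def xor_parity(blocks):
    """XOR equal-length blocks together byte by byte."""
    parity = bytes(len(blocks[0]))  # starts as all zero bytes
    for block in blocks:
        parity = bytes(p ^ b for p, b in zip(parity, block))
    return parity


blocks = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(blocks)

# Pretend blocks[1] was lost: XOR of the survivors and the parity rebuilds it.
recovered = xor_parity([blocks[0], blocks[2], parity])

print(replication_3x, rs_6_3, recovered)
```

The trade-off is that rebuilding a lost block requires reading several surviving blocks and doing some computation, whereas replication only needs to copy one replica.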


Amazon EMR 5.11.0 was released, and it includes new versions of Spark and Hive. It also adds integration with Amazon SageMaker for machine learning pipelines.

Apache BookKeeper 4.6.0 was released. It includes performance improvements, a new API that uses the Builder pattern, a new admin / management API, and more.

Apache NiFi MiNiFi 0.3.0 adds support for running on Windows, core library upgrades, and more.

Apache MADlib (which enables big data machine learning from SQL) 1.13 was released with several bug fixes and new features including an implementation of HITS and improvements to k-Nearest Neighbors.

Version 1.7.1 of Apache Sentry was released with a security fix for a CVE.

Apache Samza 0.14.0 was released. Highlights include performance improvements (with RocksDB for local state, incremental checkpointing, and asynchronous requests to external services), a new API for complex stream processing, pluggable input/output systems, and many enhancements to make it easier to deploy large-scale clusters.

Scio 0.4.7 was released with several new features and bug fixes.

Blue Apron has open sourced their Protobuf converter plugin for Kafka Connect.

The Apache NiFi project also released the first version of a new component, the NiFi Registry, which is used to store and manage shared resources like flow definitions.

Formulation is a new library for working with Apache Avro from Scala.

There's a new Kafka Connect plugin for time series databases. It currently supports Graphite as a backend, but it aims to support additional backends.


Curated by Datadog



Intro to Azure Data Factory (Sunnyvale) - Thursday, January 11


George Trujillo: Designing the Next-Generation Data Lake (Austin) - Wednesday, January 10


KSQL: An Open Source Streaming SQL Engine for Apache Kafka (Tysons) - Wednesday, January 10


Women in Data Science (Philadelphia) - Tuesday, January 9


Apache Beam: Use Case in Finance + IO in Beam (London) - Thursday, January 11

Meteor-Proof Infrastructure + Bitcoin Meets Hadoop (Bristol) - Thursday, January 11


Data Engineering: Microservices with Apache Kafka and Reporting with Spark (Sydney) - Tuesday, January 9


Our First Apache Kafka Meetup in New Zealand (Wellington) - Thursday, January 11