Data Eng Weekly

Hadoop Weekly Issue #245

24 December 2017

I debated skipping this week's issue, but there turned out to be a lot of great articles to share. Among them (as is normal for this time of year) are a couple of year-in-review posts. There are also quite a few great technical posts on Spark, Kafka, LinkedIn's Venice, the YARN capacity scheduler, and more. In releases, Pulsar, HBase, Ampool, and KSQL all unveiled new versions.


CockroachDB uses multiversion concurrency control (MVCC) for concurrent access to data, and a transaction queue provides additional support for concurrent transactions. This post demonstrates these concepts with an easy-to-follow set of diagrams.
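
As a rough illustration of the MVCC idea (not CockroachDB's actual implementation, and leaving out the transaction queue entirely), the sketch below keeps every version of a key with a timestamp and serves each read from the newest version at or before the reader's timestamp. The class name MVCCStore and its methods are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


class MVCCStore:
    """Toy multiversion store: each write adds a new timestamped version."""

    def __init__(self):
        # key -> parallel lists of timestamps (sorted) and values
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, value, ts):
        """Record a new version of key at timestamp ts."""
        # Assumption for this sketch: timestamps strictly increase per key.
        if self._timestamps[key] and ts <= self._timestamps[key][-1]:
            raise ValueError("timestamps must increase per key in this sketch")
        self._timestamps[key].append(ts)
        self._values[key].append(value)

    def read(self, key, ts):
        """Return the newest value written at or before ts, or None."""
        idx = bisect_right(self._timestamps[key], ts)
        return self._values[key][idx - 1] if idx else None


store = MVCCStore()
store.write("balance", 100, ts=1)
store.write("balance", 80, ts=5)
old_view = store.read("balance", ts=3)   # 100: the write at ts=5 is not visible
new_view = store.read("balance", ts=5)   # 80
```

Because old versions are never overwritten, a reader at timestamp 3 is unaffected by the later write, which is what lets readers and writers proceed concurrently.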

Spark can export metrics, which, with a little bit of manipulation, can be stored in Prometheus for monitoring. This post describes the steps to get the integration set up.

This holiday-themed demo of Apache Pulsar (incubating) consists of a scalable stream processing application to help Santa build LEGO toys based on incoming email requests.

Impala can now take advantage of column statistics when scanning data stored in Parquet files. This post describes how it uses the min and max values, as well as information stored in dictionaries, to skip entire blocks of data during a query. There are a few considerations when loading your data, which the post also describes.
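
The min/max part of this technique can be sketched in a few lines. This is a simplified illustration rather than Impala's code: RowGroup and scan_equal are hypothetical names, values are plain integers, and dictionary-based filtering is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a block of rows with column statistics."""
    min_value: int
    max_value: int
    values: List[int]


def scan_equal(row_groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # If target lies outside [min, max], no row in this group can match,
        # so the whole group is skipped without reading its values.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_equal(groups, 15)  # ([15], 1): two groups skipped
```

Skipping works best when each block covers a narrow range of values, for example when data is sorted on the filter column at load time. If every block spans the full range, nothing can be skipped.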

It's possible to run Apache Cassandra inside of Docker, but there are a few considerations. This post outlines some best practices and solutions to some common challenges.

Venice is a newer system to replace Voldemort for serving key-value data at LinkedIn. It ingests data in batch via Kafka, which is the focus of this post. In addition, Venice supports importing real-time data to implement the lambda architecture. The post describes some of the considerations for that, too.

This post describes the role that a streaming system, like Apache Kafka, can play in a microservices architecture. It argues that leveraging a streaming system can resolve some of the problems that arise from the large amounts of data and interconnectivity in a microservices architecture.

One of the components of the NATS project is a distributed log similar to Kafka. This post, which is the first in the series, looks at the requirements and tradeoffs to consider in the data storage component of a Kafka-like system. NATS is open-source and written in Go.

The Hortonworks blog has a thorough overview of the YARN capacity scheduler. It describes hierarchical queues, several queue archetypes (including ad-hoc, batch, exploration, and always-on), CPU scheduling, preemption, and more.
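
To make the hierarchical queue idea concrete: in the capacity scheduler, each queue's configured capacity is a percentage of its parent, so a leaf queue's guaranteed share of the cluster is the product of the percentages along its path. The sketch below computes those shares for a made-up queue tree. The dict layout, the function name absolute_capacities, and the queue names and percentages are all assumptions for illustration, not YARN configuration syntax.

```python
def absolute_capacities(queue, parent_share=1.0, path=""):
    """Flatten a queue tree into {queue path: fraction of the whole cluster}."""
    name = f"{path}.{queue['name']}" if path else queue["name"]
    # A queue's capacity is expressed as a percentage of its parent's share.
    share = parent_share * queue["capacity"] / 100.0
    result = {name: share}
    for child in queue.get("children", []):
        result.update(absolute_capacities(child, share, name))
    return result


# Hypothetical queue tree; capacities are percentages of the parent queue.
tree = {
    "name": "root",
    "capacity": 100,
    "children": [
        {
            "name": "batch",
            "capacity": 60,
            "children": [
                {"name": "etl", "capacity": 50},
                {"name": "reports", "capacity": 50},
            ],
        },
        {"name": "adhoc", "capacity": 40},
    ],
}

shares = absolute_capacities(tree)
for queue_path, share in shares.items():
    print(f"{queue_path}: {share:.0%} of cluster")
```

Here root.batch.etl ends up with half of batch's 60%, or about 30% of the cluster. Features such as elasticity and preemption, which let queues borrow and reclaim capacity beyond these guarantees, are not modeled.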

There's a new release of KSQL out this week (more below). This post provides an overview of using it to analyze data from the Wikipedia EventStreams.

This guest post on the StreamSets blog shows how Predera uses Hive, Spark, and StreamSets for their data pipeline. The walkthrough includes example commands from Hive and screenshots from StreamSets.


This post looks at some of the new products and partnerships that Databricks has announced in the past year, including Databricks Delta and Azure Databricks.

Apache Flink has a year-in-review post, covering community growth, meetups and conferences, ecosystem and feature growth, and plans for 2018.

The Google Cloud Platform has a post highlighting a number of lesser-known facts about BigQuery. These include its support for User Defined Functions, several of the enterprise features for identity and access management, cell-level access control, and audit logging.

Learning Apache Apex, which was released in November, is available as an eBook at a discounted price of $5.


Version 1.21.0-incubating of Apache Pulsar was released. Key changes include enhancements to the Kafka API wrapper, an upgrade to Netty, better scalability for large numbers of topics, and secure replication via TLS.

The Ampool Data Service, which is built on Apache Geode, has been open sourced. Called Monarch, the code is on GitHub under an Apache License.

Version 0.3, the December release, of KSQL was announced. It includes several major features: Avro support, integration with the Confluent Schema Registry, the ability to convert between data formats, the ability to join across different formats, and support for basic metrics.

Version 0.4.6 of Scio, the Scala library for Apache Beam, was released. It includes an update to the Apache Beam version, several other minor features, and a bunch of bug fixes.

Apache HBase 1.4.0 was released. It resolves over 660 issues. Major features include a new shaded client that should improve compatibility, improvements to the REST client, enhanced autorestart capabilities, and improvements to RegionServer metrics.

Version 2.0 of Ampool was released. It includes security enhancements, column statistics, new file formats, and more. Also, a few weeks back, the Ampool Active Data Store was released on the AWS Marketplace.


Curated by Datadog


#ApacheKafkaTLV Hosting Gwen Shapira (Ramat Gan) - Wednesday, December 27


Discussion and Open Space for Emerging Big Data and Analytics Technologies (Singapore) - Wednesday, December 27