Data Eng Weekly

Data Eng Weekly Issue #294

23 December 2018

It's a busy issue this week as lots of folks published articles before the year end. A few of the must-reads include how the Guardian migrated from MongoDB to Postgres, an analysis of UDFs/builtins/map functions in Apache Spark, Facebook's work on HyperLogLog for Presto, and Hortonworks' post on a Container Storage Interface for Apache Hadoop YARN. There are also a lot of interesting tools to check out: KubeDB for running databases on Kubernetes, the discreETLy project that provides a business dashboard for Apache Airflow, and the Redix key-value store.


The Guardian writes about how they migrated from MongoDB to Postgres (storing data in JSONB). The post offers a glimpse of how they rolled out and operated the data migration, including how they replicated API calls to execute them on both databases and how they used structured logging. Lots of useful details if you're ever considering a similar database swap.
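
The idea of replaying each API call against both databases and logging the comparison can be sketched roughly as follows. This is a minimal illustration in Python, not the Guardian's code: `dual_call`, `primary`, `shadow`, and the log field names are all hypothetical, and the two stores are assumed to be plain callables.

```python
import json
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("migration")

Handler = Callable[[Dict[str, Any]], Any]


def dual_call(request: Dict[str, Any], primary: Handler, shadow: Handler) -> Any:
    """Serve from the primary store while replaying the call on the shadow store.

    Hypothetical helper: `primary` stands in for the old database and
    `shadow` for the new one. Callers only ever see the primary result.
    """
    result = primary(request)
    event = {"event": "shadow_compare", "request": request,
             "match": None, "error": None}
    try:
        event["match"] = shadow(request) == result
    except Exception as exc:
        # A failure in the shadow store must never affect the caller.
        event["error"] = repr(exc)
    # One JSON object per log line keeps the output machine-searchable.
    logger.info(json.dumps(event, sort_keys=True, default=str))
    return result
```

Mismatches and shadow errors then show up as searchable log events, while traffic continues to be served from the original database until the comparison results look clean.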

The Facebook blog has a post about the HyperLogLog (HLL) implementation in Presto that is used for computing approximate distinct counts. The post has a good introduction to the math behind HLL and why it can achieve big speedups.
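
For a feel of why HLL is cheap, here is a minimal sketch of the basic algorithm in Python. It is illustrative only and is not Presto's implementation: the class name, the choice of SHA-1 as the hash, and the default of 4096 registers are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative; not Presto's implementation)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                       # number of bits used to pick a register
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the original paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Take a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                 # top p bits choose the register
        rest = h & ((1 << width) - 1)    # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic mean of 2^register values, scaled by alpha * m^2.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Each value touches a single small register regardless of how many items have been seen, so memory stays fixed (4096 registers here) and duplicates never change the estimate. With this register count the expected relative error is roughly 1.6%.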

This post covers the various settings that can be used to tune the performance of an Apache Kafka cluster and its producers/consumers.

Apache Kafka 2.1 was released just over a month ago. This post has a good overview of all the new features and improvements in the release (there are quite a few!).

Hortonworks has a post comparing the performance of Hive in HDP 2.6.5 with that in 3.0. The version in 3.0 adds ACID semantics but also performs nearly 2x faster (in their TPC-DS analyses) thanks to dynamic runtime filtering and vectorization.

The Container Storage Interface (CSI) is an API standard for building pluggable storage volumes for distributed systems like YARN and Kubernetes. This post shows how to use HDFS, via the Apache Hadoop Ozone Object Store implementation and the csi-s3 project, as a CSI driver in YARN (or potentially other systems).

For batch / analytical workloads, there's a Hive to Kafka integration in the latest release of HDP. It's architected to integrate with Kafka's timestamps and offsets for time travel and other optimizations. The Hortonworks blog has a bunch more about the implementation.

The Streamlio blog describes how Apache Pulsar's architecture, which separates the storage and serving layers, contrasts with other streaming systems like Apache Kafka. Specific advantages include isolation of catch-up reads from real-time tailing, and infinite storage by offloading cold data to a blob store like S3.

Apache Hive 3.0 includes improved support for federating queries to JDBC sources, such as the ability to apply filters and projections in the source system. Apache Calcite is used for the query planning and can intelligently use source database-specific SQL features.

Facebook writes about their usage of zstandard, the efficient compression library, for use cases like ORC files, databases, and storing data in RocksDB.

Cloudera and Intel collaborated on an implementation of vectorized reading of Apache Parquet data in Apache Hive. This post describes vectorization, the implementation, and the performance improvements that this work has realized (about 26% on average).

This article explores the various ways to implement custom logic in Apache Spark, such as UDFs, map functions, and builtins. It analyzes the ease of implementation as well as the performance impact using some microbenchmarking. There's also a discussion of how each option impacts the query plan. For example, can predicate pushdown still be used?

Lyft's Airflow deployment covers 500 DAGs and executes on nearly 20 executor nodes. They share some tips and lessons learned from running Airflow at this scale, including how they use a "canary" health check to measure end-to-end availability, how they've improved DAG execution performance, and their Airflow customizations.

A good collection of tips for getting the most performance out of Apache Spark.

An interesting analysis of how high availability in a stream processing engine tends to add a lot of operational complexity, especially if you have to bring ZooKeeper and YARN into the mix. The author suggests an alternative version of HA—running the stream pipeline twice—to avoid bloat for a smaller use case.


The end of the year finds lots of folks looking for a change! Post a job to the Data Eng Weekly job board for $99.


DataStax Accelerate is taking place this May in National Harbor, Maryland. The Call For Papers is open through February 15th.

Datanami has compiled a collection of big data predictions for 2019, including the adoption of ML by enterprises, changes to real-time data processing thanks to 5G networks, additional edge computing, and slowing growth for Hadoop.

Confluent has written a follow-up post about the new Confluent Community License to address some frequent questions.


KubeDB is a set of Kubernetes Custom Resource Definitions, CLI tools, and more for running databases on Kubernetes. It supports Postgres, Elasticsearch, MySQL, MongoDB, and several other systems. Version 0.9.0 was released this week.

discreETLy is a new open source project from Fandom. It's a dashboard that presents a human-friendly view (e.g. for business owners) of your Apache Airflow workflows by consuming the Airflow API. It's stateless and can be run as a Docker container.

Version 3.0 of wolkenkit, the CQRS and event sourcing framework for Node.js, has been released. Major new features include a rewritten file storage, authorization and security features, and additional CLI commands.

Hadoop Submarine is a new project for running deep learning frameworks, like TensorFlow, on Apache Hadoop. It has integrations with Zeppelin and Azkaban for running jobs.

Cloudera Enterprise 6.1.0 is out, with updates across the entire suite of products. Their post has a summary of major features.

Apache Flink 1.7.1 was released with a couple dozen bug fixes.

Redix is a key-value store that implements the Redis protocol, persistence, and additional functionality like ACID transactions. Version 1.5 was just released.


Curated by Datadog


#ApacheKafkaTLV Hosting Gwen Shapira (Tel Aviv-Yafo) - Tuesday, December 25

Fullstack 2018 Hackathon: Boost Your MSA & Data Pipelines with Kafka Streams (Tel Aviv-Yafo) - Tuesday, December 25


Building Microservices with Reactive Architecture (Pune) - Saturday, December 29