Data Eng Weekly

Data Eng Weekly Issue #258

01 April 2018

Lots to choose from this week: stream processing, Apache Airflow, AWS compartmentalization, distributed systems, a "partition index" strategy for speeding up analytics queries, and more. There are also a couple of interesting articles reflecting on the current state of the Hadoop/big data ecosystem, and several releases (including Apache Kafka 1.1.0 and Data Artisans' dA Platform).


In a streaming architecture, you want to avoid joining streaming data with an external data source if at all possible. This post describes some approaches to avoiding the join by creating an in-memory index of the external data set. It discusses the implementation and performance characteristics of two of these: a trie and a Bloom filter.
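
To make the Bloom filter idea concrete, here is a minimal Python sketch. It is not the implementation from the linked post: the class, the `filter_stream` helper, the `user_id` field, and the sizing defaults are all hypothetical choices for illustration. The filter holds the keys of the external data set in memory, so each streaming event can be checked locally instead of being joined against the external source.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        # Sizing defaults are arbitrary illustration values, not tuned numbers.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filter_stream(events, known_ids: BloomFilter):
    """Yield only events whose user_id is probably in the external data set."""
    for event in events:
        if known_ids.might_contain(event["user_id"]):
            yield event
```

Because a Bloom filter can return false positives, a design like this usually treats a hit as "probably present" and tolerates or re-checks the occasional extra event. A miss is definitive, which is what lets the stream skip the external lookup.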

The Starburst blog has a post on adding a cost-based optimizer to Presto. With the new optimizer, they're seeing up to 15x throughput improvements thanks to join reordering and choosing the best join distribution algorithm.

Anna is a "key value store for any scale" with design goals of partitioning, multi-master, wait-free execution, and coordination-free consistency. The Morning Paper covers the paper from the IEEE International Conference on Data Engineering which has an overview of the architecture, design and implementation, and a performance evaluation.

Amazon CTO Werner Vogels has written about the evolution of compartmentalization at AWS—regions, Availability Zones, and AWS HyperPlane. The post discusses the AWS design goals and application design patterns that make use of the regions and AZs.

The data platform at MakeMyTrip uses Airflow and a custom system to ship data from Apache Kafka into Amazon S3 for querying by Presto. This post describes the custom system's design goals (several driven by gaps in Camus, the system that they previously used), its architecture, MakeMyTrip's batch and near-realtime data pipelines, their data storage (with Parquet) and partitioning strategies, and their alerting procedures.

This primer on distributed systems covers the two generals problem, FLP impossibility, failure modes, and more. It's a good intro (or refresher) to these distributed systems/consensus topics.

With the ephemeral nature of Kubernetes applications, the classic mechanism of ssh'ing to a server to search logs is less realistic. This post looks at using fluent-bit and fluentd to collect and store logs from a Kubernetes cluster.

An end-to-end example (including code on GitHub) of building an analytics pipeline using the Google Cloud Platform stack (PubSub, DataFlow, Cloud Storage).

Pinterest writes about improvements to their HBase backup strategy. They've implemented direct-to-S3 backups and a deduplicator, which has reduced storage footprint by as much as 137x.

When you're querying data across a file system (e.g. with Hive, Impala, Spark), the query engine can use partition pruning to avoid scanning all of the data. But if there isn't a partition predicate in your query, then you're left with a full table scan. This post describes a "partition index" strategy for indexing in which the system maintains a ledger of all partitions containing a particular column/value and modifies inbound queries to specify only those partitions. If you're doing a lot of needle-in-the-haystack queries, this could make a major improvement to your system's query throughput.
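
The ledger-and-rewrite idea can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the system from the linked post: `PartitionIndex`, `rewrite_query`, the `dt` partition column, and the table and column names are all hypothetical, and the SQL is built by plain string formatting with no escaping, which a real system would not do.

```python
from collections import defaultdict


class PartitionIndex:
    """Ledger mapping (column, value) to the partitions that contain it."""

    def __init__(self) -> None:
        self._ledger = defaultdict(set)

    def record(self, partition: str, column: str, value: str) -> None:
        # Called at ingest time for each column/value seen in a partition.
        self._ledger[(column, value)].add(partition)

    def partitions_for(self, column: str, value: str) -> list:
        return sorted(self._ledger.get((column, value), set()))


def rewrite_query(index, table, partition_column, column, value):
    """Add a partition predicate so the engine can prune partitions.

    Returns None when the ledger has no partition holding the value,
    meaning the query can be skipped entirely.
    """
    partitions = index.partitions_for(column, value)
    if not partitions:
        return None
    in_list = ", ".join(f"'{p}'" for p in partitions)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {column} = '{value}' AND {partition_column} IN ({in_list})"
    )
```

For example, after recording that `user_id = 'u42'` appears in two daily partitions, `rewrite_query(index, "events", "dt", "user_id", "u42")` yields a query restricted to those two `dt` values, so the engine scans two partitions instead of the whole table.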


DataEngConf is in two weeks in San Francisco. The lineup has a great mix of startups, open source projects, software vendors, and larger companies.

This post argues that Kubernetes is the future of big data. There are both technical arguments (Kubernetes is a great equalizer across all cloud services) and business ones (vendor support has more momentum than YARN).

With Hadoop 3.0's introduction of erasure coding for HDFS, this author observes that the Hadoop community is embracing the decoupling of storage and compute.

SiliconANGLE has an interview with Hortonworks CEO Rob Bearden, in which he discusses the strength of their Hadoop business, his bearishness on the cloud (2/3rds of their revenue is from on-prem), as well as thoughts on the evolution of the big data ecosystem over the past several years.

The CFP for HBaseCon, which takes place this June in San Jose, is open through April 16th.


Several weeks ago, Druid 0.12.0 was released with performance and stability improvements, bug fixes, enhancements to SQL and auth, and more.

The Ambari Airflow Mpack is a new package for deploying Apache Airflow (incubating) via Apache Ambari. This post introduces it and describes the deployment process. Code is on GitHub.

Alfred is a new open source "data butler" that provides business users with the ability to upload data to a sandbox, promote it to a production data set, and build derivative datasets. Alfred is built with Spring Boot for the HTTP services and React for the front end. Alfred is Apache 2.0-licensed, and there's a quickstart VM to try it out.

Apache Accumulo 1.7.4 has been released with fixes to the upgrade process, a Guava upgrade, and a total of nearly 50 other fixes/improvements.

Data Artisans has announced general availability of dA Platform, which includes Apache Flink for stream processing. They have a trial sandbox VM and a trial distro for running on a Kubernetes cluster.

Version 2.3.1 of Apache Kylin, the OLAP engine built on Hadoop, has been released. It includes 12 fixes and enhancements.

Starburst Enterprise Distribution of Presto, version 195e, has been released with the above-mentioned cost-based optimizer for Presto.

Apache Kafka 1.1.0 is out with several new features and improvements, including more efficient replication with large numbers of partitions, intra-node disk balancing, support for some dynamic configs, and several updates to Kafka Connect and Kafka Streams.

BlazingDB has announced native support for Apache Parquet in their GPU-accelerated SQL engine.



Curated by Datadog



IOT Architecture (St. Louis) - Wednesday, April 4


Apache Spark Maryland 2018 Kickoff! (Hanover) - Thursday, April 5


How Tivo Uses Presto (Boston) - Thursday, April 5


SQL-on-Anything with Presto, and Low-Latency SQL with Druid (Amsterdam) - Tuesday, April 3


AWS Big Data with Dan O'Brien (Auckland) - Thursday, April 5