Data Eng Weekly

Data Eng Weekly Issue #299

27 January 2019

This week's issue features several technologies that are less commonly covered in the newsletter. Topics include Logstash on Kubernetes, the Alpakka stream processing framework, experiences with Bigtable, and SQL on Apache Beam. There are also several new releases (including a couple of newly open sourced projects) this week, and a couple of good posts on general distributed systems, including the two-phase commit protocol.


This article shows how to run PySpark in a Docker container with a Jupyter notebook for querying data. It includes a tutorial on analyzing and visualizing open government data.

This post walks through deploying Logstash with Filebeat to ingest logs from pods running on Kubernetes.

Alpakka is a streaming platform built atop Akka and Reactive Streams. This post describes why one company decided to use Alpakka rather than alternative stream processing frameworks. They cite the ability to make async calls to external services and to parallelize steps in the pipeline as key advantages of Alpakka over Kafka Streams and Spark Streaming.
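To illustrate the idea of making async calls to an external service with bounded parallelism, here is a minimal sketch in Python's asyncio. It is an analogy only, not Alpakka's API and not the code from the post; `lookup`, `map_async`, and `run` are hypothetical names, and the external service is simulated with a short sleep.

```python
import asyncio


async def lookup(item):
    # Hypothetical stand-in for a call to an external service.
    await asyncio.sleep(0.01)
    return item * 2


async def map_async(items, fn, parallelism):
    """Apply fn to each item with at most `parallelism` calls in flight,
    returning results in input order."""
    semaphore = asyncio.Semaphore(parallelism)

    async def bounded(item):
        # The semaphore caps how many calls run concurrently.
        async with semaphore:
            return await fn(item)

    # gather returns results in the same order as its arguments.
    return await asyncio.gather(*(bounded(i) for i in items))


def run(items, parallelism=4):
    # Convenience wrapper that drives the pipeline stage to completion.
    return asyncio.run(map_async(items, lookup, parallelism))
```

Calling `run([1, 2, 3])` processes the three items concurrently while keeping the output in input order, which is the property that makes this kind of stage easy to drop into an ordered pipeline.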

An overview of enriching data in an Apache Beam pipeline with data from an external RDBMS. It looks at some of the key primitives and a few different solutions (with their tradeoffs).

With the caveat that you should always benchmark your own workloads, this post has a performance and cost analysis of the Intel and AMD instance types for big data systems in AWS. They found that the Intel (r5) instances are faster than the AMD (r5a) instances across a few different benchmarks.

A good introduction to Apache Kafka that helps to emphasize key features and several things Kafka explicitly doesn't do. If you're a new user of the project, this post has a lot of good context.

The Ravelin team shares their experiences of using Google Bigtable for a year and a half. They migrated a large PostgreSQL database to Bigtable, and discuss latency, replication, and more from their time running on it.

StreamSets has implemented encryption and decryption using the AWS Encryption SDK with keys from AWS Key Management Service or other sources (like Vault). This tutorial shows how to implement streaming encryption and decryption when sending to or reading from Amazon Kinesis.

THRON shares seven tips that they've learned working with Apache Spark in the cloud. They have recommendations on data skew, memory settings, storing data in Amazon S3 (both copying data in and file formats), and more.

A post on implementing a data platform using AWS Glue, Amazon Kinesis Data Firehose, Amazon Athena, and other serverless technologies.

WePay writes about how they've implemented a high availability solution for MySQL using GitHub's Orchestrator, HAProxy, Consul, and pt-heartbeat. In Kubernetes, they're using HAProxy as a sidecar, and consul-template provides dynamic HAProxy configuration changes. The post has a detailed overview of how both planned and unplanned failovers happen, and how well the solution has worked thus far to keep downtime during failover under 1 minute.

A good take on why Apache Kafka and Kafka Connect, when used at the center of an architecture, provide an escape hatch. For example, if you make the wrong bet on a storage backend, you can send data from Kafka to a new backend to migrate away.

An overview of using Helm charts to deploy the Confluent Platform on Amazon EKS. The post includes commands to get a shell on various pods to execute commands that help explore and debug.

A look at the new features in Apache Airflow 1.10.2, including improvements to support for sensors, theming of the RBAC UI, and more.

Apache Beam now supports SQL via the Java SDK, which is explored in this article that runs pipelines with the Google Cloud Dataflow backend.

Qubole writes about an optimization they've implemented for Apache Spark when writing data to Amazon S3. The so-called "Direct Writes" avoid a final copy as part of writing to an Apache Hive table, which is up to 40x faster in some of their experiments.

This post has a great introduction to two-phase commit, which is commonly used for transactions in distributed systems. The author argues that its complexity and problems are due to a basic assumption that needs to be rethought, and he then describes a potential replacement that simplifies implementation.
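For readers new to the protocol, the happy path and the abort path of two-phase commit can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the code or the replacement protocol from the post; `Participant` and `two_phase_commit` are hypothetical names, and the sketch deliberately ignores crashes, timeouts, and durable logging, which are where the real complexity lies.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """A hypothetical in-memory participant that stages a write before committing."""
    name: str
    can_commit: bool = True  # set False to simulate a "no" vote
    committed: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, txid, key, value):
        # Phase 1: stage the write and vote. A real participant would
        # durably log the staged write before voting yes.
        if not self.can_commit:
            return False
        self.staged[txid] = (key, value)
        return True

    def commit(self, txid):
        # Phase 2 (commit): make the staged write visible.
        key, value = self.staged.pop(txid)
        self.committed[key] = value

    def abort(self, txid):
        # Phase 2 (abort): discard the staged write, if any.
        self.staged.pop(txid, None)


def two_phase_commit(participants, txid, key, value):
    """Return True if the transaction committed on every participant."""
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare(txid, key, value) for p in participants]
    if all(votes):
        # Phase 2: unanimous yes, so every participant commits.
        for p in participants:
            p.commit(txid)
        return True
    # Phase 2: at least one no vote, so every participant aborts.
    for p in participants:
        p.abort(txid)
    return False
```

A single participant voting no is enough to abort the transaction everywhere, which is the all-or-nothing guarantee the protocol exists to provide.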

This in-progress series details the implementation of a distributed resource allocation library. The first two parts are live, covering the basic requirements, the design (such as failure scenarios), and invariants.


Big Data Tech Warsaw 2019 has announced some of the speakers for the upcoming conference, which takes place on February 27th.

Datanami has coverage of a new threat to Apache Hadoop, Redis, and ActiveMQ clusters. Attackers are brute-forcing weak passwords of exposed services.

Microsoft has acquired Citus Data, the makers of a Postgres extension for running the database on multiple nodes.


Apache Hadoop 3.2.0 was released. Highlights of the release include support for the latest Azure Datalake storage, enhanced S3 performance, Hadoop Submarine for deep learning, and a storage policy satisfier to move blocks amongst storage types.

StreamSets and Snowflake announced a new partnership and a new StreamSets for Snowflake integration.

PingCAP has open sourced DM, its data migration platform. The 1.0.0-alpha version focuses on migrating data (including via incremental migration) from MySQL/MariaDB to TiDB.

Version 1.2.1 of Stream Reactor, which provides Kafka Connect connectors, has been released. The new release adds support for Hive source and sink (including auto creating Hive tables, schema migration, and Parquet+ORC file formats). There are also a number of fixes and new features to the over 25 other connectors.

Version 3.1 of Dremio, the Data-as-a-Service platform, has been released. This release adds multitenant workload controls, performance improvements via the Gandiva (Apache Arrow integration) initiative, and more.

Frafka implements Frizzle, which is a message bus for parallel processing using goroutines, atop Kafka. It's a new open source project with a v0.0.1 release.


Curated by Datadog



Apache Hadoop Contributors Meetup (Mountain View) - Wednesday, January 30


Cleveland Big Data Meetup (Cleveland) - Monday, January 28


Self-Tuning Data Systems (Jacksonville) - Tuesday, January 29

New York

Druid NYC Meetup @ Datadog (New York) - Tuesday, January 29


January Presentation Night @ WB Games Boston (Needham) - Thursday, January 31


Accelerating Data to Gain Insights! (Stockholm) - Wednesday, January 30


Evolution Gaming: Event Sourcing Using the Kafka-Journal (Riga) - Wednesday, January 30


Speeding Up PySpark with Arrow, by Ruben Berenguel (Barcelona) - Thursday, January 31


January Data Engineering Meetup @ Atlassian (Sydney) - Thursday, January 31

Data Engineering Melbourne Meetup (Melbourne) - Thursday, January 31