Data Eng Weekly

Data Eng Weekly Issue #316

07 July 2019

Lots of great articles despite the short week in the US—several articles on the Kafka ecosystem as well as posts on Postgres, distributed tracing, Elasticsearch, event sourcing, ORC with Apache Spark, and chaos engineering.


hypopg is a tool for testing how a new index will impact your Postgres query efficiency via EXPLAIN. It's a Postgres extension whose functions update metadata about indexes in memory (without actually creating them), which it then uses for each EXPLAIN you issue in the same session.
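As a rough illustration of that workflow, here is a minimal Python sketch that only builds the SQL statements you might run, in order, within a single session. The helper name, the `orders` table, and the example query are hypothetical; `hypopg_create_index` and `hypopg_reset` are functions provided by the extension, but check the hypopg documentation for your version before relying on this.

```python
from typing import List


def hypopg_session_statements(index_ddl: str, query: str) -> List[str]:
    """Return SQL statements to run, in order, in ONE database session.

    Hypothetical helper: it only builds strings and does not connect
    to a database.
    """
    # Escape single quotes so the DDL can be passed as a SQL string literal.
    quoted_ddl = index_ddl.replace("'", "''")
    return [
        # Assumes the extension is installed on the server.
        "CREATE EXTENSION IF NOT EXISTS hypopg;",
        # Registers a hypothetical index in memory; nothing is built on disk.
        f"SELECT * FROM hypopg_create_index('{quoted_ddl}');",
        # Plain EXPLAIN (not EXPLAIN ANALYZE), so the planner can consider
        # the hypothetical index without executing the query.
        f"EXPLAIN {query.rstrip(';')};",
        # Drops all hypothetical indexes for this session.
        "SELECT hypopg_reset();",
    ]


if __name__ == "__main__":
    for statement in hypopg_session_statements(
        "CREATE INDEX ON orders (customer_id)",          # hypothetical table
        "SELECT * FROM orders WHERE customer_id = 42;",  # hypothetical query
    ):
        print(statement)
```

If the EXPLAIN output shows an index scan on the hypothetical index, that suggests the real index may be worth creating and measuring.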

Datadog has shared a collection of tips related to running Apache Kafka. Their post covers changing the maximum message size, settings for unclean leader election, and consumer offset & segment retention on low-throughput topics. There's a good description of the design and mechanics of each of these items—definitely worth checking out whether you're new to Kafka or have been operating it for some time.
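To make the message-size item concrete, here is a small Python sketch of a sanity check across broker, producer, and consumer settings. It is not taken from the Datadog post: the function name and the sample values are hypothetical, and the rules are a conservative assumption (newer Kafka versions treat the fetch sizes as soft limits rather than hard ones). The setting names themselves are standard Kafka configuration keys.

```python
from typing import Dict, List


def check_message_size_settings(
    broker: Dict[str, int],
    producer: Dict[str, int],
    consumer: Dict[str, int],
) -> List[str]:
    """Return warnings for size settings that look inconsistent.

    Hypothetical helper; the rules are conservative rules of thumb,
    not a statement of what any particular Kafka version enforces.
    """
    warnings: List[str] = []
    limit = broker["message.max.bytes"]

    # Followers fetch from the leader; a smaller fetch size than the
    # largest allowed message has historically stalled replication.
    if broker["replica.fetch.max.bytes"] < limit:
        warnings.append("replica.fetch.max.bytes is below message.max.bytes")

    # A producer allowed to send larger requests than the broker accepts
    # can have its sends rejected.
    if producer["max.request.size"] > limit:
        warnings.append("producer max.request.size exceeds message.max.bytes")

    # Consumers should be able to fetch the largest message the broker allows.
    if consumer["max.partition.fetch.bytes"] < limit:
        warnings.append("consumer max.partition.fetch.bytes is below message.max.bytes")

    return warnings


if __name__ == "__main__":
    # Sample values are illustrative only.
    problems = check_message_size_settings(
        broker={"message.max.bytes": 5_000_000, "replica.fetch.max.bytes": 1_048_576},
        producer={"max.request.size": 5_000_000},
        consumer={"max.partition.fetch.bytes": 1_048_576},
    )
    for problem in problems:
        print("WARNING:", problem)
```

The point of the sketch is that raising the maximum message size is rarely a single-setting change; the limit has to be consistent along the whole produce, replicate, and consume path.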

An interesting look at lessons learned (an important one being distinguishing between business process events and "message bus" events) while implementing an event sourcing infrastructure, and some of the design decisions they made to scale their system up (like JSONB rather than trying to maintain a relational schema).

kafka4s is a new open sourced functional programming library for Apache Kafka and Scala. It includes some nice features for working with a Schema Registry/Avro/case classes as well as producing and consuming messages using idiomatic Scala.

This piece describes the basics of distributed tracing, and where the UX of typical systems falls short (especially when analyzing complex distributed systems). There are some suggestions for better ways to visualize the data to provide actionable insights, including dynamic service graphs and service-centric views.

It's exciting times in the US, as our women's national team won the World Cup. A timely post looks at analyzing tweets and RSS data about the World Cup using Apache Kafka, KSQL, Kafka Connect, Google BigQuery, and more. Tweets are scored for sentiment using Google's Natural Language APIs (via a KSQL User Defined Function) and synced to Google BigQuery using the BigQuery connector.

Since Apache Spark 2.3, there's a native & vectorized Apache ORC file reader for up to 10x speedups over the previous implementation. This article shows how to use spark-bench (which seems like a handy tool for doing your own benchmarking with Spark!) to compare the native and non-native readers, and to compare their speed to that of Apache Parquet on a single-node system.

Devartis writes about upgrading their Elasticsearch cluster. They were jumping across several versions so it was a major migration. The post covers the mechanics of the upgrade, which was done by running clusters with both versions in parallel with a hard cutover during a maintenance window. They also write about the improvements in index size, response time, and stability. If you're running an old version of Elasticsearch, it sounds like the upgrade is well worth it.

The morning paper has coverage of Netflix's paper on their Chaos Automation Platform. There are some interesting practical ideas in their paper: by treating chaos experiments like A/B tests, they can form hypotheses and test outcomes as they inject faults. They also talk about their safeguards (such as only running experiments during business hours on weekdays) and why they don't see error rates as a good signal.


Curated by Datadog


#SDBigData Meetup (San Diego) - Wednesday, July 10

Using Apache Kafka and Cassandra to Scale Next-Gen Apps (San Diego) - Thursday, July 11


What They Never Told You about SOA & How Stream Processing Helps (Austin) - Monday, July 8

New York

Accelerating Analytical Workloads for Public & Hybrid Clouds (New York) - Wednesday, July 10


Data Engineering: Use Cases and Career Panel Discussions (Toronto) - Monday, July 8


Lake Formation, Service Meshes, and Apache Kafka in AWS (Bogota) - Wednesday, July 10


Kafka: Availability and Disaster Recovery (Paris) - Tuesday, July 9


From Zero to Hero with Kafka Connect (Nieuwegein) - Thursday, July 11


Kafka Intro + Data Streaming in the Cloud with Kafka (Dortmund) - Monday, July 8

Building a Streaming ETL Solution with Rail Data (Stuttgart) - Wednesday, July 10


NiFi & Darwin (Bologna) - Tuesday, July 9


Umuzi Academy Does Data + Keep Cool with Airflow (Johannesburg) - Thursday, July 11


First Apache Airflow Melbourne Meetup (Richmond) - Wednesday, July 10

Large-Scale Data Processing with Spark and Hazelcast (Canberra) - Thursday, July 11

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.