05 May 2019
Lots of great content this week from data platforms at Lyft and Facebook to a paper out of Google that argues for replacing memcached/redis with stateful services, to a new open source knowledge graph store built on Kafka out of eBay. There should be something for everyone in this issue, whether you're interested in data platforms, stream processing, or low-level performance optimizations.
Transactions in distributed systems are complicated, and it's sometimes useful to understand the protocol to ensure your implementation is adhering to the design. This post describes Apache Kafka's transaction layer by explaining why transaction ids are important and visualizing the zombie-fencing process.
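The fencing idea the post visualizes can be sketched roughly: the transaction coordinator tracks an epoch per transactional id, bumps it whenever a producer re-registers, and rejects requests carrying a stale epoch. A toy model of that mechanism (illustrative names, not Kafka's actual API):

```python
# Toy model of Kafka-style zombie fencing: the coordinator bumps an
# epoch each time a producer registers a transactional id, and any
# request carrying an older epoch is rejected. This is a sketch of
# the idea, not Kafka's real implementation.

class TransactionCoordinator:
    def __init__(self):
        self.epochs = {}  # transactional id -> current epoch

    def init_producer(self, txn_id):
        """Register a producer instance; any older instance is now fenced."""
        self.epochs[txn_id] = self.epochs.get(txn_id, -1) + 1
        return self.epochs[txn_id]

    def commit(self, txn_id, epoch):
        if epoch != self.epochs.get(txn_id):
            raise RuntimeError("fenced: stale producer epoch")
        return "committed"

coord = TransactionCoordinator()
old_epoch = coord.init_producer("orders-app")  # original producer
new_epoch = coord.init_producer("orders-app")  # restarted instance

print(coord.commit("orders-app", new_epoch))   # committed
try:
    coord.commit("orders-app", old_epoch)      # the zombie is fenced
except RuntimeError as e:
    print(e)
```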
In my experience, database schema migrations are either automated and prone to unanticipated issues or applied manually and prone to human error. SendGrid writes about how they've automated the schema migration process and introduced guardrails using skeema, an open source tool for managing MySQL tables. They've achieved the best of both worlds, and they're doing it for not-so-small databases.
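skeema's approach is declarative: desired CREATE TABLE definitions live in a repo, and the tool diffs them against the live database to generate the ALTERs. A very rough illustration of that diffing idea (my own simplification, not skeema's actual logic):

```python
# Rough sketch of the declarative-diff idea behind tools like skeema:
# compare the desired column set against the live one and emit ALTER
# statements. Real tools parse SQL and handle many more cases
# (indexes, charsets, renames, safety checks).

def diff_columns(table, live, desired):
    """live/desired map column name -> type; returns ALTER statements."""
    stmts = []
    for col, typ in desired.items():
        if col not in live:
            stmts.append(f"ALTER TABLE {table} ADD COLUMN {col} {typ}")
        elif live[col] != typ:
            stmts.append(f"ALTER TABLE {table} MODIFY COLUMN {col} {typ}")
    for col in live:
        if col not in desired:
            stmts.append(f"ALTER TABLE {table} DROP COLUMN {col}")
    return stmts

live = {"id": "BIGINT", "email": "VARCHAR(100)"}
desired = {"id": "BIGINT", "email": "VARCHAR(255)", "created_at": "DATETIME"}
for stmt in diff_columns("users", live, desired):
    print(stmt)
```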
This post, the third in a series, describes some of the more interesting parts of implementing a proxy that talks the Postgres protocol and forwards messages to the RediSQL engine. There's a good introduction to the protocol, some easy-to-read Python snippets, and a discussion of some of the trickier bits of getting the code working. All four parts are worth a read.
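For a flavor of what the proxy has to deal with: most Postgres frontend messages after startup share a simple framing, a one-byte type tag followed by a 32-bit big-endian length that counts itself but not the tag. A minimal encoder/decoder for the simple-query ('Q') message:

```python
import struct

# Minimal framing for Postgres frontend messages (post-startup):
# a 1-byte type tag, then a 4-byte big-endian length that includes
# itself but not the tag, then the payload. A simple query ('Q')
# carries a NUL-terminated SQL string.

def encode_query(sql):
    payload = sql.encode() + b"\x00"
    return b"Q" + struct.pack("!I", 4 + len(payload)) + payload

def decode_message(buf):
    tag = buf[0:1].decode()
    (length,) = struct.unpack("!I", buf[1:5])
    payload = buf[5:1 + length]
    return tag, payload

msg = encode_query("SELECT 1")
tag, payload = decode_message(msg)
print(tag, payload.rstrip(b"\x00").decode())  # Q SELECT 1
```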
Facebook writes about its usage of Presto for analytics, real-time dev/ad analytics (which has a 99.999% availability SLA), batch ETL, and more. There's a lot of interesting discussion of Presto's internals, including how its optimizer focuses on multitenancy/throughput and future plans to explore Graal to generate optimal machine code (e.g. for certain math operations). They include benchmarks and other performance data points, including metrics on the effectiveness of lazy loading from columnar data files for batch ETL. Presto is used to query lots of data sources at Facebook, depending on the use case.
Pretty neat look at optimizing low-level data deserialization in Presto's ORCFile reader. These small code changes result in 4-9% performance improvements. The post includes good details of the relevant JVM internals, too.
The Software Engineering Daily Podcast has an interview with Li Gao about the data platform at Lyft. They talk about a number of technologies that Lyft is using for stream processing (including Apache Kafka and Apache Flink) and ETL (Apache Spark on Kubernetes). It's a good interview (there's also a transcript if you prefer to read) that covers a lot of ground, including the evolution of their data platform over the past couple of years.
This paper presents a new abstraction, the linked, in-memory key-value store, which is a replacement for remote, in-memory key-value stores like Memcached and Redis. There's a lot to this paper, but the key idea is to move caches to the individual services (the return of stateful services) and pair them with an autosharder and a new library that works with the autosharder to relocate values during a resharding event and to provide data consistency. The authors present results that reflect impressive reductions in end-to-end latency for a service at Google.
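The relocation problem the autosharder solves can be sketched simply: when the shard count changes, some keys hash to a new owner and their values must move. A toy model of that calculation (my own names, not the paper's):

```python
import hashlib

# Toy sketch of the resharding problem the paper's autosharder handles:
# each service instance owns a cache shard, and when the shard count
# changes, keys whose owner changes must be relocated. Names and
# hashing scheme here are illustrative, not from the paper.

def shard_of(key, num_shards):
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % num_shards

def keys_to_relocate(keys, old_shards, new_shards):
    """Keys whose owning shard changes when resharding."""
    return [k for k in keys
            if shard_of(k, old_shards) != shard_of(k, new_shards)]

keys = [f"user:{i}" for i in range(100)]
moved = keys_to_relocate(keys, 4, 5)
print(f"{len(moved)}/{len(keys)} keys relocate when going 4 -> 5 shards")
```

With plain modulo hashing most keys move on every reshard, which is one reason real autosharders lean on schemes like consistent hashing to keep relocation traffic small.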
Enjoyable article that compares Kafka-based microservices architecture with that of mining and manufacturing pipelines in the game Factorio. Lots of fun analogies and animations in the post help to explain concepts like a consumer group and hot partitions.
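The hot-partition analogy is easy to demonstrate: keyed records always land on the same partition, so one very popular key overloads a single consumer while its siblings sit idle. A small illustration (CRC32 standing in for Kafka's murmur2 hash):

```python
import zlib
from collections import Counter

# Keyed records always land on the same partition (CRC32 here stands
# in for Kafka's murmur2 hash), so a skewed key distribution makes
# one partition "hot" while the others sit mostly idle.

def partition_for(key, num_partitions=4):
    return zlib.crc32(key.encode()) % num_partitions

# 97% of traffic comes from one key -- its partition becomes a bottleneck.
records = ["user-1"] * 97 + ["user-2", "user-3", "user-4"]
load = Counter(partition_for(k) for k in records)
print(load.most_common())
```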
If you've been in data engineering long enough, you've seen a few generations of workflow engines come along, evolve, and ultimately fail to solve certain use cases. Prefect writes about the features of their engine, like dynamic workflows and passing data between tasks, that make for a new generation of workflow engine. For folks using Airflow, there is a bunch of useful discussion about where Apache Airflow tends to run into issues (better to anticipate those problems than be surprised by them down the road!).
eBay has open sourced Beam, a knowledge graph store. Beam implements a gRPC/HTTP API server that stores RDF-like data and supports SPARQL-like querying. The system is built on Apache Kafka for storage and implements a DiskView that captures (subject-predicate and predicate-object) pairs in a RocksDB store. The API server implements a query planner, optimizer, and executor. Note that Beam is not yet in production use.
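The two DiskView layouts amount to indexing each triple twice, once by (subject, predicate) and once by (predicate, object), so lookups can start from either end of a triple. A sketch with dicts standing in for RocksDB:

```python
from collections import defaultdict

# Sketch of the two DiskView index layouts: one view keyed by
# (subject, predicate) and one by (predicate, object), modeled here
# with in-memory dicts standing in for RocksDB. Illustrative only.

class DiskView:
    def __init__(self):
        self.sp = defaultdict(list)  # (subject, predicate) -> objects
        self.po = defaultdict(list)  # (predicate, object) -> subjects

    def insert(self, subject, predicate, obj):
        self.sp[(subject, predicate)].append(obj)
        self.po[(predicate, obj)].append(subject)

view = DiskView()
view.insert("postcard1", "color", "blue")
view.insert("postcard2", "color", "blue")

print(view.sp[("postcard1", "color")])  # forward: what color is postcard1?
print(view.po[("color", "blue")])       # reverse: what things are blue?
```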
Curated by Datadog ( http://www.datadog.com )
Data Eng Monthly Meetup (San Diego) - Thursday, May 9
Using Apache Cassandra and Apache Kafka to Scale Next-Gen Applications (Los Angeles) - Thursday, May 9
Streaming ETL with Apache Kafka and KSQL (Chicago) - Tuesday, May 7
Event-Driven Architecture with Apache Kafka (Kitchener) - Wednesday, May 8
From Batch to Near-Real-Time: A Journey of Our Transition to Kudu (Montreal) - Thursday, May 9
Via Varejo Taking Data from Legacy to a New World + Cloud on Apache Kafka (Sao Paulo) - Thursday, May 9
Future of Data Santiago (Las Condes) - Tuesday, May 7
Real Data Engineering (Oslo) - Monday, May 6
Apache Beam at Spotify + Beam at EQT + Tech Deep Dive (Stockholm) - Monday, May 6
DevOps Meets Data Pipelines (Helsinki) - Tuesday, May 7
Kafka and Security: Do You Accept the Challenge? (Madrid) - Wednesday, May 8
Kafka Streaming (Stuttgart) - Tuesday, May 7
Free Data Flow for Traveler Information with Apache Kafka Streams (Frankfurt am Main) - Wednesday, May 8
Build Data-Parallel Processing Pipelines Using Apache Beam (Singapore) - Thursday, May 9
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.