16 June 2019
A couple of interesting new tools, online course materials for distributed systems, and several posts on data formats (including a look at the performance of HDFS erasure coding). Also a look at the Apache Flink network stack, a new Jepsen analysis, and how Apache Kafka is used for astronomy. Should be something for everyone!
Kedro is a new data and development workflow framework that implements best practices for data pipelines with an eye towards productionizing machine learning models. Kedro has an interesting SDK for working with datasets, and it has an integration with PySpark. There are some great docs for the project including several tutorials.
The Apache Flink blog has a thorough design overview of the Netty- and Akka-based network stack. For Flink 1.5, they've implemented credit-based flow control for multiplexed channels. The architecture includes some clever backpressure handling to keep latency low while providing high throughput.
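To make the idea of credit-based flow control concrete, here is a minimal sketch of the general technique, not Flink's actual implementation. The class and function names (`Receiver`, `Sender`, `run`) are hypothetical, and the model assumes one credit per free receive buffer: the sender may only transmit while it holds credits, and the receiver grants a new credit each time it frees a buffer.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver with a fixed buffer pool; one credit per free buffer."""

    def __init__(self, num_buffers):
        self.free_buffers = num_buffers
        self.queue = deque()

    def initial_credits(self):
        # Announce every free buffer to the sender as one credit.
        return self.free_buffers

    def receive(self, record):
        # A well-behaved sender never sends without credit, so this should not trigger.
        if self.free_buffers == 0:
            raise RuntimeError("sender exceeded its credit")
        self.free_buffers -= 1
        self.queue.append(record)

    def consume(self):
        """Process one queued record, free its buffer, and return the credits granted."""
        if not self.queue:
            return 0
        self.queue.popleft()
        self.free_buffers += 1
        return 1


class Sender:
    """Hypothetical sender that transmits only while it holds credits."""

    def __init__(self, records):
        self.backlog = deque(records)
        self.credits = 0

    def add_credits(self, n):
        self.credits += n

    def send_available(self, receiver):
        sent = 0
        while self.backlog and self.credits > 0:
            receiver.receive(self.backlog.popleft())
            self.credits -= 1
            sent += 1
        return sent


def run(num_records, num_buffers):
    """Simulate rounds of send/consume and return how many records were sent per round."""
    receiver = Receiver(num_buffers)
    sender = Sender(range(num_records))
    sender.add_credits(receiver.initial_credits())
    rounds = []
    while sender.backlog or receiver.queue:
        rounds.append(sender.send_available(receiver))
        sender.add_credits(receiver.consume())
    return rounds
```

With a slow consumer, `run(5, 2)` returns `[2, 1, 1, 1, 0]`: the sender bursts up to the announced credit and then falls back to the receiver's pace, which is how backpressure emerges in this toy model without the receiver ever being overrun.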
Debezium has started a community newsletter highlighting articles, releases, and Q&A related to their change data capture system. Lots of good stuff to read in here.
The Cloudera blog has a post on HDFS erasure coding, comparing performance on various benchmarks (e.g. TeraSort and word count) between data stored using replication and data stored using erasure coding. They also describe failure recovery, which takes a bit longer for erasure coding. The post also covers several factors that can impact performance, like data locality, block size, and Intel's hardware acceleration.
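As a rough illustration of the trade-off (not taken from the Cloudera post), the sketch below compares the storage overhead of 3x replication with a Reed-Solomon layout of 6 data and 3 parity units, and uses simple XOR parity as a stand-in for erasure coding to show why recovery costs more: rebuilding a lost block means reading the surviving blocks and recomputing, rather than copying a replica. The helper names are hypothetical, and Reed-Solomon itself is more involved than XOR.

```python
from functools import reduce


def storage_overhead(data_units, redundancy_units):
    """Extra storage as a fraction of the original data size."""
    return redundancy_units / data_units


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# 3x replication: one original plus two extra copies (200% overhead).
replication_overhead = storage_overhead(1, 2)

# Assumed Reed-Solomon layout: 6 data units plus 3 parity units (50% overhead).
rs_6_3_overhead = storage_overhead(6, 3)

# XOR parity as a simplified stand-in for erasure coding.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)

# Suppose data[1] is lost: rebuild it from the surviving blocks and the parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Here `recovered` equals the lost block `b"efgh"`, but producing it required reading every other block in the group, which is consistent with the slower recovery the post describes.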
This article describes a really interesting astronomy use case for Apache Kafka. After analyzing images of the night sky, events about changing objects (at the rate of ~1 million per night) are published to Kafka for analysis downstream by researchers.
Another excellent Jepsen post, this time on TiDB. There are a lot of interesting details about failure scenarios and edge cases, proving once again that distributed systems are complex. I particularly appreciated the discussion of consistency models, which describes how TiDB's compare to MySQL's.
PingCAP, the creators of TiDB, have an online self-directed course for writing distributed systems in Go and Rust. They provide interfaces and test cases for a number of coding exercises. Seems like a great way to learn firsthand.
A look at one organization's use of Apache Parquet (to replace JSON) with Presto, which resulted in significant speedups. The post includes example code for producing Parquet in Python as well as some tips for getting the Python library to produce a data file that works well with Presto.
Stein is a Node.js application that presents a RESTful API based on a dataset stored in a Google Sheet. For simple projects, this could be a straightforward way to stand up a prototype.
Curated by Datadog ( http://www.datadog.com )
Overview of Microsoft Data Platform Offerings (San Francisco) - Monday, June 17
Data Integration with Kafka: What, Why, How (Irvine) - Wednesday, June 19
WeWork's Use of Apache Kafka On-Prem + Rebalance Protocol Inside-Out (San Francisco) - Thursday, June 20
Chick-Fil-A Spark Use Case for Data & Analytics (Roswell) - Thursday, June 20
Implementing Distributed Tracing Like a Boss in Your Apache Kafka Deployments (Reston) - Tuesday, June 18
Spark in Kubernetes (Toronto) - Monday, June 17
5th Data Engineering Meetup (Belo Horizonte) - Tuesday, June 18
Apache Cassandra & Apache Kafka Workshop (London) - Thursday, June 20
From Zero to Hero with Kafka Connect (Oslo) - Tuesday, June 18
ParisDataEng: Stream Data Processing (Paris) - Thursday, June 20
Apache Cassandra & Apache Kafka Workshop (Berlin) - Tuesday, June 18
Introduction to Knative + Kafka on Kubernetes (Berlin) - Tuesday, June 18
Topic Management at Scale (Amsterdam) - Tuesday, June 18
Kafka and Its Friends (Bern) - Thursday, June 20
Getting Started with Spark & Cassandra + Using pySpark with Google Colab (Milano) - Wednesday, June 19
Big Data v 4.0 (Warszawa) - Tuesday, June 18
Kafka Meetup #9 (Bengaluru) - Monday, June 17
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.