16 June 2019
A couple of interesting new tools, online course materials for distributed systems, and several posts on data formats (including a look at the performance of HDFS erasure coding). Also a look at the Apache Flink network stack, a new Jepsen analysis, and how Apache Kafka is used for astronomy. Should be something for everyone!
Kedro is a new data and development workflow framework that implements best practices for data pipelines with an eye towards productionizing machine learning models. Kedro has an interesting SDK for working with datasets, and it has an integration with PySpark. There are some great docs for the project including several tutorials.
The Apache Flink blog has a thorough design overview of the Netty and Akka-based network stack. For Flink 1.5, they've implemented credit-based flow control for multiplexed channels. The architecture includes some clever backpressure handling to keep latency low while providing high throughput.
https://flink.apache.org/2019/06/05/flink-network-stack.html
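To make the credit-based flow control idea mentioned above a bit more concrete, here is a toy sketch in Python. It is not Flink's implementation: the `Receiver`, `Sender`, and `simulate` names are hypothetical, and the model assumes one credit per free receiver buffer and a single sender. The point it illustrates is that the sender only transmits when it holds credit, so in-flight data never exceeds what the receiver has room for.

```python
from collections import deque


class Receiver:
    """Toy receiver with a fixed pool of buffers (hypothetical model)."""

    def __init__(self, num_buffers):
        self.free_buffers = num_buffers
        self.queue = deque()

    def grant_credits(self):
        # Announce every currently free buffer as a credit and reserve it.
        credits = self.free_buffers
        self.free_buffers = 0
        return credits

    def receive(self, record):
        # Each arriving record lands in a buffer reserved by a credit.
        self.queue.append(record)

    def consume(self, n):
        # Process up to n queued records, freeing one buffer per record.
        done = 0
        while self.queue and done < n:
            self.queue.popleft()
            self.free_buffers += 1
            done += 1
        return done


class Sender:
    """Toy sender that may only send while it holds credits."""

    def __init__(self, records):
        self.backlog = deque(records)
        self.credits = 0

    def add_credits(self, n):
        self.credits += n

    def send(self, receiver):
        sent = 0
        while self.backlog and self.credits > 0:
            receiver.receive(self.backlog.popleft())
            self.credits -= 1
            sent += 1
        return sent


def simulate(num_records=10, num_buffers=4, consume_per_round=2):
    """Run rounds of grant/send/consume; return (rounds, max queued records)."""
    receiver = Receiver(num_buffers)
    sender = Sender(range(num_records))
    max_in_flight = 0
    rounds = 0
    while sender.backlog or receiver.queue:
        sender.add_credits(receiver.grant_credits())
        sender.send(receiver)
        max_in_flight = max(max_in_flight, len(receiver.queue))
        receiver.consume(consume_per_round)
        rounds += 1
    return rounds, max_in_flight
```

With a slow consumer (a small `consume_per_round`), the sender's backlog grows while the receiver's queue stays bounded by `num_buffers`, which is the backpressure effect in miniature.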
Debezium has started a community newsletter highlighting articles, releases, and Q&A related to their change data capture system. Lots of good stuff to read in here.
https://debezium.io/blog/2019/06/05/debezium-newsletter-01-2019/
The Cloudera blog has a post on HDFS erasure coding, comparing performance using various benchmarks (e.g. terasort and word count) between data stored using replication and erasure coding. They also describe failure recovery, which takes a bit longer for erasure coding. The post also covers several factors that can impact performance like data locality, block size, and Intel's hardware acceleration.
https://blog.cloudera.com/blog/2019/06/hdfs-erasure-coding-in-production/
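For a rough sense of why erasure coding is attractive compared to replication, the storage arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the Reed-Solomon layout with 6 data and 3 parity units is an assumption chosen for illustration (it is a commonly used HDFS policy); the blurb above does not say which policies the post benchmarks.

```python
def replication_overhead(replicas):
    """Raw bytes stored per byte of user data, and how many copies can be lost."""
    return {
        "storage_factor": float(replicas),
        "tolerated_failures": replicas - 1,
    }


def erasure_coding_overhead(data_units, parity_units):
    """Same metrics for a Reed-Solomon style (data, parity) striping layout."""
    return {
        "storage_factor": (data_units + parity_units) / data_units,
        "tolerated_failures": parity_units,
    }


# 3x replication versus an assumed RS(6, 3) layout.
three_way = replication_overhead(3)
rs_6_3 = erasure_coding_overhead(6, 3)
print(three_way)  # storage_factor 3.0, tolerated_failures 2
print(rs_6_3)     # storage_factor 1.5, tolerated_failures 3
```

Under these assumptions the erasure-coded layout stores half as many raw bytes while tolerating one more lost unit. The trade-off is that reads and recovery may need to gather data from several nodes, which is consistent with the slower failure recovery noted above.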
This article describes a really interesting astronomy use case for Apache Kafka. After analyzing images of the night sky, events about changing objects (at the rate of ~1 million per night) are published to Kafka for analysis downstream by researchers.
https://www.confluent.io/blog/streaming-data-from-the-universe-with-apache-kafka
Another excellent Jepsen post, this time on TiDB. There are a lot of interesting details about failure scenarios and edge cases, proving once again that distributed systems are complex. I particularly appreciated the discussion of consistency models that describes how TiDB's compares to MySQL's.
https://jepsen.io/analyses/tidb-2.1.7
PingCAP, the creators of TiDB, have an online self-directed course for writing distributed systems in Go and Rust. They provide interfaces and test cases for a number of coding exercises. Seems like a great way to learn firsthand.
https://github.com/pingcap/talent-plan
A look at one organization's use of Apache Parquet (to replace JSON) with Presto that resulted in significant speedups. The post includes example code for producing Parquet in Python as well as some tips for getting the Python library to produce a data file that works well with Presto.
https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6
Stein is a Node.js application that presents a RESTful API based on a dataset stored in a Google Sheet. For simple projects, this could be a straightforward way to stand up a prototype.
https://github.com/SteinHQ/Stein
The SSENSE blog has a look at the trade-offs between CSV, Parquet, and Avro data formats. The author shares some benchmarking results which led them to Avro for their use case. It may also be of interest to some folks that all the code samples in this post are JavaScript.
Curated by Datadog ( http://www.datadog.com )
Overview of Microsoft Data Platform Offerings (San Francisco) - Monday, June 17
https://www.meetup.com/San-Francisco-Bay-Area-Microsoft-BI-User-Group/events/261263521/
Data Integration with Kafka: What, Why, How (Irvine) - Wednesday, June 19
https://www.meetup.com/Orange-County-Advanced-Analytics-Meetup/events/260293249/
WeWork's Use of Apache Kafka On-Prem + Rebalance Protocol Inside-Out (San Francisco) - Thursday, June 20
https://www.meetup.com/KafkaBayArea/events/261932534/
Chick-Fil-A Spark Use Case for Data & Analytics (Roswell) - Thursday, June 20
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/262216373/
Implementing Distributed Tracing Like a Boss in Your Apache Kafka Deployments (Reston) - Tuesday, June 18
https://www.meetup.com/Apache-Kafka-DC/events/261833376/
Spark in Kubernetes (Toronto) - Monday, June 17
https://www.meetup.com/tordatascience/events/262137237/
5th Data Engineering Meetup (Belo Horizonte) - Tuesday, June 18
https://www.meetup.com/engenharia-de-dados/events/262063862/
Apache Cassandra & Apache Kafka Workshop (London) - Thursday, June 20
https://www.meetup.com/Open-Source-Cassandra/events/261731622/
From Zero to Hero with Kafka Connect (Oslo) - Tuesday, June 18
https://www.meetup.com/Oslo-Kafka/events/261436304/
ParisDataEng: Stream Data Processing (Paris) - Thursday, June 20
https://www.meetup.com/Paris-Data-Engineers/events/260694777/
Apache Cassandra & Apache Kafka Workshop (Berlin) - Tuesday, June 18
https://www.meetup.com/Distributed-Data-Berlin/events/261731573/
Introduction to Knative + Kafka on Kubernetes (Berlin) - Tuesday, June 18
https://www.meetup.com/jug-bb/events/261892827/
Topic Management at Scale (Amsterdam) - Tuesday, June 18
https://www.meetup.com/Amsterdam-Kafka-Meetup/events/261436732/
Kafka and Its Friends (Bern) - Thursday, June 20
https://www.meetup.com/Messaging-Streaming-Switzerland/events/260880224/
Getting Started with Spark & Cassandra + Using pySpark with Google Colab (Milano) - Wednesday, June 19
https://www.meetup.com/Spark-More-Milano/events/262201072/
Big Data v 4.0 (Warszawa) - Tuesday, June 18
https://www.meetup.com/Big-Data-Warsaw/events/261611518/
Kafka Meetup #9 (Bengaluru) - Monday, June 17
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/261609965/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.