Data Eng Weekly Issue #314

16 June 2019

A couple of interesting new tools, online course materials for distributed systems, and several posts on data formats (including a look at the performance of HDFS erasure coding). Also a look at the Apache Flink network stack, a new Jepsen analysis, and how Apache Kafka is used for astronomy. Should be something for everyone!

Technical

Kedro is a new data and development workflow framework that implements best practices for data pipelines with an eye towards productionizing machine learning models. Kedro has an interesting SDK for working with datasets, and it has an integration with PySpark. There are some great docs for the project including several tutorials.

https://medium.com/@QuantumBlack/introdqucing-kedro-the-open-source-library-for-production-ready-machine-learning-code-d1c6d26ce2cf

The Apache Flink blog has a thorough design overview of the Netty and Akka-based network stack. For Flink 1.5, they've implemented credit-based flow control for multiplexed channels. The architecture includes some clever backpressure handling to keep latency low while providing high throughput.

https://flink.apache.org/2019/06/05/flink-network-stack.html
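
As a rough illustration of the credit-based idea (a toy model, not Flink's actual classes or API), the sketch below assumes the receiver grants one credit per free buffer and the sender only transmits while it holds credit. The names Receiver, Sender, grant_credits, and send_to are hypothetical.

```python
from collections import deque


class Receiver:
    """Toy receiver that owns a fixed number of buffers (hypothetical model)."""

    def __init__(self, num_buffers):
        self.free_buffers = num_buffers
        self.queue = deque()

    def grant_credits(self):
        # Announce every free buffer as one credit and reserve it for the sender.
        credits = self.free_buffers
        self.free_buffers = 0
        return credits

    def accept(self, record):
        # A record arrives into a buffer that was reserved by an earlier credit.
        self.queue.append(record)

    def consume(self):
        # Processing a record frees its buffer, which can be granted again.
        record = self.queue.popleft()
        self.free_buffers += 1
        return record


class Sender:
    """Toy sender that may only send while it holds credit."""

    def __init__(self, records):
        self.backlog = deque(records)
        self.credits = 0

    def add_credits(self, n):
        self.credits += n

    def send_to(self, receiver):
        # Send until either the credit or the backlog runs out.
        sent = 0
        while self.credits > 0 and self.backlog:
            receiver.accept(self.backlog.popleft())
            self.credits -= 1
            sent += 1
        return sent


receiver = Receiver(num_buffers=2)
sender = Sender(range(5))

sender.add_credits(receiver.grant_credits())
first = sender.send_to(receiver)     # limited by the two granted credits
stalled = sender.send_to(receiver)   # no credit left, so nothing is sent

receiver.consume()                   # frees one buffer
sender.add_credits(receiver.grant_credits())
second = sender.send_to(receiver)    # one new credit, one more record
```

The point of the model is that a slow receiver never has data pushed at it without a buffer waiting, so backpressure shows up at the sender as a growing backlog rather than as a blocked shared connection.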

Debezium has started a community newsletter highlighting articles, releases, and Q&A related to their change data capture system. Lots of good stuff to read in here.

https://debezium.io/blog/2019/06/05/debezium-newsletter-01-2019/

The Cloudera blog has a post on HDFS erasure coding, comparing performance using various benchmarks (e.g. terasort and word count) between data stored using replication and erasure coding. They also describe failure recovery, which takes a bit longer for erasure coding. The post also covers several factors that can impact performance like data locality, block size, and Intel's hardware acceleration.

https://blog.cloudera.com/blog/2019/06/hdfs-erasure-coding-in-production/
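
For a sense of the storage trade-off behind that comparison, here is a small sketch. It assumes 3x replication on one side and a Reed-Solomon layout with 6 data and 3 parity units on the other (the RS-6-3 policy in HDFS; the configurations benchmarked in the post may differ). The recovery demo uses a single XOR parity block, which is a deliberate simplification of Reed-Solomon, and the function names are illustrative only.

```python
from functools import reduce


def storage_overhead(data_units, parity_units):
    """Raw units stored per unit of user data."""
    return (data_units + parity_units) / data_units


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# 3x replication behaves like 1 data unit plus 2 extra copies.
replication = storage_overhead(1, 2)
# RS(6, 3): 6 data units protected by 3 parity units.
rs_6_3 = storage_overhead(6, 3)

# Simplified recovery: one XOR parity block can rebuild one lost data block.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)
rebuilt = xor_blocks([data[0], data[2], parity])  # pretend data[1] was lost
```

Rebuilding a block means reading the surviving blocks and recomputing, whereas replication only needs to copy a surviving replica, which is consistent with the longer recovery times the post reports for erasure coding.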

This article describes a really interesting astronomy use case for Apache Kafka. After analyzing images of the night sky, events about changing objects (at the rate of ~1 million per night) are published to Kafka for analysis downstream by researchers.

https://www.confluent.io/blog/streaming-data-from-the-universe-with-apache-kafka

Another excellent Jepsen post, this time on TiDB. There are a lot of interesting details about failure scenarios and edge cases, proving once again that distributed systems are complex. I particularly appreciated the discussion of consistency models that describes how TiDB's compares to MySQL's.

https://jepsen.io/analyses/tidb-2.1.7

PingCAP, the creators of TiDB, have an online self-directed course for writing distributed systems in Go and Rust. They provide interfaces and test cases for a number of coding exercises. Seems like a great way to learn firsthand.

https://github.com/pingcap/talent-plan

A look at one organization's use of Apache Parquet (to replace JSON) with Presto that resulted in significant speedups. The post includes example code for producing Parquet in Python as well as some tips for getting the Python library to produce a data file that works well with Presto.

https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6

Stein is a Node.js application that presents a RESTful API based on a dataset stored in a Google Sheet. For simple projects, this could be a straightforward way to stand up a prototype.

https://github.com/SteinHQ/Stein

The SSENSE blog has a look at the trade-offs between CSV, Parquet, and Avro data formats. The author shares some benchmarking results which led them to Avro for their use case. It may also be of interest to some folks that all the code samples in this post are JavaScript.

https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8

Events

Curated by Datadog ( http://www.datadog.com )

California

Overview of Microsoft Data Platform Offerings (San Francisco) - Monday, June 17
https://www.meetup.com/San-Francisco-Bay-Area-Microsoft-BI-User-Group/events/261263521/

Data Integration with Kafka: What, Why, How (Irvine) - Wednesday, June 19
https://www.meetup.com/Orange-County-Advanced-Analytics-Meetup/events/260293249/

WeWork's Use of Apache Kafka On-Prem + Rebalance Protocol Inside-Out (San Francisco) - Thursday, June 20
https://www.meetup.com/KafkaBayArea/events/261932534/

Georgia

Chick-Fil-A Spark Use Case for Data & Analytics (Roswell) - Thursday, June 20
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/262216373/

Virginia

Implementing Distributed Tracing Like a Boss in Your Apache Kafka Deployments (Reston) - Tuesday, June 18
https://www.meetup.com/Apache-Kafka-DC/events/261833376/

CANADA

Spark in Kubernetes (Toronto) - Monday, June 17
https://www.meetup.com/tordatascience/events/262137237/

BRAZIL

5th Data Engineering Meetup (Belo Horizonte) - Tuesday, June 18
https://www.meetup.com/engenharia-de-dados/events/262063862/

UNITED KINGDOM

Apache Cassandra & Apache Kafka Workshop (London) - Thursday, June 20
https://www.meetup.com/Open-Source-Cassandra/events/261731622/

NORWAY

From Zero to Hero with Kafka Connect (Oslo) - Tuesday, June 18
https://www.meetup.com/Oslo-Kafka/events/261436304/

FRANCE

ParisDataEng: Stream Data Processing (Paris) - Thursday, June 20
https://www.meetup.com/Paris-Data-Engineers/events/260694777/

GERMANY

Apache Cassandra & Apache Kafka Workshop (Berlin) - Tuesday, June 18
https://www.meetup.com/Distributed-Data-Berlin/events/261731573/

Introduction to Knative + Kafka on Kubernetes (Berlin) - Tuesday, June 18
https://www.meetup.com/jug-bb/events/261892827/

NETHERLANDS

Topic Management at Scale (Amsterdam) - Tuesday, June 18
https://www.meetup.com/Amsterdam-Kafka-Meetup/events/261436732/

SWITZERLAND

Kafka and Its Friends (Bern) - Thursday, June 20
https://www.meetup.com/Messaging-Streaming-Switzerland/events/260880224/

ITALY

Getting Started with Spark & Cassandra + Using pySpark with Google Colab (Milano) - Wednesday, June 19
https://www.meetup.com/Spark-More-Milano/events/262201072/

POLAND

Big Data v 4.0 (Warszawa) - Tuesday, June 18
https://www.meetup.com/Big-Data-Warsaw/events/261611518/

INDIA

Kafka Meetup #9 (Bengaluru) - Monday, June 17
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/261609965/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.