Data Eng Weekly


Data Eng Weekly Issue #307

31 March 2019

Over the years, the topics covered in this newsletter have grown from the Apache Hadoop ecosystem to Data Engineering and Distributed Systems to various adjacent topics. I've followed the content where it's interested me, and emails have gotten quite long trying to encompass all the relevant news, technical posts, and releases. So I'm trying something new with this week's issue: I'm slimming down the newsletter and choosing the best articles across all of those sections. Going forward, expect to see about 10 links each week. For everyone who has sent an article my way, thanks! Please continue doing so, but don't take it personally if I don't end up including something you pass along.

Technical

Amazon Web Services published two papers in the past couple of years about their hosted database, Amazon Aurora. The Morning Paper summarized some of the key topics covered in those papers, including how Aurora decouples its storage tier, how the design considers Availability Zones for quorum writes, the log system maintained by the storage tier, and how all of these achieve fault tolerance and high throughput. While something like Aurora might be a black box to an AWS user, it's really interesting to see how the folks at AWS design a distributed system to thrive in AWS.

https://blog.acolyer.org/2019/03/25/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases/
https://blog.acolyer.org/2019/03/27/amazon-aurora-on-avoiding-distributed-consensus-for-i-os-commits-and-membership-changes/
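
The quorum design mentioned above can be illustrated with a little arithmetic. This is a minimal sketch, assuming the configuration described in the Aurora paper (six copies of the data, two in each of three Availability Zones, with a write quorum of 4 and a read quorum of 3). Those numbers and every name below are assumptions brought in for illustration, not something stated in this newsletter, and the code is not AWS's implementation.

```python
# Assumed configuration, taken from the Aurora paper rather than this newsletter.
AVAILABILITY_ZONES = ["az-a", "az-b", "az-c"]  # hypothetical zone names
COPIES_PER_AZ = 2
TOTAL_COPIES = len(AVAILABILITY_ZONES) * COPIES_PER_AZ
WRITE_QUORUM = 4
READ_QUORUM = 3


def copies_alive(failed_azs=0, extra_failed_copies=0):
    """Count surviving copies after losing whole AZs plus individual copies."""
    return TOTAL_COPIES - failed_azs * COPIES_PER_AZ - extra_failed_copies


def can_write(alive):
    """A write succeeds only if a write quorum of copies can acknowledge it."""
    return alive >= WRITE_QUORUM


def can_read(alive):
    """A read succeeds only if a read quorum of copies can respond."""
    return alive >= READ_QUORUM


def quorums_are_consistent(total, write_quorum, read_quorum):
    """Every read set overlaps every write set, and any two write sets overlap."""
    return read_quorum + write_quorum > total and 2 * write_quorum > total


if __name__ == "__main__":
    for failed_azs, extra in [(0, 0), (1, 0), (1, 1)]:
        alive = copies_alive(failed_azs, extra)
        print(failed_azs, extra, alive, can_write(alive), can_read(alive))
```

Under these assumptions, losing a whole Availability Zone still leaves enough copies to write, and losing a zone plus one more copy still leaves enough to read.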

There's a common tension in many organizations between standardization and autonomy. At Netflix, they have lots of teams using lots of different tools, which means there isn't much standardization to take advantage of when building a data lineage system. Netflix writes about how they pull metadata from many different sources (including Lipstick to gather information about Apache Pig scripts, the Meson scheduler API, and Amazon S3 access logs) to reconstruct lineage and enrichment data about their data pipeline. The metadata ultimately lands in a graph database, which is fronted by a REST API.

https://medium.com/netflix-techblog/building-and-scaling-data-lineage-at-netflix-to-improve-data-infrastructure-reliability-and-1a52526a7977?source=rss----2615bd06b42e---4
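
To make the idea of reconstructing lineage from scattered metadata concrete, here is a minimal sketch. It assumes each source can be reduced to records that name a job with its input and output datasets. The record shape, job names, and dataset paths are all hypothetical, and an in-memory dictionary stands in for the graph database; this is not how Netflix's system is implemented.

```python
from collections import defaultdict

# Hypothetical, already-normalized metadata records from different sources.
RECORDS = [
    {"source": "pig_script", "job": "sessionize",
     "inputs": ["s3://raw/events"], "outputs": ["s3://clean/sessions"]},
    {"source": "scheduler", "job": "daily_rollup",
     "inputs": ["s3://clean/sessions", "s3://ref/titles"],
     "outputs": ["s3://agg/daily"]},
]


def build_upstream_index(records):
    """Map each output dataset to the set of datasets it was derived from."""
    upstream = defaultdict(set)
    for record in records:
        for output in record["outputs"]:
            upstream[output].update(record["inputs"])
    return upstream


def all_upstream(dataset, upstream):
    """Return every dataset reachable by walking lineage edges backwards."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    index = build_upstream_index(RECORDS)
    print(sorted(all_upstream("s3://agg/daily", index)))
```

A query like this is the kind of question a lineage service answers: if a raw dataset is late or wrong, which downstream tables are affected, and vice versa.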

Cadence is a workflow and orchestration tool with a Go SDK. This tutorial shows how to author your first workflow with Cadence. Workflow code differs slightly from ordinary Go, for example by using the SDK's routines and channels rather than the Go-native versions. Cadence uses Cassandra for storage, has a web UI for visualizing workflows, and has a server/worker architecture that integrates well with Kubernetes and similar tools.

https://banzaicloud.com/blog/introduction-to-cadence/

Version 0.9.3 of Debezium, the change data capture tool for databases, was released. While it's a minor release, it includes an implementation of an EventRouter for the outbox pattern. If you aren't familiar with the outbox pattern, there's a link to a blog post about it in the release notes. It's a clever design pattern for publishing events about your system that seems quite promising.

https://debezium.io/blog/2019/03/26/debezium-0-9-3-final-released/
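
For readers new to the outbox pattern, the core idea is that the application writes its business change and a description of the resulting event in the same database transaction, and a change data capture tool later publishes the rows from the outbox table. The following is a minimal sketch of the write side only, using SQLite from the Python standard library. The table names, column names, and event type are hypothetical choices for illustration and are not Debezium's required schema.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    """Create a business table and an outbox table (illustrative schema)."""
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id TEXT PRIMARY KEY, aggregate_type TEXT, aggregate_id TEXT, "
        "event_type TEXT, payload TEXT)"
    )


def place_order(conn, item):
    """Insert the order and its event together, or neither of them."""
    order_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "item": item})
    # The connection context manager commits on success and rolls back
    # if any statement inside the block raises.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), "order", order_id, "OrderPlaced", payload),
        )
    return order_id


def count_rows(conn, table):
    """Count rows in one of the two known tables."""
    if table not in ("orders", "outbox"):
        raise ValueError("unknown table")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    place_order(connection, "keyboard")
    print(count_rows(connection, "orders"), count_rows(connection, "outbox"))
```

Because both inserts share one transaction, there is never an order without its event or an event without its order, which is the property that makes the pattern attractive compared with writing to the database and a message broker separately.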

Distributed tracing tools offer great support for request tracing in HTTP or RPC services. But you may not realize they can also be used for tracing outside of the request path for systems like Apache Kafka. This post describes how to use Kafka's interceptors to add tracing data to Kafka Connect, KSQL, and other Kafka producers/consumers.

https://www.confluent.io/blog/importance-of-distributed-tracing-for-apache-kafka-based-applications

Apache Kafka 2.2 was released this week. There are a lot of new features in the release, which are summarized in this video+screencast. I really enjoyed this one (it's a great length at under 15 minutes), and I'm not normally a video person.

https://www.youtube.com/watch?v=kaWbp1Cnfo4

Dream11 (a fantasy sports website in India) tells the story of scaling their real-time fraud engine. They started with a relational database, and they introduced change-data-capture using Debezium to send data to Apache Cassandra. Their post describes their requirements (e.g. they have large bursts during the World Cup and other sporting events) and what other tools they use and have evaluated.

https://medium.com/@D11Engg/data-is-the-new-oil-mine-it-well-to-tap-your-customers-in-real-time-e01803c321b0

CDNs are easy to take for granted these days. If you're not familiar (or need a refresher), then this is a great intro to how they work. This implementation uses Route53's latency-based routing (which sends a request to the region with the lowest latency, usually the closest one), Kubernetes' ExternalDNS, and Terraform to spin up Kubernetes clusters in multiple AWS regions.

https://blog.insightdatascience.com/how-to-build-your-own-cdn-with-kubernetes-5cab00d5c258

I've often heard talk of mirroring data from a production database to a lower environment for testing, but it's often hard to follow through on that. This post describes how to use Google Cloud Functions with Cloud Scheduler to run a script that mirrors tables across BigQuery datasets, which makes that goal achievable.

https://medium.com/@phil.goerdt/dynamically-duplicating-a-bigquery-datasets-tables-9a1c7ae299f9

Events

Curated by Datadog ( http://www.datadog.com )

California

Apache Druid Bay Area Meetup @ Unity Technologies (San Francisco) - Tuesday, April 2
https://www.meetup.com/druidio/events/257291633/

Washington

Exactly Once Processing with Flink and Pravega (Seattle) - Thursday, April 4
https://www.meetup.com/seattle-flink/events/260129217/

Arizona

Building Zero Data Loss Pipelines with Apache Kafka (Tempe) - Wednesday, April 3
https://www.meetup.com/Phoenix-Hadoop-User-Group/events/258763503/

Oklahoma

Big Data Tools: DataBricks (Oklahoma City) - Thursday, April 4
https://www.meetup.com/Big-Data-in-Oklahoma-City/events/258171327/

Missouri

Seasons of Streaming (Saint Louis) - Wednesday, April 3
https://www.meetup.com/St-Louis-Big-Data-IDEA/events/257350381/

Illinois

Trends Affecting System Architectures: Streaming Data, Machine Learning, Etc. (Chicago) - Tuesday, April 2
https://www.meetup.com/goto-nights-chicago/events/259429841/

Operationalizing Kafka in the Cloud, Benchmarks, and SpotHero's Uses (Chicago) - Thursday, April 4
https://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/259766741/

New York

Apache Kafka and Kubernetes Day (New York) - Wednesday, April 3
https://www.meetup.com/Apache-Kafka-NYC/events/259594388/

Is Apache Kafka and IBM Event Streams Right for Your Event-Streaming Needs? (New York) - Wednesday, April 3
https://www.meetup.com/ibmcodenyc/events/260118929/

Stream Processing Beyond Streaming with Apache Flink, by Co-Creator of Flink (New York) - Thursday, April 4
https://www.meetup.com/mysqlnyc/events/259567701/

CANADA

Introduction to Apache Spark Hands-On (Toronto) - Tuesday, April 2
https://www.meetup.com/tordatascience/events/260164077/

UNITED KINGDOM

Apache Beam Meetup 6: Detecting Fraud @ Revolut & Tech Deep Dive (London) - Wednesday, April 3
https://www.meetup.com/London-Apache-Beam-Meetup/events/259496271/

NETHERLANDS

Benchmarking Stream Processing Frameworks and Parquet Optimizations (Amsterdam) - Wednesday, April 3
https://www.meetup.com/Amsterdam-Spark/events/258923652/

GERMANY

Development of Event-Driven Systems in Apache Kafka (Frankfurt) - Tuesday, April 2
https://www.meetup.com/ObjektForum-Frankfurt/events/259466905/

Data Reliability Challenges with Apache Spark (Munich) - Wednesday, April 3
https://www.meetup.com/Hadoop-User-Group-Munich/events/259499217/

ITALY

Spark Structured Streaming + HBase Peers Part 2: CDC on Apache Kafka (Bologna) - Tuesday, April 2
https://www.meetup.com/Bologna-Big-Data-Meetup/events/258618358/

AUSTRIA

Microservices, Kafka Streams, and KafkaEsque (Wien) - Tuesday, April 2
https://www.meetup.com/Vienna-Kafka-meetup/events/259403482/

CZECH REPUBLIC

Zonky Kafka Meetup (Praha) - Tuesday, April 2
https://www.meetup.com/Zonky-DevOps/events/259262178/

ISRAEL

AWS Big Data Architecture Lessons Learned 1.2 (Tel Aviv-Yafo) - Monday, April 1
https://www.meetup.com/AWS-Big-Data-Demystified/events/258933220/

Kafka Streams Applications (Tel Aviv-Yafo) - Wednesday, April 3
https://www.meetup.com/ApacheKafkaTLV/events/259936651/

AUSTRALIA

Sydney Data Engineering Meetup (Sydney) - Thursday, April 4
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/259575677/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.