Data Eng Weekly


Data Eng Weekly Issue #305

17 March 2019

Topics this week include OpenTracing, structured logging, scaling Apache Airflow, dbt for ETL, joins in Apache Spark, and the new Quarkus framework. Quite a lot of variety, so hopefully there's something for everyone!

Technical

Sematext has a five-part series on OpenTracing, including basics and terminology (like Spans and Baggage) and an overview of two implementations: Jaeger and Zipkin.

https://sematext.com/blog/opentracing-distributed-tracing-emerging-industry-standard/

This post provides an introduction to the features of dbt, a tool for efficiently building and executing ETLs. One of the more exciting features is a mechanism for testing schema and data for SQL queries.

http://tamaszilagyi.com/blog/2019/2019-03-05-dbt/

Grab writes about the advantages of structured logging (including better root cause analysis, better observability, and better standardization), and the libraries and tools they're using for their structured logging stack. Among them, they have a fixed schema for their structured log records (with a library that can code generate an API for producing the messages).

https://engineering.grab.com/structured-logging
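
To make the fixed-schema idea concrete, here is a minimal Python sketch. It is not Grab's library or schema: the `RideEvent` record, its fields, and the `to_log_line` helper are hypothetical names chosen for illustration. The point is only that every record of a given type is built through one typed constructor and serialized the same way.

```python
import json
import logging
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RideEvent:
    """Hypothetical fixed schema for one kind of log record."""
    event: str
    ride_id: str
    duration_ms: int


def to_log_line(record: RideEvent) -> str:
    # Sorted keys give every line the same field order, which keeps
    # downstream parsing and diffing predictable.
    return json.dumps(asdict(record), sort_keys=True)


logging.basicConfig(level=logging.INFO, format="%(message)s")
logging.getLogger("rides").info(
    to_log_line(RideEvent(event="ride_completed", ride_id="r-123", duration_ms=5400))
)
```

Because callers can only log through the dataclass, a missing or misspelled field fails when the record is constructed rather than surfacing later in a dashboard query.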

A good list of design principles for building systems using an event driven architecture. Several principles describe ways to be defensive in producing data and others focus on designing for backwards compatibility.

https://rjzaworski.com/2019/03/7-commandments-for-event-driven-architecture

Astronomer has a discussion of the trade-offs of running a single multi-tenant Airflow deployment vs. running multiple deployments that are divided up by functional area. In their experience, running multiple deployments is typically the way to go.

https://www.astronomer.io/blog/airflow-infrastructure/

The Apache Flink blog has a post describing how to use Prometheus to monitor Flink jobs.

https://flink.apache.org/features/2019/03/11/prometheus-monitoring.html

This tutorial shows how to mirror data from Google BigQuery to Amazon S3 and import it into AWS Athena. The post breaks it down into 6 steps, and the last 4 (which cover extracting Apache Avro schemas and creating Apache Hive tables) are useful even if you're working with Avro data in S3 without using BigQuery at all.

https://medium.freecodecamp.org/how-to-import-google-bigquery-tables-to-aws-athena-5da842a13539

Hue has a neat new feature for visualizing and debugging Impala queries. It shows the execution DAG and a detail pane provides more information about what is happening at each step.

http://gethue.com/self-service-impala-sql-query-troubleshooting/

This post describes the various types of join in Apache Spark (e.g. hash join, merge join), when each is used, and some configuration that can be adjusted to optimize join performance.

https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c
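
For readers new to the two strategies named above, the following plain-Python sketch shows the core idea of each on small in-memory lists. It is an illustration only and not Spark's implementation; the function names and the `(key, payload)` row shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (join key, payload)
Joined = Tuple[int, str, str]


def hash_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Build a hash table on the right side, then probe it with the left."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in table.get(k, [])]


def sort_merge_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Sort both sides by key, then advance two cursors in step."""
    left, right = sorted(left), sorted(right)
    out: List[Joined] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out
```

The hash join needs one side to fit in memory as a table, while the sort-merge join pays for sorting up front but then streams through both inputs, which is roughly the trade-off that join-related tuning tries to balance.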

This article covers how Apache Kafka Connect handles and surfaces errors, including by writing messages to a dead letter queue (optionally with Kafka message headers containing details about the error) for reprocessing.

https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
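
The general pattern can be sketched in a few lines of Python. This is an in-memory illustration and not the Kafka Connect API: `process_with_dlq` and the `error.class` / `error.message` header names are hypothetical, and a plain list stands in for the dead letter topic.

```python
import json
from typing import Callable, Dict, List, Tuple

# A dead-lettered message is (headers, raw_value); the headers carry
# context about why processing failed.
DeadLetter = Tuple[Dict[str, str], bytes]


def process_with_dlq(
    messages: List[bytes],
    handler: Callable[[bytes], dict],
) -> Tuple[List[dict], List[DeadLetter]]:
    """Apply handler to each message; divert failures instead of stopping."""
    processed: List[dict] = []
    dead_letters: List[DeadLetter] = []
    for raw in messages:
        try:
            processed.append(handler(raw))
        except Exception as exc:  # broad on purpose: any failure is diverted
            headers = {
                "error.class": type(exc).__name__,
                "error.message": str(exc),
            }
            # Keep the original bytes untouched so the message can be
            # inspected and reprocessed later.
            dead_letters.append((headers, raw))
    return processed, dead_letters


ok, dlq = process_with_dlq([b'{"id": 1}', b"not json"], json.loads)
```

Here the valid record ends up in `ok`, while the malformed one lands in `dlq` together with headers describing the failure, so the pipeline keeps running and nothing is silently dropped.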

The Debezium blog has a post on writing applications to consume data from Apache Kafka using Quarkus, which is a new framework for building Kubernetes-native applications (including, using GraalVM, native binaries with very fast startup times). Quarkus' Kafka integration uses a reactive messaging API from the MicroProfile framework, which the code samples demonstrate.

https://debezium.io/blog/2019/03/14/debezium-meets-quarkus/

The Alphahealth team has an overview of validating, evolving, and anonymizing JSON data. The post references a number of tools, including JSON Schema, JSON Schema Validator, JSLT Expressions (for transforming data between two schemas), and more. There's some example code for each of these and discussion of how they deploy the code with Lambda.

https://medium.com/alphahealth/the-datum-vea-validate-evolve-and-anonymize-your-data-with-data-schemas-df494f74e16c

Jobs

Senior Data Engineer (Spark), N26, Berlin https://jobs.dataengweekly.com/jobs/c202d274-c3df-4274-9ffb-77a749db5c3f

Software Engineer - Data Platform, Fitbit, San Francisco, CA https://jobs.dataengweekly.com/jobs/ab98f336-41d9-455f-b0a7-f9d5975e5975

News

DataOps Summit is a new conference from the StreamSets team that will take place in September. The Call For Speakers is open until April 5th.

https://streamsets.com/blog/join-us-at-the-first-annual-dataops-summit/

Releases

Version 1.9.0 of Apache Kudu was released. It includes location awareness, a number of new CLI tools, Docker scripts, and improved testing utilities. It also has a number of optimizations, improvements, and fixes.

https://kudu.apache.org/releases/1.9.0/docs/release_notes.html

YugaByte DB, the distributed SQL system, released version 1.2.0. It has several new features for SQL and transaction support, new datatypes, and more.

https://github.com/YugaByte/yugabyte-db/releases/tag/v1.2.0

Astronomer announced version 0.8.0 of their platform, which includes Apache Airflow 1.10.2 and a number of updates to their Kubernetes-based platform.

https://www.astronomer.io/blog/astronomer-v0-8-0-release/

Events

Curated by Datadog ( http://www.datadog.com )

California

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Wednesday, March 20
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/259437388/

Druid & Kafka: Swimming in the Data River, Or, When “Streaming Analytics” Isn’t (San Francisco) - Wednesday, March 20
https://www.meetup.com/San-Francisco-Bay-Area-Big-Data-and-Scalable-Systems/events/259177099/

Oregon

Deploying Kafka Streams Applications with Docker and Kubernetes (Portland) - Wednesday, March 20
https://www.meetup.com/PDXJUG/events/258110907/

Washington

Confluent and Nordstrom Discuss Events, Logs, and Microservices! (Seattle) - Tuesday, March 19
https://www.meetup.com/seattle-event-driven/events/258339551/

IRELAND

Data Engineering Workshop (Dublin) - Saturday, March 23
https://www.meetup.com/Data-Science-and-Engineering-Club/events/259156071/

SPAIN

How to Contribute to Spark 3.0? With Holden Karau (Barcelona) - Tuesday, March 19
https://www.meetup.com/Spark-Barcelona/events/259091020/

FRANCE

Kafka, the Power of Events and Unbounded Data (Paris) - Tuesday, March 19
https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/259416039/

Kafka Night (Toulouse) - Tuesday, March 19
https://www.meetup.com/Toulouse-Java-User-Group/events/258895990/

NETHERLANDS

A Spring with Stream Processing and Apache Flink (Utrecht) - Wednesday, March 20
https://www.meetup.com/ITNEXT/events/258261295/

Apache Kafka, KSQL, Demos & Booking.com (Amsterdam) - Thursday, March 21
https://www.meetup.com/Amsterdam-Kafka-Meetup/events/259342509/

GERMANY

Flink @ Workday + Image Processing with Flink (Munich) - Thursday, March 21
https://www.meetup.com/Apache-Flink-Meetup-Munich/events/259298662/

Open Source Infrastructure (Berlin) - Thursday, March 21
https://www.meetup.com/Zalando-Tech-Events-Berlin/events/259356839/

POLAND

SQL Over Hadoop: Designing, Developing, and Supporting (Wrocław) - Tuesday, March 19
https://www.meetup.com/25th-Level-Code-Wroclaw/events/259559254/

INDIA

Kafka and Stream Processing at Gojek (Bengaluru) - Saturday, March 23
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/259119994/

SOUTH AFRICA

Apache Kafka on AWS & Serverless: State of the Union (Johannesburg) - Monday, March 18
https://www.meetup.com/AWS-JOZI/events/259463041/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.