Data Eng Weekly


Data Eng Weekly Issue #303

24 February 2019

Several posts this week on RDBMS performance analysis & optimization, change data capture, and systems built with Kafka. There's also a great post about implementing unified event logging and a new paper on distributed consensus. In releases, Flink, Kafka, Pulsar, and NiFi all announced new versions this week.

Technical

Deliveroo writes about how they've designed a Kafka data platform using Protobuf messages. The post describes how they strictly enforce schemas using tests written with the Protobuf Java API and at runtime with a clever configuration of producer authentication.

https://deliveroo.engineering/2019/02/05/improving-stream-data-quality-with-protobuf-schema-validation.html

Citus Data has a post on the pg_stat_statements extension for PostgreSQL, which records metrics about queries and exposes them for monitoring and analysis.

https://www.citusdata.com/blog/2019/02/08/the-most-useful-postgres-extension-pg-stat-statements/

If you've used a relational database long enough, you've probably been there: a query that used to be fast suddenly slows down. After identifying the root cause (the optimizer choosing the wrong index), this post describes how to perform table optimization and analysis in MySQL without the usual downtime for those operations.

https://medium.com/cleartax-engineering/running-mysql-optimize-without-downtime-cb29b35c0086

An introduction to and tutorial for running FoundationDB on AWS. There's also some example Python code for loading and querying the database.

https://tech.marksblogg.com/minimalist-guide-tutorial-foundationdb.html

Another database performance tuning post describes how to force MySQL to use an index at query time. The article has a good explanation of how the database optimizer decides whether to use an index... and why it makes the wrong decision in this example.

https://www.vividcortex.com/blog/using-force-index
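
MySQL's FORCE INDEX hint has a close analog in SQLite's INDEXED BY clause, which makes the idea easy to try locally. A minimal sketch (the table and index names are illustrative, and this is SQLite rather than MySQL, so the planner details differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, created_at TEXT);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    INSERT INTO orders VALUES (1, 42, '2019-02-01'), (2, 42, '2019-02-02'), (3, 7, '2019-02-03');
""")

# Pin the query to idx_orders_customer via SQLite's INDEXED BY clause;
# MySQL spells the same hint FORCE INDEX (idx_orders_customer).
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM orders INDEXED BY idx_orders_customer WHERE customer_id = 42"
).fetchall()
print(plan)  # the plan row names idx_orders_customer
```

Unlike a hint, INDEXED BY (and FORCE INDEX) makes it an error to drop the index out from under the query, which is worth keeping in mind before shipping it.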

A collection of notebooks documenting best practices for working with PySpark.

https://github.com/ericxiao251/spark-syntax

At first blush, SQL might seem powerful for simple queries but limited in more complex computations. This post describes a couple of situations where SQL can outperform a Python script: 1) using window functions and 2) parsing and manipulating JSON data (both rely on some advanced SQL features).

https://towardsdatascience.com/python-vs-sql-comparison-for-data-pipelines-8ca727b34032
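
The window-function case is easy to reproduce with SQLite's Python bindings (window functions need SQLite 3.25 or newer). This sketch, with a made-up schema, computes a per-user running total in one query instead of a Python group-and-accumulate loop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, ts INTEGER, amount REAL);
    INSERT INTO events VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.0), (1, 3, 2.0);
""")

# SUM(...) OVER (PARTITION BY ... ORDER BY ...) yields a running total
# per user without leaving the database.
rows = conn.execute("""
    SELECT user_id, ts,
           SUM(amount) OVER (PARTITION BY user_id ORDER BY ts) AS running_total
    FROM events
    ORDER BY user_id, ts
""").fetchall()
print(rows)  # [(1, 1, 10.0), (1, 2, 15.0), (1, 3, 17.0), (2, 1, 7.0)]
```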

A look at how Intercom uses DynamoDB Streams (a change data capture feature of Amazon DynamoDB) with a Lambda function to compute and store JSON PATCH data about historical changes to their data sets.

https://www.intercom.com/blog/how-we-used-dynamodb-streams/
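
The JSON PATCH idea can be sketched as a diff between a stream record's old and new images. This toy implementation handles only flat, top-level keys and is not Intercom's code:

```python
def json_patch(old, new):
    """Minimal RFC 6902-style diff for flat dicts (top-level keys only)."""
    ops = []
    for key in sorted(old.keys() - new.keys()):
        ops.append({"op": "remove", "path": f"/{key}"})
    for key in sorted(new.keys() - old.keys()):
        ops.append({"op": "add", "path": f"/{key}", "value": new[key]})
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            ops.append({"op": "replace", "path": f"/{key}", "value": new[key]})
    return ops

# "before" and "after" stand in for a DynamoDB Stream record's
# OldImage and NewImage.
before = {"plan": "free", "seats": 1}
after = {"plan": "pro", "seats": 1, "trial": False}
print(json_patch(before, after))
# [{'op': 'add', 'path': '/trial', 'value': False},
#  {'op': 'replace', 'path': '/plan', 'value': 'pro'}]
```

Storing these patch lists, rather than full snapshots, is what makes the history compact.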

This relatively easy-to-read paper describes a new, generalized solution to distributed consensus using write-once registers. If you've enjoyed Paxos and related papers, then this one is a must-read.

https://arxiv.org/abs/1902.06776

The Debezium blog has a post that discusses the tradeoffs of event log architectures (write to Kafka first vs. change data capture) and proposes the outbox pattern, which softens some of the tradeoffs. Briefly, the idea is to have an outbox table with a generic field in which you can write JSON or other event data as part of the same transaction as your DB operations. This table is mirrored to Kafka using a change data capture tool like Debezium.

https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
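
A minimal sketch of the outbox pattern using SQLite (the table and event names are illustrative): the business write and the event write share one transaction, and a CDC tool such as Debezium would then mirror the outbox table to Kafka.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    -- The outbox: a generic payload column that CDC mirrors to Kafka.
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         aggregate TEXT, event_type TEXT, payload TEXT);
""")

def place_order(order_id):
    # Business write and event write commit (or roll back) together,
    # so consumers never see an event without its DB change.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (aggregate, event_type, payload) VALUES (?, ?, ?)",
            ("order", "OrderPlaced", json.dumps({"order_id": order_id})),
        )

place_order(1)
print(conn.execute("SELECT event_type, payload FROM outbox").fetchall())
# [('OrderPlaced', '{"order_id": 1}')]
```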

The Last Pickle has some advice for distributing your keyspace when building a new Apache Cassandra cluster by specifying initial tokens for seed nodes. The post goes into detail on how to calculate optimal configuration and bring up a new cluster with the right settings.

http://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
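
The token math itself is simple: spread tokens evenly across Murmur3's range of -2^63 to 2^63 - 1. A small helper along those lines (the function name and round-robin assignment are illustrative; the post covers the full procedure):

```python
def initial_tokens(num_nodes, num_tokens=1):
    """Evenly spaced tokens over the Murmur3 range [-2**63, 2**63 - 1].

    Returns one token list per node, the kind of values you would put in
    cassandra.yaml's initial_token on seed nodes.
    """
    total = num_nodes * num_tokens
    step = 2**64 // total
    tokens = [-(2**63) + i * step for i in range(total)]
    # Deal tokens round-robin so each node's share is spread around the ring.
    return [tokens[node::num_nodes] for node in range(num_nodes)]

for node, toks in enumerate(initial_tokens(4)):
    print(f"node {node}: initial_token: {toks[0]}")
```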

Good description of the components of a unified event logging framework and some practical advice for standing up the main components. The post covers how one company does event validation against a schema, implements schema composition for Avro, and uses heartbeat events to detect data loss.

https://medium.com/rmtl-test-blog/unified-event-logging-3a4cc64fa1c6
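
The heartbeat idea boils down to gap detection over events emitted at a known interval. A toy sketch, not the post's implementation:

```python
def missing_heartbeats(received, expected_interval=60):
    """Given sorted heartbeat timestamps (seconds), return (prev, cur, lost)
    tuples for each gap where heartbeats never arrived -- a sign of data loss
    somewhere in the pipeline."""
    gaps = []
    for prev, cur in zip(received, received[1:]):
        lost = (cur - prev) // expected_interval - 1
        if lost > 0:
            gaps.append((prev, cur, lost))
    return gaps

# Heartbeats emitted once a minute; two never arrived between 120 and 300.
print(missing_heartbeats([0, 60, 120, 300, 360]))  # [(120, 300, 2)]
```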

You may have heard about the recent PostgreSQL fix to their fsync (data sync to disk) implementation that impacts several versions of the database. This post from Percona describes the problem, the fixes in recent updates, and several other details.

https://www.percona.com/blog/2019/02/22/postgresql-fsync-failure-fixed-minor-versions-released-feb-14-2019/

This three-part post describes an analytics project that consumes data from the Cosmos DB change feed using Azure Databricks. There are lots of moving pieces, and the post provides good details on how to implement something with these tools.

https://medium.com/microsoftazure/an-introduction-to-streaming-etl-on-azure-databricks-using-structured-streaming-databricks-16b369d77e34

A good introduction to testing Apache Airflow tasks and DAGs using pytest. The post covers things like mocking Airflow, mocking external dependencies (e.g. Postgres), and fixtures for Airflow DAGs. It also covers debugging tests using the Python debugger, pdb.

https://blog.godatadriven.com/testing-and-debugging-apache-airflow
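
The core mocking pattern can be shown without installing Airflow: inject a stand-in for a database hook into the task callable. The task function and hook method below are hypothetical, but the unittest.mock technique is the one applied to real operators:

```python
from unittest import mock

# A hypothetical callable that an Airflow PythonOperator might run.
# It takes its hook as an argument, which is what makes it testable.
def load_active_users(pg_hook):
    rows = pg_hook.get_records("SELECT id FROM users WHERE active")
    return [r[0] for r in rows]

def test_load_active_users():
    # Stand in for a PostgresHook so the test needs no database.
    fake_hook = mock.Mock()
    fake_hook.get_records.return_value = [(1,), (2,)]
    assert load_active_users(fake_hook) == [1, 2]
    fake_hook.get_records.assert_called_once()

test_load_active_users()
print("ok")
```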

This post walks through building a Kafka Connect source connector for an HTTP API. The connector is written in Scala, and the post covers the tradeoffs of that choice.

https://www.rittmanmead.com/blog/2019/02/kafka/

AWS has a post on the steps that EMR takes to minimize the impact of lost nodes during a Spark job, which can cause major delays in cases where intermediate data needs to be recalculated. Some of the actions, including using YARN decommissioning and enabling several Spark options, can also be used in other cases where you might need to decommission nodes in a Spark cluster with minimal impact.

https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr/

This tutorial covers feeding NGINX logs into Apache Kafka (using either kafkacat or the Python producer), and using Apache Flume to write data from Kafka to HDFS.

https://tech.marksblogg.com/minimalist-guide-tutorial-flume.html
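
Before producing to Kafka, each NGINX access-log line has to be parsed into fields. A dependency-free sketch of that step (the regex covers the common log format; the Kafka produce call is omitted to keep this runnable on its own):

```python
import re

# NGINX/Apache "common" access log format: remote host, timestamp,
# request line, status code, and bytes sent.
COMMON_LOG = re.compile(
    r'(?P<remote>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

line = '203.0.113.9 - - [24/Feb/2019:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
record = COMMON_LOG.match(line).groupdict()
print(record["path"], record["status"])  # /index.html 200
```

In a real pipeline the parsed dict would be serialized (e.g. as JSON) and sent to a Kafka topic, with Flume then draining that topic to HDFS.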

Apache NiFi can infer the schema of data it processes and convert records to Apache Avro. This walkthrough covers the two ways it does that—either at ingestion or while deserializing data.

https://medium.com/@abdelkrim.hadjidj/democratizing-nifi-record-processors-with-automatic-schemas-inference-4f2b2794c427

Jobs

Data Engineer - Python, Wooga, Berlin https://jobs.dataengweekly.com/jobs/63fbb5ea-1c49-463f-bda7-598a56a13831

Software Engineer, Value Platform, Nuna, Inc., San Francisco https://jobs.dataengweekly.com/jobs/4a88b4e0-3457-48da-977e-afa368cdd4f1

Data Engineer, Starship Technologies, Tallinn, Estonia https://jobs.dataengweekly.com/jobs/7c9152cb-d0b5-496d-b304-3252e0c01c3f

News

The call for papers for Kafka Summit San Francisco is now open through April 15th. The conference takes place September 30th and October 1st. Meanwhile, the agenda for Kafka Summit London, which takes place in May, has been announced.

https://www.confluent.io/blog/kafka-summit-2019-3-big-things

An overview of common responsibilities that define data engineering and data scientist roles. The post also talks about tools and provides an example of the distinction between the two roles.

https://towardsdatascience.com/what-is-the-difference-between-a-data-engineer-and-a-data-scientist-a25a10b91d66

The Alibaba blog has an article describing the history of the Flink API, Flink checkpointing and recovery, and the Flink runtime from version 1.0.0 through the 1.6.0 release. Lots of interesting historical context that shows the evolution of the Flink project over the last few years.

https://medium.com/@alitech_2017/a-brief-history-of-flink-tracing-the-big-data-engines-open-source-development-87464fd19e0f

Releases

Apache Flink 1.7.2 was released with over 40 fixes and improvements.

https://flink.apache.org/news/2019/02/15/release-1.7.2.html

Apache Kafka 2.1.1 is out with over 40 bug fixes and improvements.

https://lists.apache.org/thread.html/b407eb9958bb4336d725475b6d46553fb174b0bde0339adfe84efefd@%3Cannounce.apache.org%3E

Databricks Delta has new features for converting tables to Parquet, time travel, and merging data. There are more details in the announcement blog post.

https://databricks.com/blog/2019/02/19/new-databricks-delta-features-simplify-data-pipelines.html

The 2.3.0 release of Apache Pulsar is out. It has a number of new features like zstd compression, token-based authentication, and optimizations for the Presto connector. There are also enhancements to Pulsar IO (several new connectors) and Pulsar Functions (including a Kubernetes runtime).

http://pulsar.apache.org/release-notes/#230-mdash-2019-02-20-a-id-230-a

Apache NiFi 1.9.0 was released with improved integration with Apache Kudu, Apache Impala, Google BigQuery, and Amazon Web Services. There are over 100 improvements in the release.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.9.0

Marqeta has open sourced Quantum, an engine for time series aggregation that uses Redis Cache for storing materialized aggregate data. Aggregations are defined using a YAML file, and it has a simple DSL called Q for querying data.

https://medium.marqeta.com/taming-time-series-data-438b45c13e2e

Events

Curated by Datadog ( http://www.datadog.com )

California

Data Updates in HDFS-like Stores Using Apache Hive, Hudi, or Iceberg (San Jose) - Tuesday, February 26
https://www.meetup.com/Adobe-Experience-Platform-Meetup/events/258746871/

Illinois

Kafka's Core Concepts by David Greenfield (Chicago) - Wednesday, February 27
https://www.meetup.com/IllinoisJUGChicago/events/258598799/

North Carolina

High-Performance Stream-Oriented Processing Systems for IoT (Charlotte) - Tuesday, February 26
https://www.meetup.com/Enterprise-Developers-Guild/events/259063571/

CANADA

Data Processing Platforms (Toronto) - Tuesday, February 26
https://www.meetup.com/DataDenToronto/events/256195255/

An Introduction to Streaming Data and Stream Processing with Apache Kafka (Toronto) - Wednesday, February 27
https://www.meetup.com/Toronto-Kafka/events/258869662/

BRAZIL

First Data Engineering Meetup (Belo Horizonte) - Tuesday, February 26
https://www.meetup.com/engenharia-de-dados/events/258685226/

UNITED KINGDOM

Making Stream Processing Scale (London) - Tuesday, February 26
https://www.meetup.com/London-In-Memory-Computing-Meetup/events/258800682/

FRANCE

Data Engineers Meetup @ Datadog (Paris) - Tuesday, February 26
https://www.meetup.com/Paris-Data-Engineers-Meetup/events/258465471/

NETHERLANDS

Real-Time Fast Data in a Microservice Architecture (Dedemsvaart) - Thursday, February 28
https://www.meetup.com/VechtdalDev/events/258929544/

GERMANY

First Apache Kafka Meetup Stuttgart 2019 (Leinfelden-Echterdingen) - Wednesday, February 27
https://www.meetup.com/Stuttgart-Apache-Kafka-Meetup/events/257403038/

Kafka Meetup (Munich) - Thursday, February 28
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/258966190/

SWITZERLAND

Meetup on Kafka & Strimzi (Bern) - Thursday, February 28
https://www.meetup.com/Messaging-Streaming-Switzerland/events/256745372/

POLAND

Streaming Topic Model with Apache Flink + Using IoT Analytics to Save the Planet (Krakow) - Friday, March 1
https://www.meetup.com/datakrk/events/259014222/

DataMass Meetup #2: Kafka for Beginners (Gdansk) - Friday, March 1
https://www.meetup.com/DataMass/events/257812702/

ISRAEL

Apache Kafka in Production (Tel Aviv-Yafo) - Wednesday, February 27
https://www.meetup.com/ApacheKafkaTLV/events/258568278/

SOUTH AFRICA

Apache Kafka and KSQL (Sandton) - Wednesday, February 27
https://www.meetup.com/Johannesburg-Kafka-Meetup/events/258870174/

AUSTRALIA

Inaugural Melbourne Data Engineering Meetup (Melbourne) - Wednesday, February 27
https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/257979787/