Data Eng Weekly

Data Eng Weekly Issue #317

14 July 2019

Lots of new tools to check out this week: Dagster, Dataform, Beast (a Kafka-to-BigQuery service), coverage of OpenTSDB and Graphite (yes, they're still in use and getting new tools!), and two great technical deep dives, one on change data capture for Apache Cassandra and one on efficiently loading data into PostgreSQL using Python.


Google Cloud writes about running a database on Kubernetes—they describe some Kubernetes-specific items to think about and reference a couple of Kubernetes Operator implementations that address the major pieces of running a database on Kubernetes.

Dagster is a new open source workflow engine from a team that includes one of the co-creators of GraphQL. It includes a nice UI for inspecting your pipelines and a number of integrations, including AWS, GCP, PagerDuty, Snowflake, Datadog, Spark, and Airflow. The tool aims to give data scientists, data engineers, and others a shared way to collaborate on data pipelines.

A look at OpenTSDB's row key format, how that format distributes metrics across region servers, and a new tool from the Salesforce team to look up the region server for a particular key.
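As a rough illustration of the row key layout the post covers, here's a stdlib Python sketch. It assumes OpenTSDB's default 3-byte UIDs and hour-aligned base timestamps; the UID values in the usage line are made up:

```python
import struct

UID_WIDTH = 3  # OpenTSDB defaults to 3-byte UIDs for metrics, tag keys, and tag values

def make_row_key(metric_uid: int, timestamp: int, tags: dict) -> bytes:
    """Build an OpenTSDB-style HBase row key:
    [metric UID][base timestamp][tagk UID][tagv UID]...
    The timestamp is aligned down to the hour, so all data points
    in the same hour for a series share one row."""
    base_ts = timestamp - (timestamp % 3600)
    key = metric_uid.to_bytes(UID_WIDTH, "big")
    key += struct.pack(">I", base_ts)  # 4-byte big-endian seconds
    for tagk_uid, tagv_uid in sorted(tags.items()):  # tags ordered by tag-key UID
        key += tagk_uid.to_bytes(UID_WIDTH, "big")
        key += tagv_uid.to_bytes(UID_WIDTH, "big")
    return key

# Hypothetical UIDs: metric sys.cpu.user=1, tag host(1)=web01(42)
key = make_row_key(1, 1_563_100_000, {1: 42})
```

Because the metric UID leads the key, all rows for one metric sort together in HBase, which is what drives the region-server distribution the post discusses.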

Another look at time series databases: Teads writes about how they scaled Graphite from 50k to 400k metrics per second by switching to the Go Graphite stack. That stack includes a relay server that implements consistent hashing, to which the Teads folks added deterministic replication. Their post includes an architectural overview and a discussion of how they add and replace nodes (which requires some additional scripting) to maintain data availability.
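The consistent hashing the relay performs can be sketched with a hash ring; here's a minimal stdlib Python version (the node names and virtual-node count are made up for illustration, not taken from the post):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each node gets several virtual
    points on the ring, and a metric routes to the first node
    clockwise of its hash. Adding or removing a node only remaps
    the keys that fell on that node's points."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node, vnodes)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, metric: str) -> str:
        idx = bisect.bisect(self.ring, (self._hash(metric), ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["go-carbon-1", "go-carbon-2", "go-carbon-3"])
node = ring.node_for("teads.web.requests.count")
```

A deterministic replication scheme like the one Teads added could then route each metric to the next N distinct nodes on the ring rather than just one.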

Dataform is a new tool for building data workflows using SQL in a data warehouse. It lets you schedule creation of new tables based on a SQL query, store these transformations in version control, and manage checks/alerts on those tables.

The Sematext blog looks at a number of open source tools for log analysis and monitoring. They break out the components of a log pipeline into log shippers/parsers, storage, search, visualization, and alerting. For each component, they look at some popular tools (like rsyslog as a log shipper) and describe some common combinations.

This post looks at a half dozen strategies for loading data into PostgreSQL using Python and psycopg. The author measures both timing and memory usage, and performance varies across four orders of magnitude for both. The fastest is a COPY-based approach using a string iterator, though the author notes that the best approach for your use case may depend on the size and structure of your data.
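The string-iterator approach amounts to wrapping a generator of CSV lines in a file-like object so psycopg's copy_from can stream it without building the whole payload in memory. A sketch along those lines, where the table and column names are made up and copy_rows assumes a live psycopg2 cursor:

```python
import io

class StringIteratorIO(io.TextIOBase):
    """File-like wrapper around an iterator of strings, so that
    cursor.copy_from() can consume rows lazily instead of from
    one big in-memory buffer."""
    def __init__(self, iterator):
        self._iter = iterator
        self._buf = ""

    def readable(self):
        return True

    def read(self, size=-1):
        # Pull from the iterator until we have `size` characters
        # (or it is exhausted); size < 0 means read everything.
        while size < 0 or len(self._buf) < size:
            try:
                self._buf += next(self._iter)
            except StopIteration:
                break
        if size < 0:
            chunk, self._buf = self._buf, ""
        else:
            chunk, self._buf = self._buf[:size], self._buf[size:]
        return chunk

def copy_rows(cursor, rows):
    # rows: an iterable of (id, name) tuples; table/columns are hypothetical
    lines = (f"{rid}|{name}\n" for rid, name in rows)
    cursor.copy_from(StringIteratorIO(lines), "users", sep="|",
                     columns=("id", "name"))
```

Because the generator yields one line at a time, memory stays flat regardless of row count, while COPY keeps the load a single round trip on the server side.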

Beast is an open source project from Gojek for transferring protobuf datasets from Apache Kafka to BigQuery. The Gojek blog writes about the architecture (how it handles offset commits and how it scales). There's also a Helm chart if you want to give it a try on Kubernetes.

This post describes the architecture of a new open-source change data capture tool for Apache Cassandra from the folks at Wepay. The author describes several potential designs and why they landed on the architecture they did. Lots of interesting details in how they ship replicated data from an eventually consistent distributed database to BigQuery for analysis.

The New Stack looks at two presentations from last month's QCon New York. PayPal spoke about their Hera service (which is open source) for multiplexing connections to databases and horizontal scaling. LinkedIn spoke about Brooklin, their tool for moving data between distributed data stores.


Curated by Datadog


Hosted at Confluent HQ: Exactly Once + Kubernetes + CDC with Pinterest (Palo Alto) - Tuesday, July 16


.NET for Apache Spark (Bellevue) - Wednesday, July 17


Data Movement & Transformation Between Heterogeneous SQL & NoSQL Datastores (Austin) - Wednesday, July 17


Cleveland Big Data Meetup (Cleveland) - Monday, July 15


Azure Data Factory v2 Data Flows (Tampa) - Tuesday, July 16


Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Thursday, July 18


Apache Spark and Spark Structured Streaming (Bogota) - Thursday, July 18


6th Data Engineering Meetup (Belo Horizonte) - Thursday, July 18


Proof of Kafka (Milano) - Wednesday, July 17


Data Tech Talks: Open Source & Public Cloud (Warszawa) - Wednesday, July 17


Building Scalable Data Pipelines with Kafka and Apache Spark (Nairobi) - Saturday, July 20


Amazon Managed Blockchain & Data Engineering (Pune) - Saturday, July 20


Sydney Data Engineering Meetup (Haymarket) - Thursday, July 18

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.