Data Eng Weekly Issue #309

28 April 2019

Another issue covering two weeks' worth of content (I'll return to the weekly issues again eventually!), and there are great articles on change data capture from Uber, powerful features of Postgres, and auto-scaling Apache Kafka Streams. There are two newly open-sourced projects from Databricks and Google which are worth checking out, too.

Technical

The Uber Engineering blog has a look at their change data capture system, called DBEvents, which ships lots of data from MySQL, Cassandra, etc. to HDFS and Kafka. Their post goes into the details of how they bootstrap new data sources, the tools they've built for freshness, quality, and efficiency, and specifics on how they've standardized changelog events across systems. Several of the tools mentioned, like StorageTapper for processing the MySQL binlog and Marmaray for data ingestion, are open-source projects from Uber.

https://eng.uber.com/dbevents-ingestion-framework/

This article has a good look at the trade-offs and considerations in designing a distributed database. It describes YugaByteDB's approach to those key decisions, like how to implement distributed transactions and considerations for clocks, as they implemented a system that is based on the Google Spanner architecture.

https://blog.yugabyte.com/6-technical-challenges-developing-a-distributed-sql-database/

An Apache Kafka Streams application can dynamically scale its number of workers, taking advantage of the Kafka client's built-in consumer rebalancing. This post looks at how you can link this with Kubernetes' custom metrics + autoscaling support (although the solution works just as well for non-Kubernetes deployments) to scale up and down based on consumer lag. There's a tutorial for implementing this strategy using the Prometheus jmx-exporter and Google Stackdriver.

https://medium.com/xebia-france/kafka-streams-a-road-to-autoscaling-via-kubernetes-417f2597439
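
To make the lag-based idea concrete, here is a minimal sketch of a proportional scaling rule. It is not taken from the linked post: the function name, parameters, and thresholds are all hypothetical, and it assumes you already have a total consumer lag figure from your metrics system. The cap at the partition count reflects that Kafka Streams cannot usefully run more workers than it has input partitions to assign.

```python
import math


def desired_replicas(total_lag, target_lag_per_replica, num_partitions, min_replicas=1):
    """Return how many workers to run for the observed consumer lag.

    Hypothetical helper: the names and the rule itself are illustrative
    assumptions, not the autoscaler's actual implementation.
    """
    if target_lag_per_replica <= 0:
        raise ValueError("target_lag_per_replica must be positive")
    # Enough replicas that each one carries at most the target lag.
    wanted = math.ceil(total_lag / target_lag_per_replica)
    # Never drop below the floor, never exceed the number of partitions.
    return max(min_replicas, min(wanted, num_partitions))
```

For example, with a target of 1,000 messages of lag per worker and 12 input partitions, a total lag of 4,500 asks for 5 workers, while a lag of 50,000 is capped at 12.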

If you had asked me whether SQL is Turing complete, I’m not sure how I’d answer. It turns out it is, and you can do some interesting things once you have that knowledge—like generating fractals. This post describes a bunch of interesting SQL features to try out, like Postgres' generate_series and recursive common table expressions (which can be used for iterative algorithms).

https://malisper.me/generating-fractals-with-postgres-escape-time-fractals/
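
As a rough illustration of using a recursive common table expression for an iterative algorithm, the sketch below counts escape-time iterations for a single point of the Mandelbrot set. To keep it runnable without a database server, it uses Python's built-in sqlite3 module rather than Postgres; that substitution, the query, and the escape_time helper are my own assumptions and not code from the linked post. The WITH RECURSIVE form is also supported by Postgres, though parameter binding differs by driver.

```python
import sqlite3

# Each recursive step applies z -> z^2 + c and stops once |z|^2 exceeds 4
# or the iteration budget is used up.
QUERY = """
WITH RECURSIVE iterate(n, zr, zi) AS (
    SELECT 0, 0.0, 0.0
    UNION ALL
    SELECT n + 1, zr*zr - zi*zi + :cr, 2*zr*zi + :ci
    FROM iterate
    WHERE n < :max_iter AND zr*zr + zi*zi <= 4.0
)
SELECT MAX(n) FROM iterate
"""


def escape_time(cr, ci, max_iter=50):
    """Hypothetical helper: number of iterations performed for c = cr + ci*i."""
    conn = sqlite3.connect(":memory:")
    try:
        params = {"cr": cr, "ci": ci, "max_iter": max_iter}
        (n,) = conn.execute(QUERY, params).fetchone()
        return n
    finally:
        conn.close()
```

Points that never escape use the whole iteration budget, and points far from the set stop after a step or two. Running this over a grid of points is what produces the fractal image.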

Databricks has open sourced their Delta platform for Apache Spark that adds ACID semantics, data versioning, schema enforcement and evolution, and more. If you want to quickly try it out, the Delta 0.1.0 release provides a drop-in replacement for saving a Spark dataframe to Delta rather than as Apache Parquet.

https://databricks.com/blog/2019/04/24/open-sourcing-delta-lake.html
https://delta.io/

Google has open sourced ZetaSQL, which is a SQL frontend (parser, analyzer, and more). It is built to work with various backends, and it comes with lots of table-stakes features like approximate algorithms (HLL and friends) and JSON support. It also has some interesting features like native support for constructing protobufs and writing UDFs in JavaScript. As you might expect, gRPC is used to communicate with the server. Given how popular Apache Calcite has been in the JVM ecosystem, it'll be interesting to see if ZetaSQL takes off elsewhere.

https://github.com/google/zetasql

Really handy list of lesser-known but highly useful Postgres features, like pub/sub notifications, table inheritance, and various types of indexes. Even if you're using Postgres on a daily basis, chances are there's something new to try out.

https://pgdash.io/blog/postgres-features.html

HomeAway describes how they ship PMML model definitions to KSQL (via a user-defined function) to implement real-time model evaluation. If you're not familiar with PMML, it's a cross-language standard for defining model parameters. There's a link to the PMML file for the accompanying demo (the code for the end-to-end demo is on GitHub) if you want to see what a file looks like.

https://medium.com/homeaway-tech-blog/ml-in-ksql-a44d455b0687

Who doesn't love a simple solution to a distributed systems problem? This post describes how Instagram solved a thundering herd problem occurring during cold start of a cache-heavy service. Briefly, they had the clever idea to cache Promises rather than the actual query responses. If you're not familiar with thundering herds, the post serves as a good introduction.

https://instagram-engineering.com/thundering-herds-promises-82191c8af57d
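
The promise-caching idea can be sketched in a few lines. This is not Instagram's code: it is a minimal Python asyncio version in which the PromiseCache class, the slow_query stand-in for the backend, and the key names are all hypothetical. The cache stores the in-flight task instead of the finished value, so every caller that misses while the first request is still running waits on the same task rather than hitting the backend again.

```python
import asyncio


class PromiseCache:
    """Caches in-flight tasks, so concurrent misses share one backend call."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical async function: key -> value
        self._tasks = {}

    async def get(self, key):
        task = self._tasks.get(key)
        if task is None:
            # First caller starts the load; the task is cached immediately.
            task = asyncio.ensure_future(self._loader(key))
            self._tasks[key] = task
        try:
            return await task
        except Exception:
            # Drop failed entries so a later call can retry.
            if self._tasks.get(key) is task:
                del self._tasks[key]
            raise


backend_calls = 0


async def slow_query(key):
    """Stand-in for an expensive backend query (an assumption for the demo)."""
    global backend_calls
    backend_calls += 1
    await asyncio.sleep(0.01)
    return f"value-for-{key}"


async def demo():
    cache = PromiseCache(slow_query)
    # 100 concurrent requests for the same cold key.
    return await asyncio.gather(*(cache.get("user:1") for _ in range(100)))


results = asyncio.run(demo())
```

In this demo all 100 callers receive the same value while the backend is queried once. A production version would also need expiry and eviction, which this sketch leaves out.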

Events

Curated by Datadog ( http://www.datadog.com )

California

Apache Kafka Controller with Jun Rao + Kafka at Uber (Palo Alto) - Monday, April 29
https://www.meetup.com/KafkaBayArea/events/260657485/

Analyzing Streaming Data + Columnar Formatted Data for Analytics (Menlo Park) - Tuesday, April 30
https://www.meetup.com/Bay-Area-In-Memory-Computing/events/260653275/

Microservices to Analyze Event Streams + The Art and Science of Capacity Planning (Foster City) - Tuesday, April 30
https://www.meetup.com/KafkaBayArea/events/260545395/

The Latest in Apache Hive, Spark, Druid, and Impala (Palo Alto) - Wednesday, May 1
https://www.meetup.com/futureofdata-siliconvalley/events/260713401/

Missouri

Data Engineering and the Data Science Lifecycle (Saint Louis) - Wednesday, May 1
https://www.meetup.com/St-Louis-Big-Data-IDEA/events/257350394/

Illinois

CEO/Founder of Qubole Discusses Big Data in the Cloud: From Facebook to IoT (Chicago) - Wednesday, May 1
https://www.meetup.com/Chicago-Analytics-and-Data-Science-in-the-Cloud/events/259959476/

Wisconsin

CEO/Founder of Qubole Discusses Big Data in the Cloud: From Facebook to IoT (Madison) - Tuesday, April 30
https://www.meetup.com/BigDataMadison/events/258887308/

Georgia

Big Data Processing Engines: Which One Do I Use? (Atlanta) - Tuesday, April 30
https://www.meetup.com/BigDataATL/events/260059921/

New York

Introduction to Data Processing with Apache Spark (Rochester) - Thursday, May 2
https://www.meetup.com/RocDev/events/259972122/

UNITED KINGDOM

Apache Beam Meetup: Beam at Lyft + Datalake Using Beam + Schemas (London) - Wednesday, May 1
https://www.meetup.com/London-Apache-Beam-Meetup/events/260526077/

GERMANY

Confluent and Visual Meta Kafka Talk (Berlin) - Monday, April 29
https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/260244223/

Kafka, Microservices, and More from Confluent and Camunda (Munich) - Monday, April 29
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/260243521/

SWITZERLAND

Our First Apache Kafka Meetup in Bern (Bern) - Thursday, May 2
https://www.meetup.com/Bern-Kafka/events/260282429/

ISRAEL

Druid, Data Pipelines, and Engineering (Tel Aviv-Yafo) - Tuesday, April 30
https://www.meetup.com/Druid-Israel/events/260506060/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.