Data Eng Weekly Issue #298

21 January 2019

It seems like there must have been a lot of new year's resolutions to write blog posts, because this week's issue is a huge one. Lots of variety and coverage of less common topics, like workflow engines, Apache Pig/DataFu, and LSMs. In news and releases, there are conference videos (if you need even more content!) and a couple of new open source projects to check out.

Technical

As this post describes, to prevent tight coupling with your database system, you shouldn't rely on a DB-generated ID.

https://medium.com/ingeniouslysimple/why-did-we-shift-away-from-database-generated-ids-7e0e54a49bb3

Sendbird writes about how they use HAProxy as a router/health checker for talking to replicated Redis.

https://medium.com/@Sendbird/elasticache-for-python-production-payloads-or-how-we-learned-to-stop-worrying-and-love-haproxy-e55f2f7309a0

Digdag https://www.digdag.io/ is an open-source workflow engine from Treasure Data. Culture Amp writes about their choice to use it instead of Apache Airflow.

https://medium.com/@scott_64558/in-praise-of-digag-an-alternative-to-airflow-258a0eef83bc

"Task Failed Successfully" is talk about a new workflow engine that aims to solve the shortcomings (and/or hacky workarounds) of other workflow engines. The system, Prefect, accomplishes this with new primitives to represent complex relationships between tasks, pass data, and react to failure (or non-binary task results)--see the second link for an example. If you like to geek on out workflow systems like I do, then I highly recommend watching the whole talk.

https://www.youtube.com/watch?v=TlawR_gi8-Y
https://medium.com/the-prefect-blog/prefect-runs-on-prefect-3e6df553c3a4

The Banzai Cloud provides a set of open source tools atop of Kubernetes, including a "spotguide" for running Apache Spark. They describe some of its features, including monitoring and log aggregation.

https://banzaicloud.com/blog/spotguides-spark/

Qubole has a tutorial for building an application in their managed Apache Spark service that interacts with Amazon Managed Streaming for Kafka.

https://www.qubole.com/blog/amazon-managed-streaming-for-kafka/

Many distributed systems attempt to be self-healing. In this case, that self-healing causes cascading failures across a number of complex systems: Apache Kafka, Kubernetes, Consul, and Vault. An interesting read of what happened and how they ultimately fixed the problem.

https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df

This post demonstrates two extensions for the Confluent schema registry—the first makes certain topics read-only, and the second provides a single page webapp to view the contents of the registry.

https://yokota.blog/2019/01/14/fun-with-confluent-schema-registry-extensions/

In the first DEW post on Apache Pig in some time, Paypal writes about several UDFs they've written for Pig and contributed to the Apache DataFu project. These include macros for diffing objects, deduplicating records, and sampling data by a key.

https://medium.com/paypal-engineering/a-guide-to-paypals-contributions-to-apache-datafu-b30cc25e0312

BBM has a blog series on how they've optimized scaling of Apache Zeppelin and Apache Spark by using autoscaling clusters and Spark's dynamic allocator.

https://medium.com/@F4Qtech/scaling-zeppelin-for-enterprise-part-3-ab4daae6f1d4

The Confluent blog has a tour of various types of testing, along with available tools, for Kafka streaming applications. The post covers unit testing, integration testing, avro+schema registry testing, and multi-datacenter simulation testing (based on docker compose).

https://www.confluent.io/blog/stream-processing-part-2-testing-your-streaming-application

A transcript, slides, and video of a talk on how many high-performance data systems (e.g. Apache Cassandra or those using RocksDB) store data on disk using Log-Structured Merge-Trees (LSM). A key feature of LSMs is the ability to write data quickly by avoiding random disk io.

https://www.infoq.com/presentations/storage-algorithms

By default, Apache Hive and Apache Impala might not be doing what you'd expect when reading columns out of an Apache Parquet file. This post walks through some examples and summarizes how to fix if you stumble across the same issue.

https://medium.com/@kartik.gupta_56068/hive-vs-impala-schema-loading-case-reading-parquet-files-acd0280c2cb3

This post has an introduction to higher order functions in Apache Spark 2.4.0, which enables a whole new slew of functions that take a lambda function as a parameter. There are some examples for nested arrays of data (e.g. filter, exists, aggregate, and aggregate).

https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read

A deep dive into the data replication architecture of Elasticsearch, including discussion of how the cluster responds to various failure scenarios and the PacificA distributed consistency protocol it uses.

https://medium.com/@Alibaba_Cloud/elasticsearch-distributed-consistency-principles-analysis-3-data-a98cc436bc6b

This post has a collection of eight tips for optimizing performance of Amazon Redshift.

https://medium.com/@eightfold/building-a-scalable-interactive-analytics-backend-aebeb79ee0c8

CockroachDB uses RocksDB as its underlying storage engine. This post describes why they chose RocksDB and how they've implemented core functionality. There's a good discussion of why most of their read access to RocksDB is via scans, some considerations with data replication, and how they batch operations to avoid the performance overhead of calling C from Golang.

https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/

Homeaway shares several hard-learned anti-patterns for building a streaming system. Their example shortcuts (e.g. hardcoding a partition key and making their header optional) were taken to increase adoption through ease of use, but caused problems down the line.

https://medium.com/homeaway-tech-blog/the-need-for-a-stream-registry-intro-c33881bc113

The author of this post describes three properties of a robust data pipeline—only processing required data, checkpointing, and documentation. It includes motivating examples for each.

https://medium.com/@gohitvaranasi/how-to-build-robust-data-pipelines-in-the-big-data-ecosystem-806f84d9009f

TLA+ is a tool for evaluating correctness of distributed/concurrent systems. This post provides a preview introduction and shows how to invoke it via a CLI utility rather than using the clunky IDE.

https://medium.com/@bellmar/introduction-to-tla-model-checking-in-the-command-line-c6871700a6a2

Newsletter Rec

For staying up to date on the latest security news, tools, and best practices, I strongly endorse Security Newsletter. Its weekly emails are much like this newsletter but focused on security. Checkout the archives and subscribe!

https://securitynewsletter.co/

News

Stratechery has a great analysis of the economics of open source software, using MongoDB/DocumentDB as a case study.

https://stratechery.com/2019/aws-mongodb-and-the-economic-realities-of-open-source/

It turns out that the MySQL protocol allows the server to request arbitrary files from the client, a fact which was recently exploited.

https://gwillem.gitlab.io/2019/01/20/sites-hacked-via-mysql-protocal-flaw/

"The Internals of PostgreSQL" is an online book for database admins and system developers. There are 11 chapters.

http://www.interdb.jp/pg/index.html

The videos from PyData DC 2018, which took place last fall, have been published. Lots of content on machine learning, distributed computation, data cleaning, and more.

https://www.youtube.com/playlist?list=PLGVZCDnMOq0p9pa2s8WXdjk7nU8iei9ay

Releases

Mockedstreams is a scala library for testing Apache Kafka and Kafka Streams applications. Version 3.1 was just released with support for the Scala DSL for topologies.

https://github.com/jpzk/mockedstreams/releases/tag/v3.1

Apparate is a new open source tool from ShopRunner for bundling libraries via CI/CD for Databricks.

https://databricks.com/blog/2019/01/15/apparate-managing-libraries-in-databricks-with-ci-cd.html
https://github.com/ShopRunner/apparate

Version 2.6.0 of Apache Kylin, the OLAP service Hadoop, has been released.

https://lists.apache.org/thread.html/43e640f5cdc86edff1ba3632fb27d4bfe0f370a43b5731ae44e999bf@%3Cannounce.apache.org%3E

The FoundationDB team has released a new Record Layer, which provides relational db semantics atop of RecordDB. As the blog post mentions, this system powers the iCloud CloudKit. It has a query API and query planner, although there doesn't yet appear to be a SQL interface.

https://www.foundationdb.org/blog/announcing-record-layer/

Scylla, the open source database with compatibility with Apache Cassandra, has released version 3.0. The new version includes materialized views, secondary indexes, hinted hand-offs, performance improvements, and more.

https://www.scylladb.com/2019/01/17/scylla-open-source-3-0-overview/

Events

Curated by Datadog ( http://www.datadog.com )

California

CDC and Building a Streaming Analytics Stack with Kafka and Druid (Menlo Park) - Tuesday, January 22
https://www.meetup.com/KafkaBayArea/events/257342229/

Colorado

Introducing the Data River: Apache Druid Is the Next Analytics Platform (Broomfield) - Thursday, January 24
https://www.meetup.com/Boulder-Denver-Big-Data/events/256868245/

Florida

Big Data Processing Engine: Which One Do I Use? (Weston) - Thursday, January 24
https://www.meetup.com/futureofdata-miami/events/257158328/

New York

Unifying Messaging, Queuing, Streaming & Lightweight Compute with Apache Pulsar (New York) - Thursday, January 24
https://www.meetup.com/Reactive-New-York/events/257821056/

COLOMBIA

Machine Learning, API, Flask & Python with Kafka (Bogota) - Thursday, January 24
https://www.meetup.com/Pyladies-Co-Bogota/events/258000017/

UNITED KINGDOM

Airflow Meetup @ Google (London) - Wednesday, January 23
https://www.meetup.com/London-Apache-Airflow-Meetup/events/257090138/

SWEDEN

IoT Sthlm #30: Scalability & Data (Stockholm) - Tuesday, January 22
https://www.meetup.com/IoTStockholm/events/257893870/

GERMANY

Messaging and Connecting Services + HTTP Headers: A Research Expedition (Dusseldorf) - Thursday, January 24
https://www.meetup.com/Web-Engineering-Duesseldorf/events/255993083/

Hadoop: Taming the Elephant with a Whale (Hamburg) - Thursday, January 24
https://www.meetup.com/jug-hamburg/events/257664718/

ISRAEL

Salting Your Spark to Scale (Herzliya) - Wednesday, January 23
https://www.meetup.com/AppsFlyer/events/258151406/