Data Eng Weekly


Data Eng Weekly Issue #297

13 January 2019

Several technical posts this week with advice on working with relational databases, Apache Airflow / ETL tools, and Apache Spark Structured Streaming. There are also posts on securing PII data in your data warehouse, the Kubernetes API, and performance improvements in CockroachDB. Finally, congrats to the data Artisans team on their acquisition and to the Apache Airflow team for graduating to a top-level project.

Technical

A good overview of design considerations to enable continuous delivery in a software project built on a relational database. Specifically, they describe an approach to supporting multiple versions of table definitions in your application code to minimize breaking changes.

https://queue.acm.org/detail.cfm?id=3300018
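
For flavor, here is a rough sketch of what that can look like in application code (the table, columns, and split-name migration are hypothetical, not from the article): the code tolerates both the old and the new table definition, so a schema change can roll out without a breaking deploy.

    # Hypothetical expand/contract-style sketch: v1 of the customers table has a
    # single `name` column; v2 splits it into `first_name` / `last_name`. Old and
    # new application instances can run side by side while the backfill happens.
    def read_customer(row: dict) -> dict:
        if "first_name" in row:                       # new schema already applied
            full_name = f"{row['first_name']} {row['last_name']}".strip()
        else:                                         # old schema still in place
            full_name = row["name"]
        return {"id": row["id"], "name": full_name}

    def write_customer(cursor, customer: dict, schema_version: int) -> None:
        # During the transition, write both representations so either version of
        # the application (and the eventual contraction step) stays correct.
        if schema_version >= 2:
            first, _, last = customer["name"].partition(" ")
            cursor.execute(
                "INSERT INTO customers (id, name, first_name, last_name) "
                "VALUES (%s, %s, %s, %s)",
                (customer["id"], customer["name"], first, last),
            )
        else:
            cursor.execute(
                "INSERT INTO customers (id, name) VALUES (%s, %s)",
                (customer["id"], customer["name"]),
            )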

Astronomer writes about how they use Apache Airflow for MetaRouter, their event-routing platform. Among other topics, they discuss their migration from DC/OS to Kubernetes for Airflow, which included a switch to the Celery executor.

https://www.astronomer.io/blog/astronomer-on-astronomer-internal-use-case/

An interesting look at the Kubernetes API for building a custom scheduler. While you likely won't have a reason to implement your own scheduler, it does look like the Kubernetes API and deployment process make this easier than on some other distributed systems.

https://banzaicloud.com/blog/k8s-custom-scheduler/
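
For a sense of how small the core of a scheduler can be, here is a minimal sketch (not the Banzai Cloud implementation) using the official Kubernetes Python client; the scheduler name and the random node selection are placeholders.

    import random
    from kubernetes import client, config, watch

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def bind(pod, node_name):
        # Scheduling a pod amounts to posting a Binding object to the API server.
        body = client.V1Binding(
            metadata=client.V1ObjectMeta(name=pod.metadata.name),
            target=client.V1ObjectReference(kind="Node", api_version="v1", name=node_name),
        )
        v1.create_namespaced_binding(namespace=pod.metadata.namespace, body=body,
                                     _preload_content=False)

    nodes = [n.metadata.name for n in v1.list_node().items]
    for event in watch.Watch().stream(v1.list_pod_for_all_namespaces):
        pod = event["object"]
        # Only handle pods that requested this (made-up) scheduler and are unscheduled.
        if pod.spec.scheduler_name == "toy-scheduler" and not pod.spec.node_name:
            bind(pod, random.choice(nodes))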

A good introduction to the various stages of ETL, things to consider for each stage (e.g. auditing), types of data cleansing and transformation, common challenges (e.g. performance issues, data format changes), and more.

https://medium.com/hashmapinc/etl-understanding-it-and-effectively-using-it-f827a5b3e54d
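
As a toy illustration of the stages, a minimal Python skeleton with the kind of row-count auditing the post recommends (file, column, and sink names are made up):

    import csv
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("etl")

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        cleaned = []
        for row in rows:
            email = row.get("email", "").strip().lower()
            if not email:               # basic cleansing: drop incomplete records
                continue
            cleaned.append({"email": email, "signup_date": row["signup_date"]})
        return cleaned

    def load(rows, sink):
        sink.extend(rows)               # stand-in for a warehouse bulk load

    def run(path, sink):
        started = datetime.now(timezone.utc)
        extracted = extract(path)
        cleaned = transform(extracted)
        load(cleaned, sink)
        # Audit trail: when the run happened, rows in, rows loaded, rows rejected.
        log.info("run at %s: extracted=%d loaded=%d rejected=%d",
                 started.isoformat(), len(extracted), len(cleaned),
                 len(extracted) - len(cleaned))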

An example implementation of role-based access control and database layout in Snowflake to isolate and mask PII data. The solution allows all users to query all the tables, but only some users (roles) to access data without masking.

https://medium.com/hashmapinc/6-steps-to-secure-pii-in-snowflakes-cloud-data-warehouse-f950c35839e3
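
The rough shape of the idea, sketched with made-up database, table, and role names (not the exact DDL from the post): raw PII lives in a restricted schema, and a secure view hashes the sensitive columns unless the caller holds the right role.

    # Snowflake DDL held as strings and applied through any DB-API cursor
    # (for example one from the snowflake-connector-python package).
    MASKED_VIEW = """
    CREATE OR REPLACE SECURE VIEW analytics.public.users_v AS
    SELECT
        user_id,
        CASE WHEN CURRENT_ROLE() = 'PII_READER'
             THEN email
             ELSE SHA2(email)           -- everyone else sees a one-way hash
        END AS email,
        signup_date
    FROM pii_db.restricted.users
    """

    GRANTS = [
        "GRANT SELECT ON VIEW analytics.public.users_v TO ROLE ANALYST",
        "GRANT SELECT ON VIEW analytics.public.users_v TO ROLE PII_READER",
        # Only PII_READER can query the underlying table directly.
        "GRANT SELECT ON TABLE pii_db.restricted.users TO ROLE PII_READER",
    ]

    def apply(cursor):
        cursor.execute(MASKED_VIEW)
        for stmt in GRANTS:
            cursor.execute(stmt)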

Confluent has a multi-part tutorial for building an event-driven application on the Confluent Platform. The exercises cover activities like streaming joins, stateful operations, and enrichment with KSQL.

https://www.confluent.io/blog/stream-processing-part-1-tutorial-developing-streaming-applications
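
To give a flavor of the KSQL involved, an illustrative stream-table enrichment (topics and columns are made up, not the tutorial's), held here as Python string constants though you would normally paste the statements into the KSQL CLI:

    # Declare a stream over a Kafka topic of raw events.
    CREATE_STREAM = """
    CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
      WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
    """

    # Declare a table of user metadata keyed by userid.
    CREATE_TABLE = """
    CREATE TABLE users (userid VARCHAR, region VARCHAR)
      WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='userid');
    """

    # Stream-table join producing a continuously updated, enriched stream.
    ENRICH = """
    CREATE STREAM pageviews_enriched AS
      SELECT pv.viewtime, pv.pageid, u.region
      FROM pageviews pv
      LEFT JOIN users u ON pv.userid = u.userid;
    """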

Cockroach Labs writes about transaction pipelining, a new feature in version 2.1 of CockroachDB. The implementation details are covered in the article (and hard to summarize), but they achieve the impressive feat of turning latency that scales linearly with the number of DML statements into a constant overhead.

https://www.cockroachlabs.com/blog/transaction-pipelining/
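
A quick way to picture the win (the local cluster and table below are hypothetical; CockroachDB speaks the Postgres wire protocol, so psycopg2 works): a transaction with many DML statements used to pay a consensus round trip per statement, whereas with pipelining it mostly waits once, at COMMIT.

    import psycopg2

    # Hypothetical single-node cluster running locally in insecure mode.
    conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS events (id INT PRIMARY KEY, note STRING)")

    conn.autocommit = False
    with conn:                          # one explicit transaction, committed on exit
        with conn.cursor() as cur:
            for i in range(10):         # ten DML statements, roughly constant consensus overhead
                cur.execute("INSERT INTO events (id, note) VALUES (%s, %s)", (i, "pipelined"))
    conn.close()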

An introduction and tutorial to Apache Airflow for data management. Even if you're already familiar with Airflow, you might want to read this one for the long-running airline + weather analogy.

https://medium.com/leboncoin-engineering-blog/data-traffic-control-with-apache-airflow-ab8fd3fc8638
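
If you haven't seen Airflow code before, a minimal DAG looks roughly like this (the DAG, tasks, and weather theme here are made up to echo the post's analogy, using the Airflow 1.10-era imports current when it was written):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def fetch_weather(**context):
        # Placeholder extraction step, e.g. calling a weather API for the run date.
        print("fetching weather for", context["ds"])

    def build_report(**context):
        print("building report for", context["ds"])

    default_args = {
        "owner": "data-eng",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(dag_id="weather_report",
             default_args=default_args,
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        fetch = PythonOperator(task_id="fetch_weather",
                               python_callable=fetch_weather,
                               provide_context=True)
        report = PythonOperator(task_id="build_report",
                                python_callable=build_report,
                                provide_context=True)
        fetch >> report             # the report only runs after the fetch succeeds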

A good list of common rules for building a reliable data platform. The last one is a meta-rule: use all of your tools (Airflow and Spark among them) to automate as much of the implementation of the other rules as possible.

https://medium.com/@boazberman/the-big-data-intro-i-wish-ive-got-b435875b4dbe

Apache Spark 2.4.0 includes the ability to select different watermarking strategies for joining streams. This post describes the semantics of the different strategies and has some code examples that demonstrate the difference.

https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-2.4.0-features-watermark-configuration/read
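
The knob in question is spark.sql.streaming.multipleWatermarkPolicy, which controls whether the global watermark follows the slowest ("min", the default) or the fastest ("max") of the watermarked inputs. A compressed PySpark sketch of a stream-stream join with two watermarks (the rate sources and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = (SparkSession.builder
             .appName("multi-watermark-demo")
             # "min" (default) waits for the slowest stream; "max" follows the fastest.
             .config("spark.sql.streaming.multipleWatermarkPolicy", "max")
             .getOrCreate())

    impressions = (spark.readStream.format("rate").load()
                   .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
                   .withWatermark("impressionTime", "10 minutes"))

    clicks = (spark.readStream.format("rate").load()
              .selectExpr("value AS clickAdId", "timestamp AS clickTime")
              .withWatermark("clickTime", "20 minutes"))

    joined = impressions.join(
        clicks,
        expr("""clickAdId = impressionAdId AND
                clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"""))

    query = joined.writeStream.format("console").start()
    query.awaitTermination()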

An introduction to efficiently loading data from Java using the non-SQL-standard LOAD DATA and COPY commands for MySQL and Postgres, respectively. For comparison, the author has also written about using JPA and JDBC to quickly insert data.

https://medium.com/@jerolba/persisting-fast-in-database-load-data-and-copy-caf645a62909
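
The post covers the Java drivers, but the idea translates directly to other clients; as a rough Python analogue of the Postgres COPY path, via psycopg2's copy_expert (connection string, table, and file are placeholders):

    import psycopg2

    conn = psycopg2.connect("postgresql://user:secret@localhost:5432/demo")
    with conn, conn.cursor() as cur, open("users.csv") as f:
        # COPY ... FROM STDIN streams the whole file through a single command,
        # typically far faster than issuing batched INSERT statements.
        cur.copy_expert("COPY users (id, email, signup_date) FROM STDIN WITH CSV HEADER", f)
    conn.close()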

News

"Architecting Modern Data Platforms" is a new book from O'Reilly. Covering Hadoop, it discusses tools and offers advice about infrastructure (e.g. compute and networking architecture), platform (e.g. integrating with an identity provider), and operating in the cloud.

https://learning.oreilly.com/library/view/architecting-modern-data/9781491969267/

Apache Airflow was announced as a top-level Apache Software Foundation project.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces44

Alibaba has acquired data Artisans, the company that was started by several members of the team behind Apache Flink. Datanami has more details on the history of data Artisans, and what we can expect going forward.

https://www.da-platform.com/blog/data-artisans-alibaba-new-chapter-for-open-source-big-data
https://www.datanami.com/2019/01/08/alibaba-acquires-apache-flink-backer-data-artisans/

A good post on the "Feynman trap" that often occurs when looking for patterns in big data.

https://www.wired.com/story/the-exaggerated-promise-of-data-mining/

The new Cloudera has started talking more about their plans for unifying the Hortonworks Data Platform and CDH distributions. The combined product will be called the Cloudera Data Platform, and existing releases will be supported through January 2022.

https://www.datanami.com/2019/01/10/cloudera-unveils-cdp-talks-up-enterprise-data-cloud/

Releases

LiteCLI is a new CLI for SQLite with auto-complete and other user-friendly features.

https://www.pgcli.com/launching-litecli.html

Version 5.1 of the Databricks Runtime is out with Azure improvements, Databricks Delta improvements, and the ability to install and import Python libraries for particular notebooks.

https://databricks.com/blog/2019/01/08/announcing-databricks-runtime-5-1.html
https://databricks.com/blog/2019/01/08/introducing-databricks-library-utilities-for-notebooks.html
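
A quick usage sketch of the notebook-scoped installs described in the second post (the package is just an example, and dbutils is only available inside a Databricks notebook):

    # Install a library visible to this notebook's session only, not the whole cluster.
    dbutils.library.installPyPI("scikit-learn")
    dbutils.library.restartPython()   # restart the Python process so the install takes effect

    import sklearn                    # now importable from this notebook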

Apache Flume had its first release in over a year. The 1.9.0 release has a large number of updates and improvements, including support for newer versions of HBase and Kafka.

https://lists.apache.org/thread.html/37e3661aabc759099c600275502c2705f0b34b87c889fb19ca5f2116@%3Cannounce.apache.org%3E

Amazon Web Services has announced Amazon DocumentDB, which is a MongoDB-compatible document database. It has some novel features, including 6x replication across 3 availability zones.

https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/
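
"MongoDB-compatible" means existing drivers and tools should work unchanged; for example, a pymongo connection along these lines (the cluster endpoint, credentials, and CA bundle are placeholders, and DocumentDB clusters are only reachable from inside their VPC):

    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://admin:secret@my-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017/"
        "?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0")
    events = client.analytics.events
    events.insert_one({"type": "pageview", "path": "/pricing"})
    print(events.count_documents({"type": "pageview"}))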

Apache HBase 2.1.2, which includes 70 bug fixes and improvements over the 2.1.1 release, was announced.

https://lists.apache.org/thread.html/e843e8e09ef731f6a96c2cfe7c5cbf8f67a75c383084121bcf794cf9@%3Cannounce.apache.org%3E

SQLer is a service for building REST APIs for SQL databases using configuration files containing SQL and JavaScript. SQLer supports validation rules, authorization, and much more.

https://github.com/alash3al/sqler