Data Eng Weekly Issue #295

30 December 2018

Short issue this week given the holidays in the US. Several posts about Apache Airflow, and also great posts on TLA+, circuit breakers, and Uber's M3. And if you have some extra time off for reading, there's a link to the papers from the upcoming Conference on Innovative Data Systems Research, which has a lot of interesting content.

Technical

Great introduction to the circuit breaker pattern for building resilient systems. It discusses implementation details such as whether to scope a circuit breaker per host, per service, or at some other granularity.

https://engineering.grab.com/designing-resilient-systems-part-1
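
For readers new to the pattern, here is a minimal Python sketch of a circuit breaker and of the scoping question. It is not taken from the Grab post: the class name, the thresholds, the `guarded_call` helper, and the keys are all hypothetical, and a production breaker would also need thread safety and metrics.

```python
import time
from collections import defaultdict


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Scoping is just the choice of key: one breaker per host, per service, etc.
breakers = defaultdict(CircuitBreaker)


def guarded_call(scope_key, func, *args, **kwargs):
    """Route a call through the breaker that belongs to scope_key."""
    return breakers[scope_key].call(func, *args, **kwargs)
```

Keying `breakers` by host name isolates a single bad host, while keying by service name trips the breaker for every host behind that service at once.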

This presentation has an overview of the new features in Apache Flink 1.7, including support for evolving the schema of records in the state store, joins on temporal tables, pattern analysis via the MATCH_RECOGNIZE SQL clause, and an exactly-once S3 StreamingFileSink. There are also some details of what's coming in Flink 1.8.

https://www.slideshare.net/tillrohrmann/apache-flink-17-and-beyond

A good look at the architecture of a real-world Amazon EMR and Spark-based data analytics platform. The post covers the overall architecture and provides example Spark code (along with some tips).

https://medium.com/@tomas.duhourq/building-scalable-analytics-with-aws-part-i-6de6a90e3513

The HiveWarehouseConnector (HWC) for Spark provides a read API for executing queries via Hive and retrieving the results as DataFrames, as well as a write API for writing data to Hive from DataFrames and via Structured Streaming. HWC is bundled with HDP, and the code is open source on GitHub.

https://hortonworks.com/blog/hive-warehouse-connector-use-cases/

The HELK stack extends the standard ELK stack with tools for analyzing security event log data. This post looks at using KSQL to enrich the data already in Kafka with joins for additional analysis.

https://posts.specterops.io/real-time-sysmon-processing-via-ksql-and-helk-part-1-initial-integration-88c2b6eac839

Great collection of tips for working with Apache Airflow, like how to efficiently define DAGs and how to store passwords/secrets for db credentials.

https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f

TLA+ is a formal specification language used to model and verify distributed systems. This tutorial shows how to use TLA+ to analyze two-phase commit.

https://muratbuffalo.blogspot.com/2018/12/2-phase-commit-and-beyond.html

This post looks at the architecture of M3, Uber's open source distributed time series database. It describes the main components, compatibility with the Prometheus and StatsD APIs, the query language, and more.

https://towardsdatascience.com/introducing-m3-8790c503ce24

Often with batch jobs, it's important to detect delayed or slow jobs in addition to failed ones. This article describes how Walmart Labs analyzes historic runs to predict when a job isn't meeting its SLA.

https://medium.com/walmartlabs/auditing-airflow-batch-jobs-73b45100045
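
As a rough illustration of the general idea (not the method from the Walmart Labs article, whose details aren't summarized here), one simple assumption is to flag a run once its elapsed time exceeds the historic mean by a few standard deviations. The function names and the default multiplier `k` below are hypothetical.

```python
from statistics import mean, stdev


def sla_threshold(durations, k=3.0):
    """Return mean + k * sample standard deviation of historic durations."""
    if len(durations) < 2:
        raise ValueError("need at least two historic runs")
    return mean(durations) + k * stdev(durations)


def is_delayed(elapsed, durations, k=3.0):
    """True if a run's elapsed time is beyond the historic threshold."""
    return elapsed > sla_threshold(durations, k)


# Hypothetical history of run durations, in seconds.
history = [100, 110, 90, 105, 95]
print(is_delayed(120, history))  # within the usual range
print(is_delayed(130, history))  # well beyond the usual range
```

Because the check works on elapsed time rather than final status, it can run while a job is still in progress and catch a slow run before it fails or misses its SLA.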

Geoblink writes about their usage of Apache Airflow. They particularly like the easy extensibility of Airflow, which they demonstrate by showing how to implement a PostgresHook for loading PostGIS data.

https://medium.com/geoblinktech/bring-sanity-to-your-data-pipelines-with-apache-airflow-3c9906aac77c

Jobs

Lead Distributed Data Engineer, Nomics, Remote https://jobs.dataengweekly.com/jobs/ffddf793-9772-4ff2-9c1a-aca3f02064c5

The new year finds lots of folks looking for a change! Reach Data Eng Weekly candidates by posting a job to our job board for $99. https://jobs.dataengweekly.com/

News

Datanami has coverage of the companies, like Redis, MongoDB, and Confluent, that are licensing their software with restrictions on SaaS usage.

https://www.datanami.com/2018/12/24/cloud-backlash-grows-as-open-source-gets-less-open/

The Conference on Innovative Data Systems Research is this January, but the papers for the conference have already been posted online. There are several interesting papers looking at topics like deep learning for query optimization and using specialized hardware like GPUs for database applications.

http://cidrdb.org/cidr2019/program.html

The EU is funding a bug bounty for open source software, including Apache Kafka and WSO2.

https://www.zdnet.com/article/eu-to-fund-bug-bounty-programs-for-14-open-source-projects-starting-january-2019/

Releases

Apache Flink had a string of dot releases for the 1.5, 1.6, and 1.7 branches. There are lots of bug fixes and improvements in each release.

https://flink.apache.org/news/2018/12/21/release-1.7.1.html
https://flink.apache.org/news/2018/12/22/release-1.6.3.html
https://flink.apache.org/news/2018/12/26/release-1.5.6.html