09 September 2018
After a week off (hopefully you noticed!), there's a lot of great content and news/releases to catch up on. Technical posts cover workflow engines, distributed systems (consistent hashing and distributed hash joins), Flink at New Relic, Kafka at Stitch Fix, and much more. In news and releases, Amazon announced that they're bringing their Relational Database Service to VMware, Spotify has open sourced their cstar Cassandra maintenance tool, Cloudera announced several new releases, and data Artisans have a new enterprise release and a new product for ACID transactions on streams.
For data engineers who are frustrated with running custom monitoring scripts and reactive troubleshooting - intermix.io is a single dashboard that helps you instantly understand Amazon Redshift performance, dependencies, and bottlenecks. Fix slow dashboards and run faster queries. See across your cluster, data apps and users. Plan ahead and grow with your data.
We are giving every DataEngWeekly reader an extended free trial. Visit https://nter.mx/dataengweeekly to start.
A look at the way Luigi schedules and executes tasks in a workflow. If you're new to Luigi, it's likely pretty useful information.
http://yaaics.blogspot.com/2018/08/code-execution-order-in-luigi-pipelines.html
This presentation describes a few of the ways that the Airbnb data engineering team is building tools atop Apache Airflow for A/B testing, common types of queries, engagement metrics, capacity planning, and more.
https://prezi.com/p/adxlaplcwzho/advanced-data-engineering-patterns-with-apache-airflow/
This article provides a great overview of consistent hashing algorithms. It describes the main algorithm that's used by Riak, what some of the tradeoffs are in this solution, and describes plans for a new implementation in the Wallaroo stream system that avoids the limitations of Riak's (and DynamoDB's) approach.
https://www.infoq.com/articles/dynamo-riak-random-slicing
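The classic ring construction that the article contrasts with Random Slicing can be sketched in a few lines of Python. This is an illustrative toy only — the hash function and virtual-node count are arbitrary choices, not what Riak or Wallaroo actually use:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash for placement; md5 is an arbitrary illustrative choice.
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` positions on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First vnode clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")

# Adding a node remaps only the keys whose nearest vnode is now one of
# the new node's -- the rest keep their owner, which is the point of
# consistent hashing versus naive modulo placement.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
```

With naive `hash(key) % n` placement, growing from three to four nodes would reshuffle roughly three quarters of all keys; here only about a quarter move.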
SocialCops writes about their approach to generalizing and automating the creation of Airflow DAGs by building job template runners whose templates are defined in YAML.
https://blog.socialcops.com/technology/engineering/airflow-meta-data-engineering-disha/
This post shows how to generate a JSON changelog of a SQLite table using triggers and built-in JSON functions. Given how ubiquitous JSON functions have become in RDBMSes, you can likely use a variation of this technique elsewhere too.
https://blog.budgetwithbuckets.com/2018/08/27/sqlite-changelog.html
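A minimal sketch of the idea using Python's bundled sqlite3 module (the table and column names are invented for illustration; the linked post covers UPDATE/DELETE triggers and the fuller details). Note that `json_object()` requires SQLite's JSON1 functions, which are compiled in by default in modern builds:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT, balance REAL);
CREATE TABLE changelog (created TEXT DEFAULT CURRENT_TIMESTAMP,
                        action TEXT, data TEXT);
-- Record every INSERT as a JSON document via SQLite's json_object().
CREATE TRIGGER account_insert AFTER INSERT ON account
BEGIN
    INSERT INTO changelog (action, data)
    VALUES ('INSERT',
            json_object('id', NEW.id, 'name', NEW.name,
                        'balance', NEW.balance));
END;
""")

conn.execute("INSERT INTO account (name, balance) VALUES ('alice', 10.0)")
row = conn.execute("SELECT action, data FROM changelog").fetchone()
entry = json.loads(row[1])  # e.g. {'id': 1, 'name': 'alice', 'balance': 10.0}
```

Because the triggers run inside the same transaction as the writes they observe, the changelog stays consistent with the table without any application-level bookkeeping.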
An overview of how to organize data in Google BigQuery to keep costs and execution times low, using date partitioning and clustering. It also describes some of the limitations and tradeoffs in these solutions.
If you're curious about the details of implementing a join in a distributed data system, the team behind CrateDB describes their approach to implementing hash joins in this three-part series.
https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three/
Postgres 11 is getting the ability to efficiently add a column with a default value. Previously, this required an exclusive lock during a full table rewrite, so this is a pretty big improvement for operators.
https://brandur.org/postgres-default
The New Relic team writes about their experiences building an application on Apache Flink. They share five lessons learned, from understanding Flink's lazy evaluation to minimizing state.
https://blog.newrelic.com/engineering/what-is-apache-flink/
Both Apache Kafka and Apache Pulsar can be used as a work queue—a list of tasks to be performed asynchronously in a distributed fashion. This post looks at the Pulsar primitives, which are missing in Kafka, that make it easier to implement a work queue.
https://streaml.io/blog/creating-work-queues-with-apache-pulsar
Debezium, a change data capture tool that streams changes from relational databases into Kafka, has an embedded mode for running inside any JVM application. This tutorial shows how to use embedded mode with some simple code to send the change data to Amazon Kinesis rather than Kafka.
https://debezium.io/blog/2018/08/30/streaming-mysql-data-changes-into-kinesis/
The LINE engineering blog writes about their experiences building a multi-tenant Hadoop query platform. They describe the notebook and scheduling service they built (Apache Zeppelin wasn't quite right due to some limitations).
https://engineering.linecorp.com/en/blog/detail/333
Intermix.io (disclosure: they are sponsoring this issue) have compiled a list of fourteen ways to get the most performance out of Amazon Redshift. It covers things like defining the work queues, when to perform operations like recomputing statistics, and avoiding row skew.
https://www.intermix.io/blog/top-14-performance-tuning-techniques-for-amazon-redshift/
Stitch Fix writes about their "Data Highway" built on Apache Kafka and Kafka Connect. There's a good description of their design process and some of the trade-offs they considered as they rolled out Kafka.
https://multithreaded.stitchfix.com/blog/2018/09/05/datahighway/
This article provides a good introduction to the architecture of Apache Druid (incubating), which is an analytics database and query system.
https://towardsdatascience.com/introduction-to-druid-4bf285b92b5a
This post walks through nested objects, lazy values, and several other examples that demonstrate the quirks of Scala and Spark serialization of external values.
https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54
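The Scala examples in the post have a direct analogue in PySpark, where task closures are serialized with pickle: a closure that reads an attribute through `self` forces serialization of the entire enclosing object, while copying the value into a local variable first ships only what's needed. A standalone sketch with plain pickle (no Spark required; `Multiplier` is an invented example class):

```python
import pickle
import threading

# Hypothetical class standing in for any driver-side object that holds
# unserializable runtime state (locks, connections, a SparkContext, ...).
class Multiplier:
    def __init__(self, factor):
        self.factor = factor
        self._lock = threading.Lock()  # thread locks cannot be pickled

m = Multiplier(3)

# A closure that touches self.factor would require serializing the whole
# object -- which fails because of the lock:
try:
    pickle.dumps(m)
    whole_object_serializable = True
except TypeError:
    whole_object_serializable = False

# Copying the needed value to a local first ships only that value:
factor = m.factor
payload = pickle.dumps(factor)
```

The fix in both languages is the same: bind the field to a local variable before the closure captures it, so only the plain value crosses the serialization boundary.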
The Confluent blog has a new walkthrough in which they use KSQL to integrate external data from a REST API. It covers a number of KSQL features, like data masking, changing serialization format, and applying a schema to data.
https://www.confluent.io/blog/data-wrangling-apache-kafka-ksql
Amazon and VMware announced that the Amazon Relational Database Service will be ported to run on VMware. A preview is available now.
An argument for using the standard and ubiquitous SQL rather than home-grown query languages and layers of abstraction (like database object relational mappings) atop of SQL.
https://erikbern.com/2018/08/30/i-dont-want-to-learn-your-garbage-query-language.html
This post summarizes a discussion about the current and future state of Apache Impala. Takeaways include the growing movement to separate compute from storage and how competition between engines (Presto is mentioned a couple of times) is helping drive innovation. There are a number of links to external documents for more technical reading.
https://medium.com/@adirmashiach/impala-discussion-w-greg-rahn-the-product-manager-b2d05b28eb1e
Kafka Streams in Action is now available in print. The ePub version will be available later this week.
https://www.manning.com/books/kafka-streams-in-action
There's been an important conversation the past few days regarding the usage of certain terms for replication in data systems. After a lot of conversation and criticism, the Redis project has started an effort to change its terminology.
https://github.com/antirez/redis/issues/5335
Apache Drill 1.14.0 was released about a month ago. This release includes support for Docker, improvements to filter pushdown in Kafka and Parquet, support for spilling hash joins to disk (to support large join tables), and much more.
https://drill.apache.org/docs/apache-drill-1-14-0-release-notes/
Cloudera announced a number of new releases, including CDH 6.0.0 (see the second link for a full list of new features), Cloudera Data Warehouse—an Impala-powered hybrid data warehouse, and Altus Data Warehouse—a managed data warehouse service for AWS and Microsoft Azure. Datanami has more details on the releases.
https://www.datanami.com/2018/08/28/cloudera-pivots-from-zoo-animals-to-data-warehousing/
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_new_features.html#cdh600_new_features
Egeria is a new project from the ODPi to provide open standards for metadata and governance. The first release includes several software components that implement an open metadata repository service and adapters.
https://www.odpi.org/blog/2018/08/27/first-release-of-odpi-egeria-is-here
Apache Airflow 1.10.0 is out with a beta of Role-Based Access Control, improved Kubernetes integration via a Kubernetes Operator and Executor, new AWS and GCP integrations, and more.
https://medium.com/datareply/apache-airflow-1-10-0-released-highlights-6bbe7a37a8e1
Databricks Runtime 4.3 has been announced. It includes speedups, new functionality in Databricks Delta, and improvements to structured streaming.
https://databricks.com/blog/2018/08/27/announcing-databricks-runtime-4-3.html
Version 1.2 of the data Artisans Platform has been released. It brings major new security features: SSO, RBAC, multitenancy, a secret values API, and more.
Also announced this week was the data Artisans Streaming Ledger, which adds ACID multi-stream/table transactions to Apache Flink. This post introduces the API with a bank transaction application as a motivating example. The API and a single-node implementation of the streaming ledger are open source.
https://data-artisans.com/blog/serializable-acid-transactions-on-streaming-data
Spotify has open sourced cstar, their Cassandra orchestration tool. It provides a mechanism to run scripts across a cluster in a topology-aware way (e.g. it won't run the same command on two nodes that share a replica). cstar is written in Python.
ETL for Beginners (Denver) - Wednesday, September 12
https://www.meetup.com/Denver-All-Things-Data/events/253587370/
Building a Data Pipeline That Benefits the Entire Company (Saint Louis) - Tuesday, September 11
https://www.meetup.com/St-Louis-Machine-Learning/events/253035115/
Cleveland Big Data Mega Meetup (Cleveland) - Monday, September 10
https://www.meetup.com/Cleveland-Hadoop/events/252334859/
Finding Cost Efficiencies with Spark & AWS (Tysons) - Wednesday, September 12
https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/253545616/
StreamSets at 2U + The StreamSets Test Framework (Brooklyn) - Monday, September 10
https://www.meetup.com/New-York-City-StreamSets-User-Group-Meetup/events/253825202/
Distributed Systems Recovery Patterns, with Denise Yu of Pivotal (New York) - Wednesday, September 12
https://www.meetup.com/time-series-nyc/events/252717914/
User Stories: Alluxio Production Use Cases with Presto and Hive (New York) - Thursday, September 13
https://www.meetup.com/Alluxio/events/254150089/
Apache Flink Meetup Year Review (London) - Monday, September 10
https://www.meetup.com/Apache-Flink-London-Meetup/events/252850181/
Apache Kafka & KSQL in Action + Distributed Tracing for Event-Driven Applications (Oslo) - Monday, September 10
https://www.meetup.com/Oslo-Kafka/events/254039906/
2nd Apache Airflow Meetup (Amsterdam) - Wednesday, September 12
https://www.meetup.com/Amsterdam-Airflow-meetup/events/253673642/
Integrating Kafka by PAYBACK and Confluent (Munich) - Wednesday, September 12
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/253853218/
Dynamic Data (Skopje) - Wednesday, September 12
https://www.meetup.com/Data-Science-Macedonia/events/254101887/
September Big Data Meetup (Bucharest) - Thursday, September 13
https://www.meetup.com/Bucharest-Big-Data-Meetup/events/253613029/
Kafka Connect in Practice (Moscow) - Monday, September 10
https://www.meetup.com/Moscow-Kafka-Meetup/events/254046443/
Quarterly Large-Scale Production Engineering Meetup (Bangalore) - Saturday, September 15
https://www.meetup.com/lspe-in/events/244024249/