Data Eng Weekly


Data Eng Weekly Issue #285

14 October 2018

This week, we hear from Branch on operating Kafka in the AWS cloud, CallApp on their Cassandra upgrade/migration, King on their experiences with Google BigQuery, and Twitter on how they scale ZooKeeper. There are two good posts describing internals of Apache Kafka and Wallaroo as well as plenty of other great articles.

Sponsor

For data engineers who are frustrated with running custom monitoring scripts and reactive troubleshooting - intermix.io is a single dashboard that helps you instantly understand Amazon Redshift performance, dependencies, and bottlenecks. Fix slow dashboards and run faster queries. See across your cluster, data apps and users. Plan ahead and grow with your data.

We are giving every DataEngWeekly reader an extended free trial. Visit https://rebrand.ly/dataeng-redshift-performance to start.

Technical

A neat project with Kafka, KSQL, and H2O.ai to detect anomalies in home power usage and trigger a notification.

https://medium.com/@simon.aubury/machine-learning-kafka-ksql-stream-processing-bug-me-when-ive-left-the-heater-on-bd47540cd1e8

Branch shares their experience operating Kafka in a cloud environment. They recently switched instance types from spinning disks to NVMe storage. To keep disks balanced and offset the extra storage cost, they migrated to a RAID0-style ZFS setup with lz4 compression. The lessons they share about that migration and how they operate the system are well worth reading.

https://medium.com/branch-engineering/optimizing-kafka-hardware-selection-within-aws-amazon-web-services-69fb4b10198
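As a rough sketch of the setup the post describes (device names, pool name, and mountpoint here are placeholders, not Branch's actual configuration), a striped ZFS pool with lz4 compression looks something like:

```shell
# Create a zpool striped across two NVMe devices; with no mirror/raidz
# vdevs, ZFS stripes writes across the drives (RAID0-equivalent).
zpool create kafka-pool /dev/nvme0n1 /dev/nvme1n1

# Enable lz4 compression and mount where the Kafka log dirs will live.
zfs set compression=lz4 kafka-pool
zfs set mountpoint=/var/lib/kafka kafka-pool

# Check how much space lz4 is actually saving on the stored log segments.
zfs get compressratio kafka-pool
```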

Apache Hadoop Ozone, the object store for Apache Hadoop, is now available in an alpha release. This post describes the design goals and future work of the project.

https://hortonworks.com/blog/introducing-apache-hadoop-ozone-object-store-apache-hadoop/

Great overview of how CallApp migrated from Cassandra 0.7 to Cassandra 3.11. Since the upgrade spans so many versions, Cassandra's upgrade tools don't support a direct path. Instead, they came up with some clever solutions to load data into the new cluster, write to both clusters simultaneously, and verify that the data was loaded correctly. The post shares a number of code snippets and lessons learned.

https://medium.com/callapp/database-migration-at-scale-ae85c14c3621
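The dual-write step of a migration like this can be sketched in a few lines. This is an illustrative pattern only — the classes below are stand-ins, not the real Cassandra driver or CallApp's code:

```python
class ClusterClient:
    """Stand-in for a Cassandra session; stores rows in a dict."""
    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows.get(key)


class DualWriter:
    """Send every write to both clusters; verify reads agree."""
    def __init__(self, old_cluster, new_cluster):
        self.old = old_cluster
        self.new = new_cluster

    def write(self, key, value):
        self.old.write(key, value)  # old cluster stays authoritative
        self.new.write(key, value)  # new cluster receives the same write

    def verify(self, key):
        # Spot-check that the new cluster matches the old one.
        return self.old.read(key) == self.new.read(key)
```

Once verification passes for both the backfilled and the dual-written data, reads can be cut over to the new cluster and the old one retired.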

Good overview of why Object Storage has gained popularity over distributed file systems in recent years, including the rise of Spark and better total cost of ownership.

https://blog.minio.io/modern-data-lake-with-minio-part-1-716a49499533

JavaScript might not be your first choice for processing large data files, but if you find yourself in that position then here's a primer on how to get decent performance without using too much memory. Interestingly, even file processing in Node.js is done via event handlers/callbacks when using the file stream API.

https://itnext.io/using-node-js-to-read-really-really-large-files-pt-1-d2057fe76b33

King, the makers of Candy Crush and other games, share their experience evaluating Google BigQuery against an on-prem Hive cluster and an MPP database. Using a sample of data based on common queries, they loaded several gigabytes into Google Cloud Storage, then benchmarked BigQuery against Hive, the MPP database, and Impala across several queries and concurrency levels.

https://medium.com/@TechKing/benchmarking-google-bigquery-at-scale-13e1e85f3bec

This post describes the security features of Apache Kafka 2.0 and Confluent Platform 5.0. Of note, Kafka Connect can now load externalized secrets (by writing a plugin), and prefixed ACLs with wildcards can be used for access control.

https://www.confluent.io/blog/enhance-security-apache-kafka-2-0-confluent-platform-5-0

AWS has published a new white paper detailing best practices for migrating an HBase deployment to HBase-on-S3-on-EMR. Its 50 pages cover everything from key configuration settings to performance testing to graceful cluster termination.

https://aws.amazon.com/blogs/big-data/migrate-to-apache-hbase-on-amazon-s3-on-amazon-emr-guidelines-and-best-practices/

Wat-provenance (wat is short for why-across-time) is a new proposal for data provenance tracking that can also be used for debugging distributed systems. The paper includes examples and references a prototype implementation of a distributed debugging framework.

https://dl.acm.org/citation.cfm?doid=3267809.3267839

One of the key differentiators for Apache Pulsar is tiered storage, including support for object stores like Amazon S3. This post shows how to configure Pulsar to offload cold data to S3 and how to verify the settings.

https://streaml.io/blog/configuring-apache-pulsar-tiered-storage-with-amazon-s3
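For a sense of the configuration involved, the Pulsar broker settings and admin commands follow this shape (bucket name, region, topic, and threshold below are placeholders):

```shell
# broker.conf settings along these lines enable the S3 offloader:
#   managedLedgerOffloadDriver=aws-s3
#   s3ManagedLedgerOffloadBucket=my-pulsar-offload
#   s3ManagedLedgerOffloadRegion=us-east-1

# Manually trigger offload of a topic's backlog beyond 10 MB, then verify:
bin/pulsar-admin topics offload --size-threshold 10M \
    persistent://public/default/my-topic
bin/pulsar-admin topics offload-status persistent://public/default/my-topic
```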

The data Artisans blog has a look at how Apache Flink leverages Apache Kafka semantics to ensure exactly-once delivery of messages and how it handles failure.

https://data-artisans.com/blog/how-apache-flink-manages-kafka-consumer-offsets

Twitter has written about how they scale Apache ZooKeeper for their use cases, which include service discovery and leader election in distributed systems. They describe common patterns (and anti-patterns) and the tools they've built around observability, topology changes, and backup.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2018/zookeeper-at-twitter.html

Wallaroo has an illustrated guide explaining how they've implemented snapshotting in their distributed stream processing engine. It describes the Chandy-Lamport snapshot algorithm and Wallaroo's implementation, which incorporates some improvements (based on a paper about Flink's algorithm).

https://blog.wallaroolabs.com/2018/10/checkpointing-and-consistent-recovery-lines-how-we-handle-failure-in-wallaroo/

Jobs

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

Releases

Google BigQuery has expanded into new regions in Europe and has implemented changes for GDPR compliance.

https://www.datanami.com/2018/10/10/google-extends-bigquery-reach-across-europe/

Confluent's Stream Processing Cookbook collects KSQL recipes, including ones for processing syslog data, filtering and masking data, and converting from Avro to CSV.

https://www.confluent.io/blog/ksql-recipes-available-now-stream-processing-cookbook

RAPIDS is a new suite of open source libraries from NVIDIA to add GPU acceleration to data science and machine learning pipelines.

https://www.datanami.com/2018/10/10/nvidia-platform-pushes-gpus-into-machine-learning-high-performance-data-analytics/

Built on the RAPIDS project, BlazingSQL is a new GPU-accelerated database engine. While BlazingSQL itself isn't open source, there's a free version (behind an email wall). This post has some details about the features (including machine learning and deep learning libraries) and a roadmap.

https://blog.blazingdb.com/announcing-blazingsql-a-gpu-sql-engine-for-rapids-open-source-software-from-nvidia-11e115ba7dd7

Dask-jobqueue is a new library for running Dask on an HPC cluster. This post shows how to use it with Jupyter notebooks to run a job.

https://medium.com/pangeo/dask-jobqueue-d7754e42ca53

Quicksign has open sourced their end-to-end encryption library for Apache Kafka. The documentation describes their design goals, use cases, and example producer/consumer usage.

https://code.quicksign.io/kafka-encryption/
https://github.com/Quicksign/kafka-encryption

Apache Arrow 0.11.0 is released. There are a number of changes in the release, including the merger of the Parquet and Arrow C++ codebases.

http://arrow.apache.org/blog/2018/10/09/0.11.0-release/


Events

Curated by Datadog ( http://www.datadog.com )

California

Women in Data Meetup (San Francisco) - Thursday, October 18
https://www.meetup.com/SF-Big-Analytics/events/254997685/

Washington

The Open Source Data Warehouse Killer in the Cloud (Bellevue) - Wednesday, October 17
https://www.meetup.com/Big-Data-Bellevue-BDB/events/253981013/

Utah

Modern Data Pipelines Using Kafka Streaming and Kubernetes (Salt Lake City) - Wednesday, October 17
https://www.meetup.com/utah-data-engineering-meetup/events/254979894/

Tennessee

Creating a Modern Data Engineering Infrastructure (Nashville) - Wednesday, October 17
https://www.meetup.com/Nashville-Data-Engineers/events/255212713/

New Jersey

Hadoop 3: Architecture, Features, and Improvements (Hamilton) - Thursday, October 18
https://www.meetup.com/nj-dapp/events/254826592/

Massachusetts

Apache Spark User Group: October Presentation Night (Boston) - Tuesday, October 16
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/254961641/

NETHERLANDS

Upcoming Apache Spark and Data Lineage (Amsterdam) - Wednesday, October 17
https://www.meetup.com/Amsterdam-Spark/events/255387738/

GERMANY

Introduction to Kafka Streams (Stuttgart) - Tuesday, October 16
https://www.meetup.com/Stuttgart-Data-Engineers/events/255362273/

AUSTRIA

Processing Streaming Data with KSQL (Wien) - Thursday, October 18
https://www.meetup.com/Vienna-Kafka-meetup/events/254783193/

CZECH REPUBLIC

Apache Beam (Prague) - Thursday, October 18
https://www.meetup.com/CS-HUG/events/255361277/

CHINA

Kafka Shanghai Meetup (Shanghai) - Sunday, October 21
https://www.meetup.com/Shanghai-Big-Data-Streaming-Meetup/events/254606732/