Data Eng Weekly Issue #300

03 February 2019

Today's 300th issue of Data Eng Weekly covers a lot of content, including the GPU-powered analytics DB from Uber, database join implementations, CEP with Apache Flink, and new Kubernetes operators for Spark and Cassandra. In news, the agenda for the upcoming Kafka Summit has been announced, and there's an interesting post analyzing trends in this newsletter over the years.

Technical

Intermix has captured a list of articles about how companies are using Amazon Redshift to build data pipelines. There are summaries of each and several architecture diagrams with links out to the individual blog posts.

https://www.intermix.io/blog/14-data-pipelines-amazon-redshift/

A good introduction to concurrency in databases, including the MVCC architecture pattern and how concurrency relates to other performance metrics.

https://www.vividcortex.com/blog/what-is-concurrency-in-a-database

This post describes a situation in which Apache Spark's lazy computation results in counterintuitive behavior when building a relatively straightforward application.

https://blog.godatadriven.com/spark-beware
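
To make the general idea of lazy computation concrete, here is a minimal plain-Python analogy. It is not Spark code and does not reproduce the scenario from the post; the `parse` function and the sample data are made up for illustration. It only shows that with a lazy pipeline, work happens when the result is consumed rather than when the transformation is defined, so changes made in between are picked up.

```python
calls = []  # records every time the transformation actually runs


def parse(value):
    """Hypothetical transformation with a visible side effect."""
    calls.append(value)
    return value * 2


numbers = [1, 2, 3]

# Defining the pipeline does no work yet: map() is lazy in Python 3.
doubled = map(parse, numbers)
calls_before = list(calls)

# The input changes after the transformation was defined...
numbers.append(4)

# ...and only now, when the result is consumed, does parse() run.
result = list(doubled)

print(calls_before)
print(result)
```

The transformation sees the input as it is at consumption time, which is the kind of surprise that lazy evaluation can cause in larger applications.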

Cloudera writes about changes to Apache Impala that improve throughput on large clusters (6x for TPC-DS) and reliability (the query success rate in the benchmark is higher, too). The biggest change is a reduction in the number of TCP connections, thanks to a new RPC framework (from the Apache Kudu project) that is used for some operations.

https://blog.cloudera.com/blog/2019/01/scalability-improvement-of-apache-impala-2-12-0-in-cdh-5-15-0/

Uber writes about AresDB, their open source database for real-time analytics. AresDB uses GPUs to optimize performance, and they write about the motivation for this architecture as well as the main components of the system. AresDB includes column-based storage with compression, upsert capabilities, and its own query language. The article includes a good overview of the columnar storage engine and these other components.

https://eng.uber.com/aresdb/

This post walks through the MATCH_RECOGNIZE SQL command to perform complex event processing with Apache Flink. They use a taxi trip dataset as the example, and the query finds matching events for a particular ride.

https://www.da-platform.com/blog/match_recognize-where-flink-sql-and-complex-event-processing-meet

This article provides a great introduction to database joins. It demonstrates a simple join and a hash join using Python, shows a common query plan for a hash join, and looks at other types of joins that utilize sorted datasets and B-tree indexes.

http://blog.felipe.rs/2019/01/29/demystifying-join-algorithms/
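
For readers who want a feel for the hash join mentioned above, here is a minimal sketch in Python. It is not the code from the article: the `hash_join` function, the row format (lists of dicts), and the sample data are assumptions made for illustration. It shows the two phases of an in-memory equi-join: build a hash table on one input, then probe it with the other.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner equi-join of two lists of dicts on `key` (illustrative only)."""
    # Build phase: index one input by the join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Probe phase: look up each row of the other input in the hash table.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data.
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "lin"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "cup"},
]

result = hash_join(users, orders, "user_id")
print(result)
```

Building on the smaller input keeps the hash table small; a nested-loop join would instead compare every pair of rows.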

The Last Pickle has a great checklist for deploying a new Apache Cassandra cluster. It covers important server and client configuration options, operational items like monitoring and backups, and other best practices. There are fourteen recommendations in total.

http://thelastpickle.com/blog/2019/01/30/new-cluster-recommendations.html

Cockroach has a look at how they implemented vectorized processing in their hash join executor, which resulted in up to 40x speedups. The post includes example code and some benchmark results.

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/

Dropbox writes about the scalability tests that they've performed to understand the limits of their Apache Kafka clusters. The analysis is a great example of how to effectively load test a distributed system (e.g. the types of simplifying assumptions they made) as well as of how to test Kafka in particular (including the metrics that they measured to detect when the brokers become overloaded).

https://blogs.dropbox.com/tech/2019/01/finding-kafkas-throughput-limit-in-dropbox-infrastructure/

Salesforce writes about how they used Apache Kafka (managed by Heroku) to build their ChatBots platform. They describe the major benefits of their architecture: deferring high-latency HTTP calls and efficient failover/reprocessing.

https://engineering.salesforce.com/building-a-scalable-event-pipeline-with-heroku-and-salesforce-2549cb20ce06

A good introduction to two mechanisms, field promotion and hashing, for efficiently storing time series data in Google Cloud Bigtable (or another database system with the same design, like Apache HBase).

https://medium.com/@duhroach/cloud-bigtable-time-series-data-eecc32dd9cf2
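
As a rough sketch of what these two row-key techniques can look like, the Python below builds keys as plain strings. Nothing here comes from the article or from a Bigtable client library: the function names, the `#` separator, the zero-padded timestamp, and the bucket count are all assumptions for illustration.

```python
import hashlib


def promoted_key(device_id: str, ts: int) -> str:
    """Field promotion (assumed form): put the device ID in the row key
    ahead of the timestamp so one device's rows sort together by time."""
    return f"{device_id}#{ts:010d}"


def hashed_key(device_id: str, ts: int, buckets: int = 4) -> str:
    """Hashing (assumed form): prefix the key with a bucket derived from
    a hash of the device ID to spread writes across the key space."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket}#{device_id}#{ts:010d}"


# Hypothetical readings as (device_id, unix_timestamp) pairs.
readings = [("b", 2), ("a", 10), ("a", 9)]
sorted_keys = sorted(promoted_key(d, t) for d, t in readings)
print(sorted_keys)
print(hashed_key("a", 1))
```

Zero-padding the timestamp keeps lexicographic order consistent with numeric order, which matters because rows are stored sorted by key.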

Alibaba's fork of Apache Flink, called Blink, includes a lot of work to improve performance. Under the Flink Improvement Proposal process, that work is now starting to be contributed back to core Apache Flink. FLIP-32 has the details of the first part of this work, which is focussed on the table and SQL APIs.

https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions

A tutorial and accompanying code for using the Google Cloud Python SDK to load data to cloud storage and BigQuery.

https://hackersandslackers.com/getting-started-google-big-query-python/

The author of this post argues that SQL is often better than procedural languages for big data. The article describes how feature constraints can lead to a more powerful standard.

https://medium.com/@pankajroark/less-is-more-sql-for-bigdata-b34c2b1603ce

Jobs

Data Engineer - Python, Wooga, Berlin https://jobs.dataengweekly.com/jobs/63fbb5ea-1c49-463f-bda7-598a56a13831

News

A Data Eng Weekly reader analyzed the content and topics that I've covered in this newsletter over the past 6 years. The interesting analysis includes a couple of comparisons over time (for example, Hadoop vs. Kafka), and trends in mentions by calendar year.

https://blog.marouni.fr/bidata-trends-analysis/

Tonic has a post on the provisions of the California Consumer Privacy Act, which shares some similarities with the GDPR. The post describes what the law will mean both for companies and for consumers when it comes into effect in 2020.

https://www.tonic.ai/blog/ccpa-will-hit-your-dev-team-harder-than-gdpr

The agenda for Kafka Summit New York, which takes place in April, has been announced. There are four tracks: Core Kafka, Stream Processing, Event-Driven Development, and Use Cases. Early bird pricing is available until Feb 8th.

https://www.confluent.io/blog/program-committee-has-chosen-kafka-summit-nyc-2019-agenda

The Presto Software Foundation is a new organization dedicated to the advancement of the Presto SQL engine.

https://www.datanami.com/2019/01/31/presto-backers-bolster-its-open-source-origins/

Releases

MapR has coverage of the new features in the recently released Apache Drill 1.15, which is part of the MapR Ecosystem Pack 6.1 release. Major new features include a plugin for querying data in S3, better ANSI SQL compatibility, and Parquet row group pruning. The post has details on each of these and other major components of the release.

https://mapr.com/blog/mapr-announces-drill-1-15-with-s3-cloud-storage-plugin/

Zerocode is a tool for automating tests of HTTP, Kafka, and DB services. Test scenarios are written in JSON, which abstracts away a lot of the boilerplate in setting up a client or server.

https://github.com/authorjapps/zerocode/releases/tag/zerocode-tdd-parent-1.3.0

Google Cloud Firestore, a serverless document database, is now generally available. It includes an uptime SLA, is available in many regions, and more.

https://cloud.google.com/blog/products/databases/announcing-cloud-firestore-general-availability-and-updates

Google has announced a new open source operator for running Apache Spark on Kubernetes. It's currently in beta, and there's a design document that describes the architecture.

https://www.zdnet.com/article/google-announces-kubernetes-operator-for-apache-spark/

Also on the Kubernetes operator front, Instaclustr has an open-source operator project for Apache Cassandra. It's under development and is currently labeled as alpha.

https://github.com/instaclustr/cassandra-operator

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Streaming, TensorFlow, and Use Cases! (San Francisco) - Thursday, February 7
https://www.meetup.com/SF-Big-Analytics/events/258174292/

Missouri

Docker on Hadoop (Saint Louis) - Wednesday, February 6
https://www.meetup.com/St-Louis-Big-Data-IDEA/events/257350310/

New York

Stream Processing at Scale + Cloud-Scale Read/Write SQL Caching (New York) - Wednesday, February 6
https://www.meetup.com/NYC-In-Memory-Computing-Meetup/events/258338807/

CANADA

How to Structure Your Data Pipeline + New Kafka Features You Might Not Know (Vancouver) - Tuesday, February 5
https://www.meetup.com/vancouver-kafka/events/257110501/

SWEDEN

Apache Kafka in the Cloud for Busy Software Engineers (Stockholm) - Tuesday, February 5
https://www.meetup.com/Stockholm-Apache-Kafka-Meetup-by-Confluent/events/258189320/

FRANCE

Do Microservices Dream about CQRS, Kafka Stream and BPMN? (Echirolles) - Monday, February 4
https://www.meetup.com/AlpesJUG/events/258464423/

Let's Go Back to the Basic Properties of Apache Kafka (Nanterre) - Tuesday, February 5
https://www.meetup.com/AXA-Meetup-PARIS/events/258132084/

NETHERLANDS

9th Apache Kafka Meetup (Utrecht) - Thursday, February 7
https://www.meetup.com/Kafka-Meetup-Utrecht/events/257703587/

GERMANY

Data Day #1 (Hamburg) - Tuesday, February 5
https://www.meetup.com/datadayhh/events/257282958/

POLAND

SMACK Reference Architecture + Zero-Downtime Deployment (Wroclaw) - Wednesday, February 6
https://www.meetup.com/zJAVA-w-Objectivity/events/258291872/

Data Meetup #2 (Warszawa) - Thursday, February 7
https://www.meetup.com/Data-Boardgames/events/258185622/

AUSTRALIA

Information on Time: Remote Data Ingestion and Transformation with NiFi/MiNiFi and IoT (Melbourne) - Tuesday, February 5
https://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/258414497/

Apache Kafka + Advanced Cybersecurity Analytics (Canberra) - Thursday, February 7
https://www.meetup.com/Canberra-Big-Data-Meetup/events/257195644/