Data Eng Weekly


Data Eng Weekly Issue #288

04 November 2018

This is one of the biggest issues in some time (and I had to cut a bunch of good articles!). There's coverage of FlameGraphs for SQL queries, the various Kafka APIs and frameworks, Uber's cluster scheduling service, running Kafka on Kubernetes, PIVOT in the upcoming Spark 2.4, testing idempotent producers in Kafka and Pulsar, and much more.

Featured Event

For data engineers who are frustrated with big bullshit data conferences and boring talks, we got a solution - is the first community powered No-Bullshit Data event designed for and by data engineers and scientists. Taking place this year at Columbia University in NYC on Nov 8-9th. They have 4 dedicated tracks this year: Data Engineering, Data Science, AI Products and the brand new Hero Engineering. Join geeks from the west & east coast within companies like Facebook, Salesforce, Netflix, WeWork, MIT, Beeswax, Lyft, Stitch Data, Datadog, Segment, Starburst, Datacoral, Columbia University, Capital One, TapRecruit, Figure Eight, Dia&Co and many more.

We are giving every DataEngWeekly reader a 20% discount code "DataEngWeekly" which can be redeemed for tickets here: http://buytickets.at/hakkalabs/192354/r/dataengweekly

Technical

Noria is a new data streaming system out of MIT that materializes SQL queries in order to service read-heavy, high performance web applications. It's similar to existing systems but has some important design and architecture differences. The implementation is written in Rust and open sourced on Github.

https://blog.acolyer.org/2018/10/29/noria-dynamic-partially-stateful-data-flow-for-high-performance-web-applications/

The Confluent blog has an in-depth example of using KSQL to analyze streaming data to detect ATM fraud. The example covers a number of core features of KSQL and Kafka, like event time processing, joining streams, enriching data based on a database, and storing data in ElasticSearch.

https://www.confluent.io/blog/atm-fraud-detection-apache-kafka-ksql

FlameGraphs provide a useful visual representation of what's happening in a system. They're typically used by analyzing performance of a program or the linux kernel, and this post shows how to use it to analyze performance of a database query.

https://blog.tanelpoder.com/posts/visualizing-sql-plan-execution-time-with-flamegraphs/

An argument that machine learning model deployments have many of the same requirements as a microservice deployment and thus should follow the same continuous delivery process.

https://riccomini.name/models-microservices-should-be-using-same-continuous-delivery-stack

Over the years, the Kafka project has added a number of APIs and frameworks for getting data in and out of its storage. This post describes and contrasts many of them: The Kafka Producer API, Kafka Connect Source API, Kafka Streams API / KSQL, Kafka Consumer API, and Kafka Connect Sink API.

https://medium.com/@stephane.maarek/the-kafka-api-battle-producer-vs-consumer-vs-kafka-connect-vs-kafka-streams-vs-ksql-ef584274c1e

An Apache Kafka cluster has exactly one Controller broker, which is responsible for a number of housekeeping and related tasks, such as handling when a node leaves the cluster and creating/deleting topics. This post has an overview of the Controller and how it handles split brain, when a node re-joins a cluster, and more.

https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302

Uber writes about Peloton, their cluster and resource scheduling system built atop Mesos. They compare it to several other systems, describe the workloads it supports (including Spark jobs and workloads for the self-driving vehicle teams), and write about their plans to migrate from Mesos to Kubernetes.

https://eng.uber.com/peloton/

Datadog's team of four SREs manage several data systems: PostgreSQL, Kafka, ZooKeeper, Cassandra, and ElasticSearch (and their over 40 Kafka clusters hand trillions of messages per day). In this presentation/video, they talk about how they leverage Kubernetes primitives such as liveness/readiness checks to run Kafka on k8s and perform normal cluster maintenance operations.

https://speakerdeck.com/brouberol/scaling-and-operating-kafka-in-kubernetes
https://www.youtube.com/watch?v=7eEOzguzOg0

There's a new performance evaluation out of Hive and Presto to Hive running on MR3, the execution engine with similarities to Tez/Hive LLAP. It uses the TPC-DS benchmark, and it compares the engines across both sequential and concurrent queries on several different size clusters (up to 40 nodes and TPC-DS scale of 10TB).

https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/

Streak writes about their experience building a large scale (20K writes/sec at steady state) graph database with Google Cloud Spanner. They share both the good and some areas that need improvement. Streak has also open-sourced their Java ORM for Spanner, called Ratchet.

https://cloud.google.com/blog/products/databases/how-streak-built-a-graph-database-on-cloud-spanner-to-wrangle-billions-of-emails
https://github.com/StreakYC/Ratchet

I somehow missed that PostgreSQL has had parallel scans for some time now—he's a good write up of how and when the optimizer determines if it should run a query in parallel.

https://rafiasabih.blogspot.com/2018/10/using-parallel-sequential-scan-in.html

SQL seems to be table stakes for any data product these days—whether batch or streaming. Apache Pulsar recently added a preview of SQL over data streams, and this post describes the architecture of the new feature.

https://streaml.io/blog/querying-data-streams-with-apache-pulsar-sql

Qubole has analyzed AWS instance types and performed a performance evaluation to determine which instances provide the best bang-for-buck for Presto workloads. The C4 instance family comes out on top, although it can fail some queries since it has less memory than similar-in-price instance types.

https://www.qubole.com/blog/presto-performance-for-ad-hoc-workloads-on-aws-instance-types/

This post describes how to use the PIVOT feature that's coming to Spark SQL in the 2.4 release.

https://databricks.com/blog/2018/11/01/sql-pivot-converting-rows-to-columns.html

An interesting analysis of Apache Kafka and Apache Pulsar behavior when using idempotent producers and introducing failure. Both systems perform well under the failure scenarios tested.

https://jack-vanlightly.com/blog/2018/10/25/testing-producer-deduplication-in-apache-kafka-and-apache-pulsar

The dataArtisan's blog has a side-by-side comparison of SavePoints and CheckPoints in Apache Flink.

https://data-artisans.com/blog/differences-between-savepoints-and-checkpoints-in-flink

Wallaroo writes about the design of data resilience in their streaming platform. Unlike other streaming systems that rely on a local disk or a distributed file system, Wallaroo outsources is data storage to servers running an FTP-like service that supports file appends. The post goes into the details of this design.

https://blog.wallaroolabs.com/2018/11/the-treacherous-tangle-of-redundant-data-resilience-for-wallaroo/

Jobs

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

The AWS Big Data Blog has a post on improvements that they've made to Amazon Redshift over the past 5 months. They benchmark using TPC-DS to demonstrate a 3.5x speedup during that time.

https://aws.amazon.com/blogs/big-data/performance-matters-amazon-redshift-is-now-up-to-3-5x-faster-for-real-world-workloads/

The Apache Beam Summit was in early October, and the Beam blog has a recap of the event and links to the slides and recordings of the presentations.

https://beam.apache.org/blog/2018/10/31/beam-summit-aftermath.html

Rockset is a new data platform as a service company that has emerged from stealth with an impressive round of funding. Rockset ingests data from Apache Kafka, Kinesis, Object Storage, etc. and makes it queryable via SQL.

https://www.datanami.com/2018/11/01/rockset-sql-cloud-service-emerges-from-stealth/

Cloudera writes about the exploits attacking Hadoop clusters. To experience them for themselves, they put a cluster on the internet without any port restrictions, and it was exploited within minutes. Their blog has a few articles with more about what they're up to.

http://blog.cloudera.com/blog/2018/11/protecting-hadoop-clusters-from-malware-attacks/
https://rockset.com/blog/

Releases

Apache Flink 1.5.5 and 1.6.2 were both released this week with fixes and minor improvements.

https://flink.apache.org/news/2018/10/29/release-1.5.5.html
https://flink.apache.org/news/2018/10/29/release-1.6.2.html

Cockroach Labs announced the availability of their managed service offering alongside version 2.1 of their database. The release of 2.1 includes improved scalability, new tooling for migrating from MySQL and PostgreSQL, and beta support for change data capture.

https://www.cockroachlabs.com/blog/launching-managed-cockroachdb/
https://www.cockroachlabs.com/blog/cockroachdb-2dot1-release/

TimescaleDB, the open source time series database, has hit version 1.0. It's been in the release candidate phase for a couple of months and includes a lot of features that make it production-ready.

https://blog.timescale.com/1-0-enterprise-production-ready-time-series-database-open-source-d32395a10cbf

Dremio 3.0 was announced, with a new data catalog, improved security, deployments via Kubernetes and Helm, major performance improvements, and support for new data sources.

https://docs.dremio.com/release-notes/30-release-notes.html

Version 0.5.4 of the Wallaroo stream processing system has been released. It is the first release with Python 3 support.

https://github.com/WallarooLabs/wallaroo/releases/tag/0.5.4

Apache Hbase 2.1.1 was released with over 60 resolved issues.

https://lists.apache.org/thread.html/2f2b9858bb71b21a98ae87ab84065293b89de7384c2e90694ce579e1@%3Cannounce.apache.org%3E

Version 3.1.1 of Apache Hive was released. It's a small changeset that includes two bug fixes and one new feature.

https://lists.apache.org/thread.html/9c293779deee276d8d364e619c22b3974283bc11326affa652955b2c@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

California

Building Scalable & Reliable Real-Time Streaming Apps with Kafka and KSQL (Menlo Park) - Wednesday, November 7
https://www.meetup.com/KafkaBayArea/events/255670377/

Washington

Data-Intensive Systems on Kubernetes + More (Seattle) - Thursday, November 8
https://www.meetup.com/Women-in-Infrastructure-Seattle-Chapter/events/255790869/

CANADA

Data Ingestion and Entity Linking (Kitchener) - Wednesday, November 7
https://www.meetup.com/Waterloo-Kitchener-Elasticsearch-Meetup/events/255677927/

Security in Azure + ServiceBus/Kafka with Azure Kubernetes Service (Quebec) - Thursday, November 8
https://www.meetup.com/AzureQC/events/253300179/

FRANCE

Big Data Processing with Apache Kafka and KSQL (Toulouse) - Tuesday, November 6
https://www.meetup.com/Tlse-Data-Science/events/255933453/

GERMANY

Processing IoT Data from End to End with MQTT and Apache Kafka (Munich) - Monday, November 5
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/255747151/

AUSTRIA

Real-Time Stream Processing with Apache Flink (Vienna) - Wednesday, November 7
https://www.meetup.com/futureofdata-vienna/events/254138753/

HUNGARY

Apache Spark Meetup @ Cloudera (Budapest) - Tuesday, November 6
https://www.meetup.com/Budapest-Spark-Meetup/events/255917134/

Autumn budapest.py Meetup (Budapest) - Thursday, November 8
https://www.meetup.com/budapest-py/events/255415841/

BULGARIA

Uber Engineering Tech Talk (Sofia) - Tuesday, November 6
https://www.meetup.com/Uber-Engineering-Events-Sofia/events/255829953/

AUSTRALIA

Secure Data Flows from Edge to Core with Apache NiFi and MiNiFi (Sydney) - Monday, November 5
https://www.meetup.com/futureofdata-sydney/events/255597024/

Apache Cassandra & Apache Kafka: The How/When/Why (Canberra) - Thursday, November 8
https://www.meetup.com/Canberra-Big-Data-Meetup/events/255253044/