04 November 2018
This is one of the biggest issues in some time (and I had to cut a bunch of good articles!). There's coverage of FlameGraphs for SQL queries, the various Kafka APIs and frameworks, Uber's cluster scheduling service, running Kafka on Kubernetes,
PIVOT in the upcoming Spark 2.4, testing idempotent producers in Kafka and Pulsar, and much more.
For data engineers who are frustrated with big bullshit data conferences and boring talks, we got a solution -
We are giving every DataEngWeekly reader a 20% discount code "DataEngWeekly" which can be redeemed for tickets here: http://buytickets.at/hakkalabs/192354/r/dataengweekly
Noria is a new data streaming system out of MIT that materializes SQL queries in order to service read-heavy, high-performance web applications. It's similar to existing systems but has some important design and architecture differences. The implementation is written in Rust and open sourced on GitHub.
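To make "materializes SQL queries" concrete, here is a minimal sketch of the general idea in Python: the result of a query is kept up to date on every write, so a read is a lookup rather than a scan. This is a toy model under stated assumptions, not Noria's API or implementation; the class name, the `votes` table, and the method names are hypothetical.

```python
from collections import Counter


class VoteCountView:
    """Toy materialized view (hypothetical, not Noria's API) for:
    SELECT article_id, COUNT(*) FROM votes GROUP BY article_id
    """

    def __init__(self):
        self.votes = []          # base table: (article_id, user_id) rows
        self.counts = Counter()  # materialized query result

    def insert_vote(self, article_id, user_id):
        # Each write updates the base table and the cached result together.
        self.votes.append((article_id, user_id))
        self.counts[article_id] += 1

    def read(self, article_id):
        # Reads never scan the base table; they look up the cached result.
        return self.counts[article_id]


view = VoteCountView()
view.insert_vote("a1", "alice")
view.insert_vote("a1", "bob")
view.insert_vote("a2", "carol")
print(view.read("a1"))  # 2
```

The trade-off in this sketch is that writes do a little more work so that reads stay cheap, which suits read-heavy workloads.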
The Confluent blog has an in-depth example of using KSQL to analyze streaming data to detect ATM fraud. The example covers a number of core features of KSQL and Kafka, like event time processing, joining streams, enriching data based on a database, and storing data in Elasticsearch.
FlameGraphs provide a useful visual representation of what's happening in a system. They're typically used for analyzing the performance of a program or the Linux kernel, and this post shows how to use them to analyze the performance of a database query.
An argument that machine learning model deployments have many of the same requirements as a microservice deployment and thus should follow the same continuous delivery process.
Over the years, the Kafka project has added a number of APIs and frameworks for getting data in and out of its storage. This post describes and contrasts many of them: The Kafka Producer API, Kafka Connect Source API, Kafka Streams API / KSQL, Kafka Consumer API, and Kafka Connect Sink API.
An Apache Kafka cluster has exactly one Controller broker, which is responsible for a number of housekeeping and related tasks, such as handling when a node leaves the cluster and creating/deleting topics. This post has an overview of the Controller and how it handles split brain, when a node re-joins a cluster, and more.
Uber writes about Peloton, their cluster and resource scheduling system built atop Mesos. They compare it to several other systems, describe the workloads it supports (including Spark jobs and workloads for the self-driving vehicle teams), and write about their plans to migrate from Mesos to Kubernetes.
Datadog's team of four SREs manages several data systems: PostgreSQL, Kafka, ZooKeeper, Cassandra, and Elasticsearch (and their over 40 Kafka clusters handle trillions of messages per day). In this presentation/video, they talk about how they leverage Kubernetes primitives such as liveness/readiness checks to run Kafka on k8s and perform normal cluster maintenance operations.
There's a new performance evaluation comparing Hive and Presto to Hive running on MR3, an execution engine with similarities to Tez/Hive LLAP. It uses the TPC-DS benchmark, and it compares the engines across both sequential and concurrent queries on clusters of several sizes (up to 40 nodes and a TPC-DS scale of 10TB).
Streak writes about their experience building a large scale (20K writes/sec at steady state) graph database with Google Cloud Spanner. They share both the good and some areas that need improvement. Streak has also open-sourced their Java ORM for Spanner, called Ratchet.
I somehow missed that PostgreSQL has had parallel scans for some time now. Here's a good write-up of how and when the optimizer determines if it should run a query in parallel.
SQL seems to be table stakes for any data product these days—whether batch or streaming. Apache Pulsar recently added a preview of SQL over data streams, and this post describes the architecture of the new feature.
Qubole has analyzed AWS instance types and performed a performance evaluation to determine which instances provide the best bang-for-buck for Presto workloads. The C4 instance family comes out on top, although it can fail some queries since it has less memory than similar-in-price instance types.
This post describes how to use the PIVOT feature that's coming to Spark SQL in the 2.4 release.
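For readers who haven't used a pivot before, here is a rough sketch of what the operation computes, written in plain Python rather than Spark so it runs anywhere. The `sales` data, column names, and the `pivot_sum` helper are made up for illustration, and edge-case behavior in Spark itself may differ from this toy.

```python
from collections import defaultdict

# Roughly the shape of the Spark SQL query being imitated:
#   SELECT * FROM (SELECT year, quarter, revenue FROM sales)
#   PIVOT (SUM(revenue) FOR quarter IN ('Q1', 'Q2'))


def pivot_sum(rows, group_key, pivot_key, value_key, pivot_values):
    """Turn distinct pivot_key values into columns, summing value_key per group."""
    table = defaultdict(lambda: {v: None for v in pivot_values})
    for row in rows:
        cells = table[row[group_key]]  # one output row per group
        column = row[pivot_key]
        if column in pivot_values:     # values not listed get no column
            cells[column] = (cells[column] or 0) + row[value_key]
    return dict(table)


sales = [
    {"year": 2017, "quarter": "Q1", "revenue": 10},
    {"year": 2017, "quarter": "Q2", "revenue": 20},
    {"year": 2017, "quarter": "Q1", "revenue": 5},
    {"year": 2018, "quarter": "Q1", "revenue": 7},
]

result = pivot_sum(sales, "year", "quarter", "revenue", ["Q1", "Q2"])
print(result)  # {2017: {'Q1': 15, 'Q2': 20}, 2018: {'Q1': 7, 'Q2': None}}
```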
An interesting analysis of Apache Kafka and Apache Pulsar behavior when using idempotent producers and introducing failure. Both systems perform well under the failure scenarios tested.
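As background on what "idempotent producer" means, the sketch below shows the general sequence-number deduplication idea: the broker remembers the last sequence number it appended for each producer and ignores a retry it has already stored. It is a toy model, not Kafka's or Pulsar's actual code; the `Partition` class, `send_with_retry`, and the status strings are hypothetical.

```python
class Partition:
    """Toy broker-side log that deduplicates by (producer_id, sequence number)."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last sequence number appended

    def append(self, producer_id, seq, message):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"     # already stored; safe to acknowledge again
        if seq != last + 1:
            return "out_of_order"  # a gap means an earlier message is missing
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return "appended"


def send_with_retry(partition, producer_id, seq, message, lose_first_ack=False):
    """Send once; if the ack is 'lost', resend with the same sequence number."""
    status = partition.append(producer_id, seq, message)
    if lose_first_ack:
        # The write succeeded but the producer never saw the ack, so it retries.
        status = partition.append(producer_id, seq, message)
    return status


partition = Partition()
first = send_with_retry(partition, producer_id=1, seq=0, message="a")
second = send_with_retry(partition, producer_id=1, seq=1, message="b", lose_first_ack=True)
print(first, second, partition.log)  # appended duplicate ['a', 'b']
```

Without the sequence check, the retried message "b" would appear twice in the log, which is the failure mode this kind of test looks for.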
The data Artisans blog has a side-by-side comparison of savepoints and checkpoints in Apache Flink.
Wallaroo writes about the design of data resilience in their streaming platform. Unlike other streaming systems that rely on a local disk or a distributed file system, Wallaroo outsources its data storage to servers running an FTP-like service that supports file appends. The post goes into the details of this design.
Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/
The AWS Big Data Blog has a post on improvements that they've made to Amazon Redshift over the past 5 months. They benchmark using TPC-DS to demonstrate a 3.5x speedup during that time.
The Apache Beam Summit was in early October, and the Beam blog has a recap of the event and links to the slides and recordings of the presentations.
Rockset is a new data platform as a service company that has emerged from stealth with an impressive round of funding. Rockset ingests data from Apache Kafka, Kinesis, Object Storage, etc. and makes it queryable via SQL.
Cloudera writes about the exploits attacking Hadoop clusters. To experience them for themselves, they put a cluster on the internet without any port restrictions, and it was exploited within minutes. Their blog has a few articles with more about what they're up to.
Apache Flink 1.5.5 and 1.6.2 were both released this week with fixes and minor improvements.
Cockroach Labs announced the availability of their managed service offering alongside version 2.1 of their database. The release of 2.1 includes improved scalability, new tooling for migrating from MySQL and PostgreSQL, and beta support for change data capture.
TimescaleDB, the open source time series database, has hit version 1.0. It's been in the release candidate phase for a couple of months and includes a lot of features that make it production-ready.
Dremio 3.0 was announced, with a new data catalog, improved security, deployments via Kubernetes and Helm, major performance improvements, and support for new data sources.
Version 0.5.4 of the Wallaroo stream processing system has been released. It is the first release with Python 3 support.
Apache HBase 2.1.1 was released with over 60 resolved issues.
Version 3.1.1 of Apache Hive was released. It's a small changeset that includes two bug fixes and one new feature.
Curated by Datadog ( http://www.datadog.com )
Building Scalable & Reliable Real-Time Streaming Apps with Kafka and KSQL (Menlo Park) - Wednesday, November 7
Data-Intensive Systems on Kubernetes + More (Seattle) - Thursday, November 8
Data Ingestion and Entity Linking (Kitchener) - Wednesday, November 7
Security in Azure + ServiceBus/Kafka with Azure Kubernetes Service (Quebec) - Thursday, November 8
Big Data Processing with Apache Kafka and KSQL (Toulouse) - Tuesday, November 6
Processing IoT Data from End to End with MQTT and Apache Kafka (Munich) - Monday, November 5
Real-Time Stream Processing with Apache Flink (Vienna) - Wednesday, November 7
Apache Spark Meetup @ Cloudera (Budapest) - Tuesday, November 6
Autumn budapest.py Meetup (Budapest) - Thursday, November 8
Uber Engineering Tech Talk (Sofia) - Tuesday, November 6
Secure Data Flows from Edge to Core with Apache NiFi and MiNiFi (Sydney) - Monday, November 5
Apache Cassandra & Apache Kafka: The How/When/Why (Canberra) - Thursday, November 8