Data Eng Weekly


Data Eng Weekly Issue #286

21 October 2018

Several great architecture posts this week covering Apache Hadoop Ozone, Pravega, Alibaba's distributed file system, FaunaDB and Apache Pulsar. There are also interesting posts on Uber's data platform, Wallaroo, Apache Airflow, and more. In news, there's a CFP, an upcoming conference, videos from StrangeLoop, and keynote videos from Kafka Summit. Lots of great stuff to read/watch!

Technical

This presentation provides a good introduction to Apache Airflow (its features, terminology, and more) and Embulk, a tool for bulk loading data between data sources.

https://docs.google.com/presentation/d/1LxwJIA2BFbGtPtdaUaGp1mngYWUQDMrM1V3DZ8AneR0/edit#slide=id.p

This post describes how the various tiers of storage and cache work in an Apache Pulsar system and how common scenarios (writes, catch-up reads, etc.) interact with the cache.

https://streaml.io/blog/access-patterns-and-tiered-storage-in-apache-pulsar

The Hortonworks blog has a post about Apache Hadoop Ozone. There are lots of interesting pieces to the architecture, including the use of RocksDB for storage, Apache Ratis (a Java implementation of the Raft consensus protocol), and more. The Ozone architecture breaks significantly from HDFS. In fact, there's a proposal in the works to replace the HDFS block storage layer with the one from Ozone.

https://hortonworks.com/blog/apache-hadoop-ozone-object-store-architecture/

Azure CosmosDB has a MongoDB API, which can be leveraged alongside the MongoDB Kafka Connect plugin to implement change data capture into Apache Kafka.

https://medium.com/@hpgrahsl/connecting-apache-kafka-to-azure-cosmosdb-part-i-da57f73c35fa

This tutorial shows how to provision a Wallaroo cluster using Pulumi and Ansible. With this setup, the authors show good speedups in a CSV workload across 4, 8, and 16 servers.

https://blog.wallaroolabs.com/2018/10/spinning-up-a-wallaroo-cluster-is-easy/

In this post, there's a great list of features, with explanations, that are important in a production-ready workflow engine. Many of the descriptions include anecdotes and examples that were hard-won by running production systems. If you're ever inclined to write your own workflow engine, I'd suggest reading this one to get a good idea of what you're in for.

https://medium.com/the-prefect-blog/pipeline-pitfalls-57fe558cd76

Pravega is a streaming storage engine with similar features to Apache Kafka and Apache Pulsar. This post looks at the internals of the system: the controller, which is responsible for high-level cluster operations, the segment store, which supports two tiers of storage, and more.

http://blog.pravega.io/2018/10/17/pravega-internals/

Uber writes about the evolution of their data infrastructure from one based only on Vertica, to one based on Hadoop and Vertica, to one that supports incremental inserts, updates, and deletes and incorporates Kafka to improve latency and performance. They've built an ingestion service to consolidate the logic of importing changelog data and maintaining incremental and "latest" views of a table.

https://eng.uber.com/uber-big-data-platform/

Pangu is Alibaba's distributed file system. This post provides a high-level introduction to its architecture and design goals, which include compatibility with the Hadoop FileSystem API.

https://medium.com/@Alibaba_Cloud/pangu-the-high-performance-distributed-file-system-by-alibaba-cloud-6c189d120710

Indeed has written a series on their metrics and insights system. The first describes Imhotep, the open-source system that's the core of the Indeed data platform. Subsequent posts cover how they use Imhotep in their workflow, and there's an example application based on analyzing the ASF JIRA dataset. They even have a demo site where you can try out some queries on the JIRA data.

https://medium.com/indeed-engineering/imhotep-scalable-efficient-and-fast-a4e320b87a74

If you've worked with a modern streaming system, you've inevitably heard of event time and watermarks. This post from the data Artisans team is an easy-to-understand explanation of these important concepts.

https://data-artisans.com/blog/watermarks-in-apache-flink-made-easy
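To make the event time and watermark idea concrete, here is a minimal sketch in plain Python. It does not use the Flink API. It assumes a "bounded out-of-orderness" strategy, where the watermark trails the largest event time seen so far by a fixed delay, and it treats an event as late when its timestamp is at or behind the current watermark. The function name, parameter names, and sample timestamps are hypothetical and chosen only for illustration.

```python
def classify_events(event_times, max_out_of_orderness):
    """Return (event_time, watermark_after_event, is_late) for each event.

    Assumption: the watermark is the largest event time seen so far minus
    a fixed allowed delay. This is a simplification for illustration, not
    the exact behavior of any particular streaming engine.
    """
    max_seen = float("-inf")
    watermark = float("-inf")
    results = []
    for event_time in event_times:
        # Late: the stream already claimed to be past this point in event time.
        is_late = event_time <= watermark
        max_seen = max(max_seen, event_time)
        # The watermark only moves forward because max_seen never decreases.
        watermark = max_seen - max_out_of_orderness
        results.append((event_time, watermark, is_late))
    return results


# Events arrive out of order; up to 3 time units of disorder are tolerated.
results = classify_events([1, 4, 2, 9, 3, 10], max_out_of_orderness=3)
late_events = [event_time for event_time, _, is_late in results if is_late]
```

With these sample values, the event with timestamp 2 is still accepted because the watermark is only at 1 when it arrives, while the event with timestamp 3 arrives after the watermark has advanced to 6 and is flagged as late.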

The FaunaDB transaction protocol is quite different from that of most distributed databases. It avoids many of the challenges in distributed consensus (such as the need for high-precision clocks) by batching transactions. This post has a great overview of the protocol, including many illustrations that help with understanding how the system behaves.

https://fauna.com/blog/consistency-without-clocks-faunadb-transaction-protocol

Common Table Expressions really help with readability of SQL queries, and they're supported by lots of databases. If you're not familiar, here's a good intro.

https://dev.to/helenanders26/why-you-should-use-sql-ctes-25lk
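As a quick illustration of how a CTE names an intermediate result so the main query reads top to bottom, here is a small sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are hypothetical and are not taken from the linked post.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ada', 30.0), ('ada', 70.0), ('bob', 20.0), ('cy', 50.0);
    """
)

# The CTE (customer_totals) is defined once and referenced twice:
# in the outer SELECT and in the subquery that computes the average.
query = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total > (SELECT AVG(total) FROM customer_totals)
ORDER BY customer;
"""

rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Without the CTE, the aggregation would have to be repeated as a nested subquery in both places, which is the readability problem CTEs are meant to address.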

These two blog posts present a deep dive into Apache Pulsar. The first covers architecture (including how BookKeeper, ZooKeeper and Pulsar components fit together), Pulsar's read and write semantics, and how it compares to Apache Kafka and RabbitMQ in common failure scenarios. The second post induces failure in Pulsar to verify correctness of reads and writes during fail-over.

https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works
https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster

Jobs

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

This post introduces the concept of "negative data engineering," which is the defensive code and rules that a data engineer ends up spending lots of time on.

https://medium.com/the-prefect-blog/positive-and-negative-data-engineering-a02cb497583d

MongoDB, which had been licensed under the AGPLv3 license, has been relicensed to make it more difficult for cloud vendors to offer a MongoDB service without open sourcing their changes.

https://techcrunch.com/2018/10/16/mongodb-switches-up-its-open-source-license/

ZDNet has an interview with Confluent CEO Jay Kreps on the mainstream adoption of Apache Kafka, the Hortonworks and Cloudera merger, and more.

https://www.zdnet.com/article/pretty-low-level-pretty-big-deal-apache-kafka-and-confluent-open-source-go-mainstream/

The Call for Papers for Big Data Technology Warsaw Summit, which takes place in Warsaw next February, has been extended. Submissions are now accepted through October 25th.

https://bigdatatechwarsaw.eu/cfp/

Data Eng Conf takes place November 8th and 9th in NYC. Tickets are on sale now and go up in price later this week.

https://www.dataengconf.com/nyc-event-2018

Videos from Strange Loop 2018 have been posted online. There are lots of interesting talks on distributed systems, stream processing, data pipelines, optimizing Spark, and more.

https://www.youtube.com/playlist?list=PLcGKfGEEONaBUdko326yL6ags8C_SYgqH

Last week was Kafka Summit SF. Keynotes from Martin Kleppmann of the University of Cambridge, Jay Kreps of Confluent, and Chris D'Agostino of Capital One have been posted. There's also a panel with folks from Microsoft and Slack.

https://www.youtube.com/playlist?list=PLa7VYi0yPIH3il2suxtu71HPYnH0sck0q

Releases

HUE 4.3 is released. It includes lots of UX improvements to the SQL editor, improved dashboard layouts, and more.

http://gethue.com/hue-4-3-and-its-app-building-improvements-are-out/

Apache Beam 2.7.0 is out with new Kudu, Amazon SNS, and Amazon SQS integrations, experimental support for Python on local Flink, and more.

https://lists.apache.org/thread.html/abb51d6ecf20235b1a3eea6fdd781f591cdf48e1c4e4c1f4a1d91606@%3Cannounce.apache.org%3E

A maintenance release of librdkafka, the C/C++ library for Apache Kafka, was announced this week. It includes a number of bug fixes and enhancements.

https://github.com/edenhill/librdkafka/releases/tag/v0.11.6

PostgreSQL 11 was released earlier this week. This post on the PostgreSQL site has an overview of the improvements and new features in the new version.

https://www.postgresql.org/about/news/1894/

Events

Curated by Datadog ( http://www.datadog.com )

California

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Tuesday, October 23
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/255016589/

Minnesota

StreamSets at Cargill and phData (Hopkins) - Tuesday, October 23
https://www.meetup.com/Twin-Cities-StreamSets-User-Group/events/255176133/

Georgia

Druid: Operational Analytics for Event Data (Sandy Springs) - Tuesday, October 23
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/254790608/

Virginia

Using Apache NiFi to Integrate Data into a NoSQL Multi-Model Database (Arlington) - Wednesday, October 24
https://www.meetup.com/TechTalkDC/events/255008869/

CANADA

Information Session II: Introduction to Hadoop Ecosystem (Toronto) - Saturday, October 27
https://www.meetup.com/tordatascience/events/255451628/

BRAZIL

ETL and DW 3.0 with Azure Databricks Delta (Brasilia) - Wednesday, October 24
https://www.meetup.com/BSB-AI-Big-Data-Analytics/events/255611744/

UNITED KINGDOM

Clickstream Processing at the Financial Times (London) - Thursday, October 25
https://www.meetup.com/Apache-Flink-London-Meetup/events/254473865/

FINLAND

How-To Datalake and Spark (Helsinki) - Tuesday, October 23
https://www.meetup.com/Microsoft-BI-Power-BI-User-Group-Finland/events/255253184/

Apache Kafka Meetup @ Paf (Helsinki) - Thursday, October 25
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/255224651/

FRANCE

Paris Data Eng (Paris) - Tuesday, October 23
https://www.meetup.com/Paris-Data-Engineers-Meetup/events/254858236/

Apache Kafka and Streams Messaging Manager (Paris) - Thursday, October 25
https://www.meetup.com/futureofdata-paris/events/255320302/

GERMANY

Zeebe Meets Confluent: Taming Event-Driven Architectures (Berlin) - Monday, October 22
https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/254703597/

POLAND

Big Data on Kubernetes + Apache Beam: What Do I Gain? (Warszawa) - Thursday, October 25
https://www.meetup.com/warsaw-hug/events/255227113/

CROATIA

Kafka Streams and Monitoring (Zagreb) - Thursday, October 25
https://www.meetup.com/Zagreb-Kafka/events/254887995/