Data Eng Weekly Issue #279

26 August 2018

Lots of content this week including high-level articles on benchmarking, event sourcing architecture, and monitoring distributed systems as well as deep-dive articles on efficiently writing to a database and the correctness of the Dgraph distributed graph database. Also lots of releases this week, including the new Hortonworks Streams Messaging Manager. Finally, congrats to the Apache HAWQ team—that project has been promoted to a top-level Apache project.

Sponsor

Built by narwhals, just for you – Dremio simplifies data engineering and data analytics with the power of Apache Arrow. Connect almost any data source. Accelerate queries up to 1,000x. Let your BI and data science users curate their own data with our nautically-themed user interface. Open source.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

This post offers a good comparison between RabbitMQ and Apache Kafka for Trello's messaging use case. It also talks about the Kafka operational metrics they use to monitor the health of the system.

http://tech.trello.com/why-we-chose-kafka/

This article shares a list of practical challenges in using an event-sourcing/CQRS-based architecture. The challenges include schema changes, the inconsistency window (since reads are eventually consistent), choosing the right event granularity, and lack of ad hoc queries (e.g. as needed by a support team).

https://medium.com/@hugo.oliveira.rocha/what-they-dont-tell-you-about-event-sourcing-6afc23c69e9a
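To make the inconsistency window concrete, here is a minimal sketch in Python. It is not taken from the article: the EventStore and BalanceProjection classes and the account events are hypothetical, and a real system would update the read model asynchronously rather than through an explicit catch_up call.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """Append-only log of events (the write side). Hypothetical helper."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)


@dataclass
class BalanceProjection:
    """Read model built by replaying events (the read side). Hypothetical helper."""
    balances: dict = field(default_factory=dict)
    position: int = 0  # index of the next event to apply

    def catch_up(self, store: EventStore) -> None:
        # Apply every event that has not been projected yet.
        for event in store.events[self.position:]:
            account = event["account"]
            self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.position = len(store.events)


store = EventStore()
projection = BalanceProjection()

store.append({"account": "a1", "amount": 100})
projection.catch_up(store)

# A new event is written, but the projection has not processed it yet.
store.append({"account": "a1", "amount": -30})
stale_read = projection.balances["a1"]  # still reflects only the first event

projection.catch_up(store)
fresh_read = projection.balances["a1"]  # now reflects both events

print(stale_read, fresh_read)
```

The gap between the second append and the second catch_up is the window in which a reader sees stale data.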

With all the different Hadoop-ecosystem execution engines these days, MR3 might not be on your radar. It has a similar design to Apache Tez, though, and it claims to have speedups through a simpler design and elastic resource allocation. This benchmark compares Presto, Hive-on-Tez (with LLAP), Hive-on-MR3, and SparkSQL across a number of queries. As always, you should try things with your own data, but these results are interesting because they are favorable for both Hive-on-Tez and Hive-on-MR3 with some differences across data/cluster size.

https://mr3.postech.ac.kr/blog/2018/08/15/comparison-llap-presto-spark-mr3/

If you're running a data system in the cloud, then you might want to try out this architecture for storing data. Rather than network attached storage (e.g. AWS EBS in this case), they optimize for performance and price by using local NVMe storage. To achieve backups, they add one replica backed by EBS for creating snapshots. There's lots more analysis in the post.

https://www.percona.com/blog/2018/08/20/using-aws-ec2-instance-store-vs-ebs-for-mysql-how-to-increase-performance-and-decrease-cost/

This post is a great overview of system health checks and how to extend them to distributed systems. There's a good discussion of the importance of measuring end-to-end health and how healthiness is not binary but a spectrum. There's an overview of measuring health of the distributed system architecture at imgix and lots of references for further reading.

https://medium.com/@copyconstruct/health-checks-in-distributed-systems-aa8a0e8c1672
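One way to picture health as a spectrum rather than a binary is to combine several checks into a weighted score. The sketch below is an illustration only, not the approach used at imgix: the check names, weights, and thresholds are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    weight: float = 1.0  # assumed relative importance of the check


def health_score(results: list[CheckResult]) -> float:
    """Return the weighted fraction of passing checks, from 0.0 to 1.0."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    passing = sum(r.weight for r in results if r.ok)
    return passing / total


def classify(score: float, healthy_at: float = 0.9, degraded_at: float = 0.5) -> str:
    """Map a score to a coarse label. The thresholds are arbitrary assumptions."""
    if score >= healthy_at:
        return "healthy"
    if score >= degraded_at:
        return "degraded"
    return "unhealthy"


# Hypothetical checks: the end-to-end probe is weighted most heavily.
results = [
    CheckResult("end_to_end_request", ok=True, weight=3.0),
    CheckResult("cache_reachable", ok=False, weight=1.0),
    CheckResult("disk_space", ok=True, weight=1.0),
]

score = health_score(results)
status = classify(score)
print(f"{score:.2f} {status}")
```

A load balancer or orchestrator could use the label to shed some traffic from a degraded node instead of removing it outright.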

The Pulsar IO framework is a new feature of version 2.1.0-incubating of Apache Pulsar, the distributed pubsub system. It's a pluggable, fault-tolerant system for moving data within and across systems (e.g. sending data from Pulsar to Apache Cassandra). This post provides a good intro to the framework, including how to write a custom source or sink.

https://streaml.io/blog/introducing-pulsar-io

As a follow-up to last week's post, Netflix writes this week about their system for scheduling and executing Jupyter notebooks as part of their data pipeline. It's built using the open-source papermill project, which supports parameterizing and executing notebooks.

https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6

The latest Jepsen post covers Dgraph, a distributed graph database. The article describes the Dgraph architecture and design goals, covers the experiments that the Jepsen framework performed, and goes through several correctness and stability issues that were discovered. Of the 23 issues found, 19 have been resolved. It's a worthwhile read that reminds us just how hard it is to build a battle-tested distributed system.

https://jepsen.io/analyses/dgraph-1-0-2

The latest in a multi-part Kafka tutorial series covers KSQL. Previous posts include measuring latency, integrating with Spark Structured Streaming, and writing Kafka Streams in Kotlin.

http://aseigneurin.github.io/2018/08/22/kafka-tutorial-10-ksql.html

MapR has a tutorial for connecting to an Apache Drill cluster via a Jupyter notebook and the Python Drill library.

https://mapr.com/blog/drilling-jupyter/

This article is a good overview of techniques to efficiently write batched data to a database using Java and Hibernate.

https://medium.com/@jerolba/persisting-fast-in-database-1af4a281e3a
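The article works in Java and Hibernate. As a rough illustration of the same general idea (send rows in batches and commit once per batch rather than once per row), here is a sketch that swaps in Python's standard sqlite3 module. The events table, the insert_batched helper, and the batch size are assumptions made for the example.

```python
import sqlite3


def insert_batched(conn, rows, batch_size=1000):
    """Insert rows in batches, with one transaction per batch."""
    sql = "INSERT INTO events (id, payload) VALUES (?, ?)"
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # The connection context manager commits on success
        # and rolls back if the batch raises an error.
        with conn:
            conn.executemany(sql, batch)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(i, f"payload-{i}") for i in range(2500)]
insert_batched(conn, rows, batch_size=1000)

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)
```

The right batch size depends on the database and the row width, so it is worth measuring with your own data.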

An example of using Kafka Streams to build a real-time application based on NYC Citi Bike data.

https://medium.com/@leihetito/tracking-nyc-citi-bike-real-time-utilization-using-kafka-streams-1c0ea9e24e79

Sponsor

Astronomer helps organizations run Apache Airflow at scale. Easily deploy to a hosted service or private cloud with a full Kubernetes-based stack and ramp up your team with exclusive training and professional services.

Visit http://bit.ly/about-astronomer to learn more.

News

Apache HAWQ, the SQL on Hadoop engine, has been promoted to an Apache top-level project after nearly 3 years in incubation. HAWQ's codebase has its origins in the Pivotal Greenplum Database, and it includes a number of ecosystem integrations.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces38

Jobs

Mayo Clinic is looking for a Lead Hadoop System Admin based in Rochester, MN; Phoenix, AZ; or Jacksonville, FL.

https://jobs.dataengweekly.com/jobs/5c0e56d3-309f-4ff8-a3d4-640a64ff3bab

Submit a job to the Data Eng Weekly board at https://jobs.dataengweekly.com/

Releases

Hortonworks DataFlow 3.2 was released a week ago. It's based on Apache NiFi 1.7, includes Apache Kafka 1.1.1, and supports Hive 3.

https://hortonworks.com/blog/whats-new-hortonworks-dataflow-hdf-3-2/

Apache Flink 1.5.3 is out. It's a bug fix release that resolves nearly 30 issues.

https://flink.apache.org/news/2018/08/21/release-1.5.3.html

Hortonworks has unveiled a new operational management and API tool for Apache Kafka called the Hortonworks Streams Messaging Manager (SMM). It provides a view of important Kafka metrics, a REST API, and integration with several other ecosystem services. SMM is open source with an AGPL license.

https://hortonworks.com/blog/introducing-hortonworks-streams-messaging-manager-smm/
https://github.com/hortonworks/smm_server/tree/SMM-1.0.0.0
https://github.com/hortonworks/smm_app/tree/SMM-APP-1.0.0.0

Version 0.5.2 of Wallaroo, the streaming data framework, has been released. It now supports additional Linux distributions but doesn't include changes to the runtime.

https://github.com/WallarooLabs/wallaroo/releases/tag/0.5.2

Amazon Kinesis Data Streams recently added support for HTTP/2 and "enhanced fan-out," which is a mechanism for reserving throughput for consumers in a multi-tenant setup.

https://aws.amazon.com/blogs/aws/kds-enhanced-fanout/
