Data Eng Weekly Issue #278

19 August 2018

Qubole and Datadog open sourced new tools this week for Spark and Kafka (respectively). In tech, there are great articles from Pandora, Netflix, Instacart, JW Player, and Rezdy about how they're solving data challenges. A couple of technical deep dives and tutorials on KSQL UDFs and Airflow testing round things out.

Sponsor

From the creators of Apache Arrow, Dremio is an open source Data-as-a-Service platform. Accelerate your queries (up to 1,000x!) and make data truly self-service for your BI and data science users.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

Confluent writes about building a User Defined (Aggregate) Function for KSQL, which is a new feature of their latest release. The post contains example code and anticipates some development challenges and their solutions.

https://www.confluent.io/blog/build-udf-udaf-ksql-5-0

LocustDB is an experimental analytics database written in Rust and built on RocksDB. This post describes the internals of the system, which uses columnar encoding and compression tricks to provide impressive throughput.

https://clemenswinter.com/2018/08/13/how-read-100s-of-millions-of-records-per-second-from-a-single-disk/
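
To give a feel for the kind of columnar encoding tricks mentioned above, here is a minimal Python sketch of two common ones: dictionary encoding followed by run-length encoding. It is illustrative only and is not taken from the post or from the database's Rust code. The function names and the sample column are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = sorted(set(values))
    index = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Invert both steps to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Hypothetical low-cardinality column, the case where these encodings help most.
column = ["US", "US", "US", "DE", "DE", "US", "FR", "FR"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
```

Eight strings become a three-entry dictionary plus four (code, length) pairs, and `decode(dictionary, runs)` returns the original column. How much this helps in practice depends on the cardinality and ordering of the data.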

Pandora has written about using MemSQL as their analytics database. The post covers the goals of their analytics db (which replaced Hadoop), some of the tools they evaluated, the data design, and the system configuration (e.g. using RAID 10). MemSQL supports both columnar and row-based storage, and there's an interesting discussion of the tradeoffs to consider when deciding how to store a dataset.

https://engineering.pandora.com/using-memsql-at-pandora-79a86cb09b57

The Netflix Data team supports Jupyter notebooks as a first class component of their data pipeline. Notebooks are scheduled by data scientists, data engineers, software engineers, and data analysts. In this post, the team describes how they support these use cases and the infrastructure that powers them.

https://medium.com/netflix-techblog/notebook-innovation-591ee3221233

Scylla is an open-source (AGPL) distributed database with Apache Cassandra compatibility. It's a different codebase, written in C++, and thus there are some architectural differences. This post describes the internals of its data replication (or streaming) and how they plan to improve the performance in an upcoming release.

https://www.scylladb.com/2018/08/14/upcoming-improvements-scylla-streaming/

This post describes how Instacart split up their databases for isolation and scalability. For coupled components, they implemented asynchronous data replication so that db joins still work. This new design also helped them find and eliminate places in which multiple components were writing to the same tables. The post describes the steps they took to roll out the new design and migrate data (including a nifty pgsync tool for data replication).

https://tech.instacart.com/scaling-at-instacart-distributing-data-across-multiple-postgres-databases-with-rails-13b1e4eba202

The Rezdy data team shares their core philosophies and the tools they're using to implement a new data platform. Lots of good tips for how to build a platform that serves an entire organization.

https://medium.com/rezdy-engineering/an-introduction-to-data-at-rezdy-53b12d9935f5

In this post, an Apache Impala power-user shares some pain points / missing features identified from running Impala at scale.

https://medium.com/@adirmashiach/5-main-missing-features-in-impala-imo-1343c767081f

The JW Player team, which analyzes hundreds of GBs of playback data per day, has written about their experiences tuning an expensive Spark job. The post has a good explanation of several tuning knobs, including shuffle partitions and broadcast joins.

https://medium.com/jw-player-engineering/optimizing-spark-sql-performance-in-video-play-sessions-d49bfcca59b7
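
For readers new to broadcast joins, the idea can be sketched in plain Python: the small table is hashed once and held in memory, and the large table is streamed past it, so the large side never has to be repartitioned. This is a conceptual sketch under stated assumptions, not Spark code and not the JW Player job. The function name and sample rows are hypothetical. In Spark SQL, the settings usually associated with these two knobs are `spark.sql.shuffle.partitions` and `spark.sql.autoBroadcastJoinThreshold`.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key` by hashing the small side once."""
    # Build phase: index the small table in memory (the "broadcast" copy).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large table and look up matches.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data: many play events, few video records.
plays = [
    {"video_id": 1, "seconds": 30},
    {"video_id": 2, "seconds": 12},
    {"video_id": 1, "seconds": 45},
    {"video_id": 3, "seconds": 7},
]
videos = [
    {"video_id": 1, "title": "intro"},
    {"video_id": 2, "title": "demo"},
]
result = broadcast_hash_join(plays, videos, "video_id")
```

The play for `video_id` 3 has no matching video and is dropped, as in any inner join. The approach only pays off when the small side comfortably fits in memory on every worker.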

This post has a good overview of the features (and maturity of those features) of Azure Event Hubs. It also discusses some of the trade-offs vs. running your own Kafka cluster (Event Hubs supports the Kafka 1.x client protocol) and some performance tips.

https://medium.com/@yvescallaert/azure-event-hubs-the-good-the-bad-and-the-ugly-5b1120b8b9c2

A good primer on testing with Apache Airflow, this article covers testing DAG validity (e.g. no cycles), testing DAG definition (e.g. correct dependencies), and unit testing an operator.

https://medium.com/@chandukavar/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
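
As a rough illustration of what a "no cycles" validity check means, here is a sketch using only the Python standard library (`graphlib`, Python 3.9+). It does not use Airflow's API and is not the article's code. The task names and the `find_cycle` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(dependencies):
    """Return the tasks forming a cycle, or None if the graph is a valid DAG.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as error:
        # The second element of args is the list of nodes in the cycle.
        return error.args[1]
    return None


# Hypothetical pipelines: one valid, one where "extract" wrongly waits on "load".
valid = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
broken = {"extract": {"load"}, "transform": {"extract"}, "load": {"transform"}}

valid_cycle = find_cycle(valid)
broken_cycle = find_cycle(broken)
```

A validity test would assert that `find_cycle` returns `None` for every pipeline, while a definition test would assert specific edges, for example that `valid["transform"] == {"extract"}`.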

Sponsor

Astronomer helps organizations run Apache Airflow at scale. Easily deploy to a hosted service or private cloud with a full Kubernetes-based stack and ramp up your team with exclusive training and professional services.

Visit http://bit.ly/about-astronomer to learn more.

News

There's a new monthly Presto newsletter from the folks at Starburst. The first issue has a good collection of videos, posts, and presentations.

https://www.starburstdata.com/newsletter/presto-newsletter-1/

If you're looking for some good data systems papers to read, the August proceedings of the PVLDB are out. Among the papers are ones on building an Apache Beam runner for IBM Streams, data quality verification, streaming joins at Facebook, Google's F1 query engine, and Alibaba's distributed file system PolarFS.

http://www.vldb.org/pvldb/vol11.html

Jobs

Mayo Clinic is looking for a Lead Hadoop System Admin based in Rochester, MN; Phoenix, AZ; or Jacksonville, FL.

https://jobs.dataengweekly.com/jobs/5c0e56d3-309f-4ff8-a3d4-640a64ff3bab

Submit a job to the Data Eng Weekly board at https://jobs.dataengweekly.com/

Releases

Qubole has open sourced Sparklens, their tool for profiling and predicting the performance of Apache Spark jobs.

https://www.qubole.com/blog/sparklens-0-2-0-release-features-and-fixes/

Kafka-Kit is a new open source toolkit from Datadog. It includes two tools: topicmappr, a rack-aware tool for reassigning partitions and updating replication factors, and autothrottle, which prevents service degradation during a replication event by adjusting Kafka's replication throttle.

https://www.datadoghq.com/blog/engineering/introducing-kafka-kit-tools-for-scaling-kafka/

Version 0.5.7 of Scio, the Scala library for Apache Beam, has been released. This new version includes support for Beam 2.6.0 and other improvements and bug fixes.

https://github.com/spotify/scio/releases/tag/v0.5.7

FASTER is a new open source key-value store from Microsoft, with a design and architecture aimed at high performance.

https://github.com/Microsoft/FASTER

Azure HDInsight has a new integration between Apache Phoenix and Apache Zeppelin.

https://blogs.msdn.microsoft.com/ashish/2018/08/17/apache-phoenix-now-supports-zeppelin-in-azure-hdinsight/
