Data Eng Weekly Issue #306

24 March 2019

Lots of coverage of Apache Spark this week, as well as articles on Apache Kudu, Postgres, Apache Kafka Streams, and more. In releases, there's a new version of MR3, kaf is an interesting new CLI tool for Kafka, and Prefect has open sourced the core library of their workflow engine.

Technical

This post describes several aspects of running the BigQuery Kafka Connect plugin, such as rate limiting, how the connector handles deletes, and handling data deduplication (or lack thereof).

https://blog.softwaremill.com/top-6-insights-you-should-know-before-using-the-kafka-connect-bigquery-sink-connector-7c5e5b174de6

The Apache Kudu blog has an overview of the KuduTestHarness for testing JVM applications that use Kudu.

https://kudu.apache.org/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html

The pgDash blog has a post on the configurations for CPU, memory, network, and more that can be tweaked to vertically scale your Postgres deployment on beefier hardware.

https://pgdash.io/blog/scaling-postgres.html

A look at using Mode, a hosted service for Python and R Notebooks, with data in Amazon Redshift. The post also describes the evolution of BI workflows at most companies and how to use Fivetran to ETL data into Redshift.

https://aws.amazon.com/blogs/big-data/build-a-modern-analytics-stack-optimized-for-sharing-and-collaborating-with-mode-and-amazon-redshift/

Databricks Delta supports the SQL MERGE command, which can be used to update records in a Databricks Data Lake. Their post covers items like deleting records for GDPR compliance and applying updates based on change data capture.

https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html

The Qubole blog has a post on the Kinesis Connector for Spark Structured Streaming, which covers the connector's architecture and has an example of a streaming job.

https://www.qubole.com/blog/kinesis-connector-for-structured-streaming/

This post provides an extensive overview of Apache Spark's windowing functions for things like ranks within an ordered window, lag & lead to compare to previous/next/other values in the window, and more. The post provides code to illustrate each of these by executing against a sample data set.

https://knockdata.github.io/spark-window-function/
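
To make the rank, lag, and lead ideas concrete, here is a minimal sketch in plain Python rather than Spark. It is not taken from the linked post: the sample rows and the window_columns helper are hypothetical, and the logic only mirrors what rank(), lag(), and lead() compute over a window partitioned by department and ordered by salary descending.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (department, employee, salary)
rows = [
    ("eng", "ana", 120),
    ("eng", "bo", 100),
    ("eng", "cy", 100),
    ("ops", "di", 90),
    ("ops", "ed", 80),
]


def window_columns(rows):
    """Append rank, lag, and lead of salary within each department.

    The window is partitioned by department and ordered by salary descending.
    """
    out = []
    # Sort by partition key, then by the window ordering (salary descending).
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    for _, group in groupby(ordered, key=itemgetter(0)):
        part = list(group)
        rank = 0
        for i, (dept, name, salary) in enumerate(part):
            # rank: ties share a rank; the next distinct value skips ahead.
            if i == 0 or salary != part[i - 1][2]:
                rank = i + 1
            # lag/lead: previous/next salary in the window, None at the edges.
            lag = part[i - 1][2] if i > 0 else None
            lead = part[i + 1][2] if i + 1 < len(part) else None
            out.append((dept, name, salary, rank, lag, lead))
    return out


for row in window_columns(rows):
    print(row)
```

In this sketch the two engineers tied at 100 both get rank 2, and the first row of each department has no lag value because nothing precedes it in the window.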

The Confluent blog looks at the new Suppress operation in Apache Kafka Streams, which can be used to simplify use cases like alerting. Suppress supports both a time delay and waiting until a particular window has closed to trigger an alert. The post also covers some considerations related to in-memory buffering.

https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
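
The idea of waiting until a window has closed can be sketched as a toy model in plain Python. This is not the Kafka Streams API: the function name, the event format, and the assumption that events arrive in timestamp order are all hypothetical, chosen only to show why the results are buffered in memory until a later event advances time past the window end plus a grace period.

```python
def suppress_until_window_closes(events, window_size, grace=0):
    """Return one final count per (key, window) once the window has closed.

    events: iterable of (timestamp, key), assumed to arrive in timestamp order.
    A window is treated as closed when an event's timestamp reaches
    window_start + window_size + grace.
    """
    buffer = {}   # (key, window_start) -> running count, held in memory
    results = []
    for ts, key in events:
        start = ts - ts % window_size
        buffer[(key, start)] = buffer.get((key, start), 0) + 1
        # Each event advances the notion of time; flush windows that are closed.
        closed = [k for k in buffer if k[1] + window_size + grace <= ts]
        for k in sorted(closed, key=lambda k: (k[1], k[0])):
            results.append((k[0], k[1], buffer.pop(k)))
    return results


# Hypothetical events: (timestamp, key)
events = [(1, "a"), (2, "a"), (3, "b"), (11, "a"), (25, "b")]
print(suppress_until_window_closes(events, window_size=10))
```

Note that in this sketch the last window for "b" is never emitted, because no later event arrives to move time past its end. It also shows where the memory cost comes from: every open (key, window) pair sits in the buffer until it closes.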

When running an EMR cluster, you have the option of storing data in S3 or in HDFS on the cluster. This article describes a number of options for copying data between S3 and HDFS, and it shows how (based on a couple of simple Presto queries) querying data in HDFS can be much faster.

https://tech.marksblogg.com/faster-file-distribution-hadoop-hdfs-s3.html

This article covers many of the main performance tuning parameters of an Apache Spark job, such as dynamic allocation, parallelism, and speculative execution.

https://medium.com/datakaresolutions/key-factors-to-consider-when-optimizing-spark-jobs-72b1a0dc22bf

An intro to the new EXCEPT ALL and INTERSECT ALL SQL operations in Apache Spark 2.4.0.

https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-except-all-intersect-all/read
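
The difference from plain EXCEPT and INTERSECT is that the ALL variants keep duplicates. A minimal plain-Python sketch of those multiset semantics follows. It is not Spark code; the helper names and the sample values are hypothetical, and the results are sorted only to make the output deterministic.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: each row in right cancels one matching row in left."""
    return sorted((Counter(left) - Counter(right)).elements())


def intersect_all(left, right):
    """Multiset intersection: a row appears min(count in left, count in right) times."""
    return sorted((Counter(left) & Counter(right)).elements())


# Hypothetical single-column tables
left = [1, 1, 1, 2, 3]
right = [1, 2, 2, 4]

print(except_all(left, right))            # duplicates survive: [1, 1, 3]
print(sorted(set(left) - set(right)))     # set-based EXCEPT for comparison: [3]
print(intersect_all(left, right))         # [1, 2]
```

Here the value 1 appears three times on the left and once on the right, so the multiset difference keeps two copies of it, while the set-based version drops it entirely.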

This post looks at building a MapReduce framework from scratch on Kubernetes. The system is written in Go and uses HTTP for transport. While far from a production system, it's interesting to see what building MapReduce from first principles in Kubernetes might look like.

https://medium.com/digitalwing/development-of-a-distributed-computing-system-based-on-mapreduce-and-kubernetes-837fc7f112f9

Jobs

Senior Data Engineer (Spark), N26, Berlin https://jobs.dataengweekly.com/jobs/c202d274-c3df-4274-9ffb-77a749db5c3f

Software Engineer - Data Platform, Fitbit, San Francisco, CA https://jobs.dataengweekly.com/jobs/ab98f336-41d9-455f-b0a7-f9d5975e5975

News

Datapractices.org is joining the Linux Foundation. The project enumerates a set of principles (one example is "Recognize and mitigate bias in ourselves and in the data we use.") and has courseware on GitHub.

https://www.linuxfoundation.org/blog/2019/03/datapractices-org-joins-the-linux-foundation-to-advance-best-practices-offers-open-courseware-across-data-ecosystem/

The International Conference on Extending Database Technology is this week, and the papers have been posted online. There's lots of interesting content, including papers on Spark SQL and KSQL.

https://openproceedings.org/html/pages/2019_edbt.html

Releases

A new version of MR3, the execution engine for Apache Hive, has been released. The project has posted some performance comparisons, too.

https://groups.google.com/forum/#!msg/hive-mr3/EyZeAuBH_FQ/BdhbMPnGBwAJ
https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/

Kaf is a command line utility for Apache Kafka written in Golang. The authors credit kubectl and docker for inspiration.

https://github.com/infinimesh/kaf/releases/tag/v0.1.14

Apache NiFi 1.9.1 is out. It's a maintenance release with improvements to SFTP, JSON record readers, and more.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.9.1

Version 2.6.1 of Apache Kylin, the Distributed Analytics Engine, is out. The release includes over 25 issues.

https://lists.apache.org/thread.html/d6871e77bae236569492e00f42f28bb1b69bbf2a82ace2d2a176488e@%3Cannounce.apache.org%3E

Version 0.6.0 of Apache NiFi MiNiFi was released. Among the features are support for natively written Python processors and a new structured logging library.

https://lists.apache.org/thread.html/a66281bbee6f4aec10f5aea963dc5aaf1e48e4dcbbee73eae7d5e37e@%3Cannounce.apache.org%3E

Prefect has open sourced the core library of their workflow engine. Their introductory post has much more about the features of the library and the platform.

https://medium.com/the-prefect-blog/prefect-is-open-source-744e3c00cf35

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Bay Area Apache Flink Meetup (San Francisco) - Monday, March 25
https://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/258975465/

Arizona

DesertPy: Calling Native C Code and Using Kafka (Scottsdale) - Wednesday, March 27
https://www.meetup.com/Phoenix-Python-Meetup-Group/events/258449213/

New York

Two Sigma Open Source Meetup (New York) - Monday, March 25
https://www.meetup.com/Alluxio-Open-Source-New-York-Meetup/events/259474929/

An Introduction to Streaming Data and Stream Processing with Apache Kafka (Webster) - Wednesday, March 27
https://www.meetup.com/RIG-Rochester-Infrastructure-Group/events/258102479/

CANADA

Toronto Apache Spark 2.0 (Toronto) - Wednesday, March 27
https://www.meetup.com/TAS-2-0-Toronto-Apache-Spark/events/259329268/

UNITED KINGDOM

ETL in Azure Made Easy with Data Factory Data Flow (Bristol) - Tuesday, March 26
https://www.meetup.com/BigDataBristol/events/258063871/

"Everything Data" Launch: Exploring Data Engineering (Belfast) - Tuesday, March 26
https://www.meetup.com/Everything-Data/events/259119470/

SPAIN

Apache Kafka: Tips from the Trenches, or How to Fail Successfully (Madrid) - Tuesday, March 26
https://www.meetup.com/madrid-devops/events/259925537/

FRANCE

Kafka @ Accor & Gekko: Real-Time Issues and Big Data (Paris) - Tuesday, March 26
https://www.meetup.com/H-Tech-Hub/events/259594436/

Beyond Brokers: A Tour of the Kafka Environment (Villeurbanne) - Thursday, March 28
https://www.meetup.com/Lyon-Java-User-Group-LyonJUG/events/259569434/

GERMANY

Kafka Is Not Just an Author (Hamburg) - Thursday, March 28
https://www.meetup.com/jug-hamburg/events/259428837/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.