24 March 2019
Lots of coverage of Apache Spark this week as well as articles on Apache Kudu, Postgres, Apache Kafka Streams and more. And in releases, MR3 had a new release, kaf is an interesting new CLI tool for Kafka, and Prefect has open sourced the core library of their workflow engine.
This post describes several aspects of running the BigQuery Kafka Connect plugin, such as rate limiting, how the connector handles deletes, and handling data deduplication (or lack thereof).
The Apache Kudu blog has an overview of the KuduTestHarness for testing JVM applications that use Kudu.
https://kudu.apache.org/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html
The pgDash blog has a post on the configurations for CPU, memory, network, and more that can be tweaked to vertically scale your Postgres deployment on beefier hardware.
https://pgdash.io/blog/scaling-postgres.html
A look at using Mode, a hosted service for Python and R Notebooks, with data in Amazon Redshift. The post also describes the evolution of BI workflows at most companies and how to use Fivetran to ETL data into Redshift.
Databricks Delta supports the SQL MERGE command, which can be used to update records in a Databricks Data Lake. Their post covers items like deleting records for GDPR compliance and applying updates based on change data capture.
https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
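To make the upsert idea concrete, here is a minimal pure-Python sketch of MERGE-style semantics: matched rows are updated or deleted, unmatched rows are inserted. It is not Databricks Delta's implementation or API. The function name `merge_into`, the `op` column, and the sample rows are all hypothetical and chosen only for illustration.

```python
def merge_into(target, changes, key="id"):
    """Apply change records to target rows, MERGE-style (illustrative only).

    - a change with op == "delete" removes the matching row, if any
    - any other change updates the matching row, or inserts a new one
    """
    merged = {row[key]: dict(row) for row in target}
    for change in changes:
        k = change[key]
        if change.get("op") == "delete":
            # Roughly: WHEN MATCHED AND op = 'delete' THEN DELETE
            merged.pop(k, None)
        else:
            # Roughly: WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT
            values = {c: v for c, v in change.items() if c != "op"}
            merged[k] = {**merged.get(k, {}), **values}
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical sample data: a small table plus a batch of captured changes.
target = [{"id": 1, "city": "Berlin"}, {"id": 2, "city": "Paris"}]
changes = [
    {"id": 2, "city": "Lyon", "op": "update"},
    {"id": 3, "city": "Madrid", "op": "insert"},
    {"id": 1, "op": "delete"},
]
result = merge_into(target, changes)
```

With this sample batch, row 1 is deleted, row 2 is updated, and row 3 is inserted, which mirrors the GDPR-delete and change-data-capture cases the post describes.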
The Qubole blog has a post on the Kinesis Connector for Spark Structured Streaming, which covers the connector's architecture and has an example of a streaming job.
https://www.qubole.com/blog/kinesis-connector-for-structured-streaming/
This post provides an extensive overview of Apache Spark's windowing functions for things like ranks within an ordered window, lag & lead to compare to previous/next/other values in the window, and more. The post provides code to illustrate each of these by executing against a sample data set.
https://knockdata.github.io/spark-window-function/
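As a rough illustration of what rank, lag, and lead compute, the following pure-Python sketch evaluates them over a partitioned, ordered window. It does not use Spark; the function name `rank_lag_lead`, the column names, and the sample rows are hypothetical. Rank follows the usual SQL convention in which ties share a rank and leave a gap afterwards.

```python
from itertools import groupby
from operator import itemgetter


def rank_lag_lead(rows, partition_key, order_key):
    """Add rank, lag, and lead columns per partition (illustrative only)."""
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        part = list(group)
        rank = 0
        for i, row in enumerate(part):
            # Ties keep the previous rank; a new value takes its 1-based position.
            if i == 0 or row[order_key] != part[i - 1][order_key]:
                rank = i + 1
            out.append({
                **row,
                "rank": rank,
                # lag: value from the previous row in the window, if any
                "lag": part[i - 1][order_key] if i > 0 else None,
                # lead: value from the next row in the window, if any
                "lead": part[i + 1][order_key] if i + 1 < len(part) else None,
            })
    return out


# Hypothetical sample data.
rows = [
    {"dept": "a", "salary": 20},
    {"dept": "b", "salary": 5},
    {"dept": "a", "salary": 10},
    {"dept": "a", "salary": 20},
]
windowed = rank_lag_lead(rows, "dept", "salary")
```

In Spark itself, the same results would come from the built-in window functions applied over a window specification rather than from hand-written loops like these.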
The Confluent blog looks at the new Suppress operation in Apache Kafka Streams, which can be used to simplify use cases like alerting. Suppress supports both a time delay and waiting until a particular window has closed to trigger an alert. The post also covers some considerations related to in memory buffering.
https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
When running an EMR cluster, you have the option of storing data in S3 or in HDFS on the cluster. This article describes a number of options for copying data between S3 and HDFS, and it shows how (based on a couple of simple Presto queries) querying data in HDFS can be much faster.
https://tech.marksblogg.com/faster-file-distribution-hadoop-hdfs-s3.html
This article covers many of the main performance tuning parameters of an Apache Spark job, such as dynamic allocation, parallelism, and speculative execution.
https://medium.com/datakaresolutions/key-factors-to-consider-when-optimizing-spark-jobs-72b1a0dc22bf
An intro to the new EXCEPT ALL and INTERSECT ALL SQL operations in Apache Spark 2.4.0.
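The difference from plain EXCEPT and INTERSECT is that the ALL variants keep duplicates, that is, they treat the inputs as multisets. A minimal pure-Python sketch of those semantics, using `collections.Counter` rather than Spark, might look like the following. The helper names and sample rows are hypothetical.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: a row seen m times on the left and n times
    on the right is kept max(m - n, 0) times."""
    return sorted((Counter(left) - Counter(right)).elements())


def intersect_all(left, right):
    """Multiset intersection: a row seen m times on the left and n times
    on the right is kept min(m, n) times."""
    return sorted((Counter(left) & Counter(right)).elements())


# Hypothetical sample rows, represented as tuples.
left = [(1, "a"), (1, "a"), (1, "a"), (2, "b")]
right = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]

only_left = except_all(left, right)    # two copies of (1, "a") remain
in_both = intersect_all(left, right)   # one (1, "a") and one (2, "b")
```

By contrast, the non-ALL forms would deduplicate their results, so the set-based EXCEPT of these inputs would be empty, because every distinct row on the left also appears on the right.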
This post looks at building a MapReduce framework from scratch on Kubernetes. The system is written in Go and uses HTTP for transport. While far from a production system, it's interesting to see what building MapReduce from first principles in Kubernetes might look like.
Senior Data Engineer (Spark), N26, Berlin https://jobs.dataengweekly.com/jobs/c202d274-c3df-4274-9ffb-77a749db5c3f
Software Engineer - Data Platform, Fitbit, San Francisco, CA https://jobs.dataengweekly.com/jobs/ab98f336-41d9-455f-b0a7-f9d5975e5975
Datapractices.org is joining the Linux Foundation. The project enumerates principles (one example is "Recognize and mitigate bias in ourselves and in the data we use.") and has courseware on GitHub.
The International Conference on Extending Database Technology is this week, and the papers have been posted online. Lots of interesting content including on Spark SQL and KSQL.
https://openproceedings.org/html/pages/2019_edbt.html
A new version of MR3, the execution engine for Apache Hive, has been released. The project has posted some performance comparisons, too.
https://groups.google.com/forum/#!msg/hive-mr3/EyZeAuBH_FQ/BdhbMPnGBwAJ
https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/
Kaf is a command line utility for Apache Kafka written in Golang. The authors credit kubectl and docker for inspiration.
https://github.com/infinimesh/kaf/releases/tag/v0.1.14
Apache NiFi 1.9.1 is out. It's a maintenance release with improvements to SFTP, JSON record readers, and more.
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.9.1
Version 2.6.1 of Apache Kylin, the distributed analytics engine, is out. The release addresses over 25 issues.
Version 0.6.0 of Apache NiFi MiNiFi was released. Among the features are support for natively written Python processors and a new structured logging library.
Prefect has open sourced their workflow engine core library. There's much more about the features of the library and the platform in their introductory post.
https://medium.com/the-prefect-blog/prefect-is-open-source-744e3c00cf35
Curated by Datadog ( http://www.datadog.com )
Bay Area Apache Flink Meetup (San Francisco) - Monday, March 25
https://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/258975465/
DesertPy: Calling Native C Code and Using Kafka (Scottsdale) - Wednesday, March 27
https://www.meetup.com/Phoenix-Python-Meetup-Group/events/258449213/
Two Sigma Open Source Meetup (New York) - Monday, March 25
https://www.meetup.com/Alluxio-Open-Source-New-York-Meetup/events/259474929/
An Introduction to Streaming Data and Stream Processing with Apache Kafka (Webster) - Wednesday, March 27
https://www.meetup.com/RIG-Rochester-Infrastructure-Group/events/258102479/
Toronto Apache Spark 2.0 (Toronto) - Wednesday, March 27
https://www.meetup.com/TAS-2-0-Toronto-Apache-Spark/events/259329268/
ETL in Azure Made Easy with Data Factory Data Flow (Bristol) - Tuesday, March 26
https://www.meetup.com/BigDataBristol/events/258063871/
"Everything Data" Launch: Exploring Data Engineering (Belfast) - Tuesday, March 26
https://www.meetup.com/Everything-Data/events/259119470/
Apache Kafka: Tips from the Trenches, or How to Fail Successfully (Madrid) - Tuesday, March 26
https://www.meetup.com/madrid-devops/events/259925537/
Kafka @ Accor & Gekko: Real-Time Issues and Big Data (Paris) - Tuesday, March 26
https://www.meetup.com/H-Tech-Hub/events/259594436/
Beyond Brokers: A Tour of the Kafka Environment (Villeurbanne) - Thursday, March 28
https://www.meetup.com/Lyon-Java-User-Group-LyonJUG/events/259569434/
Kafka Is Not Just an Author (Hamburg) - Thursday, March 28
https://www.meetup.com/jug-hamburg/events/259428837/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.