Data Eng Weekly Issue #301

10 February 2019

This week's issue has the usual amount of content on Kafka and streaming data, and it also has several articles on less frequent topics. These include Redshift, some debugging stories (such as with YARN+cgroups), Kubernetes, and loading data into BigQuery using Google Cloud Functions. In releases, there are some interesting new projects from LinkedIn (Kafka Cruise Control Frontend) and FANDOM (Athena Alerter).

Technical

This tutorial shows how to build a Redshift query that efficiently joins data from the MaxMind GeoIP database to analyze the location of IPs. The solution includes a neat trick to optimize the join cost by computing a lookup table that enables filtering by IP prefix.

https://towardsdatascience.com/the-easy-way-to-use-maxmind-geoip-with-redshift-65cf979e63b1
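
As a rough illustration of the prefix idea (not the article's Redshift SQL), the sketch below buckets networks by their first two octets so that a lookup only scans ranges sharing the IP's prefix. The sample networks, the 16-bit prefix width, and the function names are all assumptions made for this example.

```python
import ipaddress
from collections import defaultdict

# Hypothetical rows standing in for the GeoIP network table.
NETWORKS = [
    ("10.1.0.0/16", "Berlin"),
    ("10.2.0.0/16", "Tallinn"),
    ("192.168.4.0/24", "San Francisco"),
]


def prefix_of(ip_int):
    # Bucket key: the top 16 bits (first two octets) of an IPv4 address.
    return ip_int >> 16


def build_lookup(networks):
    # Map each prefix to the (start, end, location) ranges that overlap it.
    lookup = defaultdict(list)
    for cidr, location in networks:
        net = ipaddress.ip_network(cidr)
        start = int(net.network_address)
        end = int(net.broadcast_address)
        # A network wider than /16 is registered under every prefix it spans.
        for prefix in range(prefix_of(start), prefix_of(end) + 1):
            lookup[prefix].append((start, end, location))
    return lookup


def locate(ip, lookup):
    # Equality match on the prefix first, then a range check on the few candidates.
    ip_int = int(ipaddress.ip_address(ip))
    for start, end, location in lookup.get(prefix_of(ip_int), []):
        if start <= ip_int <= end:
            return location
    return None


LOOKUP = build_lookup(NETWORKS)
print(locate("10.2.3.4", LOOKUP))
```

In a warehouse, the same shape would be an equality join on the prefix column followed by a range filter, which avoids comparing every IP against every network.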

G-Research shares a good debugging story about how YARN's cgroup usage led to Linux kernel issues (and how to fix them!).

https://www.gresearch.co.uk/2019/01/28/hadoop-yarn-cgroup-stability-issues/

The team at Disney Streaming writes about how they've built a solution to auto-scale Amazon Kinesis Streams. The tool is built on AWS Lambda and predictively scales streams up or down based on log data.

https://medium.com/disney-streaming/delivering-data-in-real-time-via-auto-scaling-kinesis-streams-72a0236b2cd9

This post describes the components of Apache Kafka that depend on Apache ZooKeeper. It then describes how to replace those pieces with an implementation of the Raft protocol built using the Atomix framework. Code for the implementation is on GitHub.

https://medium.com/@lukasz.antoniak/apache-kafka-leaves-the-zoo-bef529ba82b7

Event sourcing can enable lots of compelling use cases, but like many architectural designs, it comes with trade-offs. In this post, the author shares his opinions and experience on a number of those, such as startup cost and the complexity of consuming an audit log.

https://chriskiehl.com/article/event-sourcing-is-hard

Zenreach writes about how they implemented Kafka with Kafka Streams to process events pertaining to customer data, offloading from Mongo. They share details about their implementation, including advice for testing and a gotcha with co-partitioning data.

https://www.confluent.io/blog/beginners-perspective-kafka-streams-building-real-time-walkthrough-detection
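
For readers new to co-partitioning, here is a minimal sketch of why it matters for joins: if two topics have different partition counts, the same key can land on different partition numbers. This does not reproduce the post's specific gotcha. The CRC32 hash is only a stand-in (Kafka's default partitioner uses murmur2), and the function names are hypothetical.

```python
import zlib


def stand_in_hash(key):
    # Assumption: CRC32 is a placeholder; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8"))


def partition_for(key, num_partitions, hash_fn=stand_in_hash):
    # Key-based partitioning: hash the key, then take it modulo the partition count.
    return hash_fn(key) % num_partitions


def mismatched_keys(keys, partitions_a, partitions_b, hash_fn=stand_in_hash):
    # Keys that would sit on different partition numbers in the two topics.
    return [
        k for k in keys
        if partition_for(k, partitions_a, hash_fn) != partition_for(k, partitions_b, hash_fn)
    ]


keys = [f"customer-{i}" for i in range(20)]
print(mismatched_keys(keys, 4, 4))  # same partition count: nothing mismatches
print(len(mismatched_keys(keys, 4, 6)))  # different counts: some keys usually diverge
```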

This post describes some tricks for loading data into Redshift with Apache Airflow, as well as for querying it efficiently. Examples include running schema migrations as the first step in a workflow and writing a separate workflow to VACUUM the database.

https://medium.com/velotio-perspectives/lessons-learnt-while-building-an-etl-pipeline-for-mongodb-amazon-redshift-using-apache-airflow-543bb0b75017

A look at configuring Hive-on-Spark on CDH, which involves some special tuning and workarounds (as only Spark 1.x is supported with this setup).

https://medium.com/getindata-blog/enabling-hive-on-spark-on-cdh-5-14-a-few-problems-and-solutions-e056479aed7f

This post draws parallels between managing Kafka topics and ACLs and managing RDBMS schemas, and it proposes a migration-like solution for managing those configurations.

https://riccomini.name/managing-kafka-topic-configuration
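
To make the migration analogy concrete, here is a small sketch that applies ordered, named changes exactly once and records which ones have run, much as a schema migration tool would. It is not the post's implementation: the cluster is an in-memory stand-in (no Kafka client is involved), and the topic name, settings, and migration IDs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    # In-memory stand-in for a real cluster's topic configs and migration history.
    topics: dict = field(default_factory=dict)
    applied: list = field(default_factory=list)


def create_orders(state):
    # Hypothetical migration: create a topic with initial settings.
    state.topics["orders"] = {"partitions": 6, "retention.ms": 604800000}


def extend_orders_retention(state):
    # Hypothetical migration: change one setting on an existing topic.
    state.topics["orders"]["retention.ms"] = 1209600000


MIGRATIONS = [
    ("001_create_orders", create_orders),
    ("002_extend_orders_retention", extend_orders_retention),
]


def migrate(state, migrations):
    # Apply each migration that has not run yet, in order, and record it.
    ran = []
    for migration_id, apply in migrations:
        if migration_id in state.applied:
            continue
        apply(state)
        state.applied.append(migration_id)
        ran.append(migration_id)
    return ran


state = ClusterState()
print(migrate(state, MIGRATIONS))
print(migrate(state, MIGRATIONS))  # a second run finds nothing left to apply
```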

Kubernetes provides some great primitives for working with distributed systems. One of them is the PodDisruptionBudget, which can help avoid the type of issue described in this post by limiting how many pods can be voluntarily evicted at once.

https://content.pivotal.io/blog/root-cause-of-an-application-outage-on-kubernetes-and-how-we-fixed-it
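
The rule a PodDisruptionBudget enforces can be sketched in a few lines. This is a simplified model, assuming a budget expressed as `minAvailable` and ignoring replacement pods that become healthy during a drain; the function names are made up for illustration.

```python
def eviction_allowed(healthy_pods, min_available):
    # A voluntary eviction removes one pod; allow it only if the budget still holds afterwards.
    return healthy_pods - 1 >= min_available


def drain(healthy_pods, min_available, evictions_requested):
    # Grant evictions one at a time until the budget would be violated.
    granted = 0
    for _ in range(evictions_requested):
        if not eviction_allowed(healthy_pods, min_available):
            break
        healthy_pods -= 1
        granted += 1
    return granted


# Three healthy pods with minAvailable=2: only one eviction can proceed.
print(drain(healthy_pods=3, min_available=2, evictions_requested=3))
```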

Scylla writes about the data migration tool that they've built for migrating data from Apache Cassandra to their DB. The tool has some interesting properties, such as the ability to resume from a checkpoint, preservation of TTL and modification time, and support for simple transformations.

https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/

A tutorial for using a Google Cloud Function to load data via Google Cloud Storage into BigQuery. The post covers configuring permissions, managing API key secrets (for capturing data from a third-party system), and deployment.

http://tamaszilagyi.com/blog/2019/2019-02-10-serverless/

A good overview of the big data landscape on AWS, from storage to data processing to orchestration to reading data. This is a good map of how the tools fit together with a brief introduction to each.

https://www.waitingforcode.com/data-aws/doing-data-aws-overview/read

This tutorial shows how to configure Filebeat (a tool from Elastic) to send log data to Kafka. It includes a Docker Compose demo.

https://medium.com/@itseranga/publish-logs-to-kafka-with-filebeat-74497ef7dafe

Jobs

Data Engineer - Python, Wooga, Berlin https://jobs.dataengweekly.com/jobs/63fbb5ea-1c49-463f-bda7-598a56a13831

Software Engineer, Value Platform, Nuna, Inc., San Francisco https://jobs.dataengweekly.com/jobs/4a88b4e0-3457-48da-977e-afa368cdd4f1

Data Engineer, Starship Technologies, Tallinn, Estonia https://jobs.dataengweekly.com/jobs/7c9152cb-d0b5-496d-b304-3252e0c01c3f

News

data Artisans, which was recently acquired by Alibaba, has been renamed Ververica. The blog has more about how the new name relates to plans for the future.

https://www.ververica.com/blog/introducing-our-new-name

Releases

Debezium 0.9.0 Final has been announced. This release of the change data capture tool adds a new connector for SQL Server, supports the latest versions of other supported databases and Apache Kafka, and includes several other improvements. There is more about the release on the Debezium blog.

https://debezium.io/blog/2019/02/05/debezium-0-9-0-final-released/

Version 5.2 of Databricks Runtime has been released. New features include an experimental time travel feature (more details in the second post), a fast Apache Parquet importer, and a notebook feature that presents tips and advice inline as part of query execution.

https://databricks.com/blog/2019/02/05/announcing-databricks-runtime-5-2.html
https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html

FANDOM has open sourced their Athena Alerter, a tool that analyzes Amazon Athena queries via a Lambda function and alerts when costs are high. They provide a CloudFormation template for deploying the required components.

https://medium.com/fandom-engineering/aws-athena-alerter-6508f882a216

LinkedIn has open sourced the Cruise Control Frontend, a UI for managing Kafka clusters and applying changes to them, with the changes executed by the Cruise Control project. It includes a number of features that are highlighted in the introductory blog post.

https://engineering.linkedin.com/blog/2019/02/introducing-kafka-cruise-control-frontend

Apache Hadoop 3.1.2 was released. It includes over 300 JIRAs, with improvements to Docker and GPU support on YARN, lots of improvements and bug fixes to YARN, and AliyunOSS improvements.

https://lists.apache.org/thread.html/a6da1d67d42b36b1c20a2ba3dbd6386be3bd991e87e0370b9d90e53e@%3Cgeneral.hadoop.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

California

Leveraging Microservices & Kafka to Scale Developer Productivity (Sunnyvale) - Tuesday, February 12
https://www.meetup.com/KafkaBayArea/events/258103477/

Maintaining Full Data Lineage + Migration & Change Data Capture with CDAP (Palo Alto) - Wednesday, February 13
https://www.meetup.com/BigDataApps/events/257120249/

Colorado

Dissolving the Problem: Kafka is more ACID Than Your Database (Denver) - Monday, February 11
https://www.meetup.com/ddd-denver/events/256878448/

Kafka Is More ACID Than Your Database (Denver) - Tuesday, February 12
https://www.meetup.com/Front-Range-Apache-Kafka/events/258147911/

Massachusetts

What Is Apache Kafka + Wayfair's Journey with Apache Kafka (Boston) - Tuesday, February 12
https://www.meetup.com/Boston-Apache-kafka-Meetup/events/257342143/

GERMANY

Microservices & KSQL (Hamburg) - Tuesday, February 12
https://www.meetup.com/Hamburg-Kafka/events/258573754/

How to Successfully Fail with Apache Kafka + Fraud Detection with KSQL (Berlin) - Wednesday, February 13
https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/258688406/

SWITZERLAND

Apache Flink @ Teralytics (Zurich) - Wednesday, February 13
https://www.meetup.com/Apache-Flink-Meetup-Zurich/events/258131778/

INDIA

Kafka and Stream Processing Meetup at LinkedIn (Bangalore) - Saturday, February 16
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/258256392/

AUSTRALIA

Developing Contextual, Event-Driven Applications with KSQL and Kafka (Sydney) - Tuesday, February 12
https://www.meetup.com/apache-kafka-sydney/events/258759506/

Using Apache Cassandra and Apache Kafka to Scale Next-Gen Applications (Melbourne) - Wednesday, February 13
https://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/257299627/

Bridging from Middleware to Event-Streaming (Perth) - Friday, February 15
https://www.meetup.com/Perth-Kafka/events/258701586/