10 February 2019
This week's issue has the usual amount of content on Kafka and streaming data, and it also has several articles on less frequent topics. These include Redshift, some debugging stories (such as one involving YARN and cgroups), Kubernetes, and loading data into BigQuery using Google Cloud Functions. In releases, there are some interesting new projects from LinkedIn (Kafka Cruise Control Frontend) and FANDOM (Athena Alerter).
This tutorial shows how to build a Redshift query that efficiently joins data from the MaxMind GeoIP database to analyze the location of IPs. The solution includes a neat trick to optimize the join cost by computing a lookup table that enables filtering by IP prefix.
https://towardsdatascience.com/the-easy-way-to-use-maxmind-geoip-with-redshift-65cf979e63b1
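The prefix idea can be illustrated outside of SQL. The following is a minimal Python sketch of the general technique, not the Redshift query from the article: the network ranges and labels are made up for the example, the /16 bucket size is an assumption, and only IPv4 is handled. Each range is registered under every 16-bit prefix it overlaps, so a lookup only scans ranges that share the IP's prefix instead of the whole table.

```python
import ipaddress
from collections import defaultdict

# Hypothetical sample ranges; a real GeoIP table has millions of rows.
NETWORKS = {
    "10.1.0.0/16": "office-a",
    "10.2.4.0/24": "office-b",
    "192.168.0.0/15": "lab",
}


def build_prefix_index(networks):
    """Map each 16-bit prefix to the (start, end, label) ranges overlapping it."""
    index = defaultdict(list)
    for cidr, label in networks.items():
        net = ipaddress.ip_network(cidr)
        start = int(net.network_address)
        end = int(net.broadcast_address)
        # A range wider than /16 is registered under every prefix it spans.
        for prefix in range(start >> 16, (end >> 16) + 1):
            index[prefix].append((start, end, label))
    return index


def locate(index, ip):
    """Return the label for ip, scanning only ranges that share its prefix."""
    value = int(ipaddress.ip_address(ip))
    for start, end, label in index.get(value >> 16, []):
        if start <= value <= end:
            return label
    return None


index = build_prefix_index(NETWORKS)
print(locate(index, "10.2.4.77"))    # office-b
print(locate(index, "192.169.3.1"))  # lab
print(locate(index, "8.8.8.8"))      # None
```

In a warehouse, the equivalent would be an equality join on the prefix column followed by a range filter, which lets the engine avoid comparing every IP against every range.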
G-Research shares a good debugging story about how YARN's cgroup usage led to Linux kernel issues (and how to fix them!).
https://www.gresearch.co.uk/2019/01/28/hadoop-yarn-cgroup-stability-issues/
The team at Disney Streaming writes about how they've built a solution to auto-scale Amazon Kinesis Streams. The tool is built on AWS Lambda and predictively scales streams up or down based on log data.
This post describes the components of Apache Kafka that depend on Apache ZooKeeper. It then describes how to replace those pieces with an implementation of the Raft protocol built using the Atomix framework. Code for the implementation is on GitHub.
https://medium.com/@lukasz.antoniak/apache-kafka-leaves-the-zoo-bef529ba82b7
Event sourcing can enable lots of compelling use cases, but like many architectural designs, it comes with trade-offs. In this post, the author shares his opinions and experience on a number of them, such as the startup cost and the complexity of consuming an audit log.
https://chriskiehl.com/article/event-sourcing-is-hard
Zenreach writes about how they implemented Kafka with Kafka Streams to process events pertaining to customer data, offloading from Mongo. They share details about their implementation, including advice for testing and a gotcha with co-partitioning data.
This post describes some tricks for loading data into Redshift with Apache Airflow and for querying it efficiently. Examples include running schema migrations as the first step in a workflow and writing a separate workflow to VACUUM the database.
A look at configuring Hive-on-Spark on CDH, which involves some special tuning and workarounds (as only Spark 1.x is supported with this setup).
This post draws parallels between managing Kafka topics and ACLs and managing RDBMS schemas, and it proposes a migration-like solution for managing those configurations.
https://riccomini.name/managing-kafka-topic-configuration
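As a rough illustration of what a migration-like approach could look like, here is a small Python sketch. It is not the tool or design from the post: the migration names, the `migrate` function, and the in-memory `topics` dictionary (standing in for a real cluster) are all hypothetical. The point is only the shape of the idea: ordered, named changes that are each applied once and recorded, as with database schema migrations.

```python
def create_orders(topics):
    # Hypothetical change: create a topic with an initial configuration.
    topics["orders"] = {"partitions": 6, "retention.ms": "604800000"}


def extend_orders_retention(topics):
    # Hypothetical change: raise retention from 7 to 14 days.
    topics["orders"]["retention.ms"] = "1209600000"


# Ordered list of (migration id, change) pairs, kept in version control.
MIGRATIONS = [
    ("001_create_orders", create_orders),
    ("002_extend_orders_retention", extend_orders_retention),
]


def migrate(topics, applied, migrations=MIGRATIONS):
    """Run each not-yet-applied migration in order and record its id."""
    ran = []
    for migration_id, change in migrations:
        if migration_id in applied:
            continue
        change(topics)
        applied.add(migration_id)
        ran.append(migration_id)
    return ran


topics, applied = {}, set()
print(migrate(topics, applied))  # both migrations run
print(migrate(topics, applied))  # nothing left to run
print(topics)
```

A real implementation would apply each change through Kafka's admin tooling and persist the set of applied ids somewhere durable; both are left out of this sketch.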
Kubernetes provides some great primitives for working with distributed systems. One of them is a PodDisruptionBudget, which can help avoid the type of issue described in this post by ensuring that pods aren't evicted when doing so would violate an availability constraint.
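For readers who haven't used one, a PodDisruptionBudget is a small manifest. The Python sketch below builds one as a dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The budget name, the `app` label, and the `minAvailable` value are illustrative assumptions, not details from the post, and the `policy/v1` API version applies to recent Kubernetes releases (older clusters used `policy/v1beta1`).

```python
import json


def pod_disruption_budget(name, app_label, min_available):
    """Build a PodDisruptionBudget manifest for pods labelled app=<app_label>."""
    return {
        # policy/v1 on recent clusters; older ones used policy/v1beta1.
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Evictions are refused if they would leave fewer pods than this.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Hypothetical example: keep at least 2 of 3 ZooKeeper pods running.
manifest = pod_disruption_budget("zk-pdb", "zookeeper", 2)
print(json.dumps(manifest, indent=2))
```

Note that a budget like this only limits voluntary disruptions, such as evictions during a node drain; it does not protect against node failures.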
Scylla writes about the data migration tool that they've built for migrating data from Apache Cassandra to their DB. The tool has some interesting properties, such as the ability to resume from a checkpoint, preservation of TTL and modification time, and support for simple transformations.
A tutorial for using a Google Cloud Function to load data via Google Cloud Storage into BigQuery. The post covers configuring permissions, managing API key secrets (for capturing data from a third-party system), and deployment.
http://tamaszilagyi.com/blog/2019/2019-02-10-serverless/
A good overview of the big data landscape on AWS, from storage to data processing to orchestration to reading data. It maps out how the tools fit together and gives a brief introduction to each.
https://www.waitingforcode.com/data-aws/doing-data-aws-overview/read
This tutorial shows how to configure Filebeat (a tool from Elastic) to send log data to Kafka. It includes a Docker Compose demo.
https://medium.com/@itseranga/publish-logs-to-kafka-with-filebeat-74497ef7dafe
Data Engineer - Python, Wooga, Berlin https://jobs.dataengweekly.com/jobs/63fbb5ea-1c49-463f-bda7-598a56a13831
Software Engineer, Value Platform, Nuna, Inc., San Francisco https://jobs.dataengweekly.com/jobs/4a88b4e0-3457-48da-977e-afa368cdd4f1
Data Engineer, Starship Technologies, Tallinn, Estonia https://jobs.dataengweekly.com/jobs/7c9152cb-d0b5-496d-b304-3252e0c01c3f
dataArtisans, which was recently acquired by Alibaba, has been renamed Ververica. The blog has more about how the new name relates to the company's plans for the future.
https://www.ververica.com/blog/introducing-our-new-name
Debezium 0.9.0 Final has been announced. This release of the change data capture tool adds a new connector for SQL Server, supports the latest versions of its other supported databases and Apache Kafka, and includes several other improvements. There is more about the release on the Debezium blog.
https://debezium.io/blog/2019/02/05/debezium-0-9-0-final-released/
Version 5.2 of Databricks Runtime has been released. New features include experimental time travel (more details in the second post), a fast Apache Parquet importer, and a notebook feature that presents tips and advice inline as part of query execution.
https://databricks.com/blog/2019/02/05/announcing-databricks-runtime-5-2.html
https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
FANDOM has open sourced their Athena Alerter, which is a tool for analyzing Amazon Athena queries via a Lambda function and alerting when costs are high. They have a CloudFormation template for deploying the required components.
https://medium.com/fandom-engineering/aws-athena-alerter-6508f882a216
LinkedIn has open sourced the Cruise Control Frontend, a UI for managing Kafka clusters and applying changes to them, with the changes executed by the Cruise Control project. It includes a number of features that are highlighted in the introductory blog post.
https://engineering.linkedin.com/blog/2019/02/introducing-kafka-cruise-control-frontend
Apache Hadoop 3.1.2 was released. It resolves over 300 JIRAs, including improvements to Docker and GPU support on YARN, many other YARN improvements and bug fixes, and AliyunOSS improvements.
Curated by Datadog ( http://www.datadog.com )
Leveraging Microservices & Kafka to Scale Developer Productivity (Sunnyvale) - Tuesday, February 12
https://www.meetup.com/KafkaBayArea/events/258103477/
Maintaining Full Data Lineage + Migration & Change Data Capture with CDAP (Palo Alto) - Wednesday, February 13
https://www.meetup.com/BigDataApps/events/257120249/
Dissolving the Problem: Kafka is more ACID Than Your Database (Denver) - Monday, February 11
https://www.meetup.com/ddd-denver/events/256878448/
Kafka Is More ACID Than Your Database (Denver) - Tuesday, February 12
https://www.meetup.com/Front-Range-Apache-Kafka/events/258147911/
What Is Apache Kafka + Wayfair's Journey with Apache Kafka (Boston) - Tuesday, February 12
https://www.meetup.com/Boston-Apache-kafka-Meetup/events/257342143/
Microservices & KSQL (Hamburg) - Tuesday, February 12
https://www.meetup.com/Hamburg-Kafka/events/258573754/
How to Successfully Fail with Apache Kafka + Fraud Detection with KSQL (Berlin) - Wednesday, February 13
https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/258688406/
Apache Flink @ Teralytics (Zurich) - Wednesday, February 13
https://www.meetup.com/Apache-Flink-Meetup-Zurich/events/258131778/
Kafka and Stream Processing Meetup at LinkedIn (Bangalore) - Saturday, February 16
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/258256392/
Developing Contextual, Event-Driven Applications with KSQL and Kafka (Sydney) - Tuesday, February 12
https://www.meetup.com/apache-kafka-sydney/events/258759506/
Using Apache Cassandra and Apache Kafka to Scale Next-Gen Applications (Melbourne) - Wednesday, February 13
https://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/257299627/
Bridging from Middleware to Event-Streaming (Perth) - Friday, February 15
https://www.meetup.com/Perth-Kafka/events/258701586/