18 March 2018
Lots of technical posts on Apache Kafka and stream processing this week. There are also great articles on Apache Airflow at Pandora, Apache Spark, Presto, and Apache Hive for data warehousing. Apache Flink had a release, and the call for proposals for Strata NY ends in a couple of days.
Cut HBase storage needs by 50%; elevate job throughput by 260%. ScaleFlux Computational Storage cohesively optimizes your Hadoop HBase infrastructure. Significant performance leaps on Spark, RocksDB, and Aerospike as well. Now shipping - see more:
http://bit.ly/scaleflux-computational-storage-hbase-260percent-faster
Configuring nodes in a Hadoop/Spark cluster with non-JVM dependencies has always required jumping through some hoops. If you're trying to figure out the right hoops to jump through for running PySpark with pandas and other Python libraries, this post outlines a solution and provides some sample code.
http://www.florianwilhelm.info/2018/03/isolated_environments_with_pyspark/
Streaming ETL provides a number of advantages over batch—and the tools are getting better and better every month. This post describes many of the advantages, including improved latency and the ability to more easily send data to multiple downstream systems.
https://rmoff.net/2018/03/06/why-do-we-need-streaming-etl/
This article summarizes a number of design decisions that Netflix has made for their Keystone stream processing system. It's full of tips for operating Kafka at scale, including the practice of having a separate "fronting" Kafka cluster to scale producers and consumers separately, how Netflix sizes clusters and topics, and how they handle cluster upgrades.
GO-JEK, an Indonesian tech startup, has written about the tech stack they use to process 6 billion events/day. Built on Apache Kafka, Apache Flink, Kubernetes, Chef, and InfluxDB (to name a few), the infrastructure supports 18 product lines. The post also talks about how they've implemented monitoring and resiliency tests.
http://beckje01.com/talks/greach-2018-zipkin-light.html
An initial look at migrating from Apache Impala to Presto, this post describes some of the architectural differences (e.g. metadata caching, statistics) and Presto's support for Apache Parquet files.
https://medium.com/@adirmashiach/facebook-prestodb-full-review-4ba59720a92
The monitoring system Prometheus has a Java Agent to expose JMX metrics over HTTP. Banzai has a post that details how to configure it with Apache Kafka. They've also updated their post on integrating Spark and Prometheus; there's now a Spark package to pull in the Prometheus metrics sink.
https://banzaicloud.com/blog/monitoring-kafka-prometheus/
https://banzaicloud.com/blog/spark-prometheus-sink/
A good overview of the key components of Apache Kafka security: TLS, certificates, SASL, and ACLs. There are also a number of links to dig deeper into each of the topics.
https://medium.com/@stephane.maarek/introduction-to-apache-kafka-security-c8951d410adf
The Databricks blog has an overview of stream-stream joins, a new feature of Apache Spark 2.3. They demonstrate the various concepts using the example of joining streams of ad impressions and clicks.
https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
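For readers who want a feel for what a stream-stream join computes, here is a minimal sketch in plain Python rather than Spark. It pairs each click with an impression of the same ad that happened shortly before it. The function name, the tuple layout, and the one-hour window are assumptions made for illustration, and the sketch works on in-memory lists, so it ignores the watermarking and state cleanup that a real streaming engine has to handle.

```python
from datetime import datetime, timedelta


def join_impressions_with_clicks(impressions, clicks, max_delay=timedelta(hours=1)):
    """Pair each click with impressions of the same ad shown at most max_delay earlier.

    Both inputs are lists of (ad_id, timestamp) tuples (a hypothetical layout).
    """
    # Index impression times by ad id so each click only scans its own ad.
    shown_by_ad = {}
    for ad_id, shown_at in impressions:
        shown_by_ad.setdefault(ad_id, []).append(shown_at)

    joined = []
    for ad_id, clicked_at in clicks:
        for shown_at in shown_by_ad.get(ad_id, []):
            # Time-range condition: the click must follow the impression
            # within the allowed delay.
            if shown_at <= clicked_at <= shown_at + max_delay:
                joined.append((ad_id, shown_at, clicked_at))
    return joined


t0 = datetime(2018, 3, 13, 12, 0)
impressions = [("ad-1", t0), ("ad-2", t0)]
clicks = [
    ("ad-1", t0 + timedelta(minutes=5)),  # inside the window, so it matches
    ("ad-2", t0 + timedelta(hours=3)),    # too late, so it is dropped
]
matches = join_impressions_with_clicks(impressions, clicks)
```

In a streaming engine the same time bound also tells the system how long it must keep old impressions around before it can safely discard them.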
New Relic has posted the first two parts in their series on Apache Kafka. Part one covers how they use Kafka for streaming applications, to snapshot state of microservices, and for distributing work using the disruptor pattern. Part two has a good overview of Kafka partitioning and describes how New Relic orchestrates Kafka with Marathon.
https://blog.newrelic.com/2018/03/12/apache-kafka-event-processing/
https://blog.newrelic.com/2018/03/13/effective-strategies-kafka-topic-partitioning/
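As a rough illustration of keyed partitioning, the sketch below maps each record key to a partition by hashing it and taking the result modulo the partition count. It is not Kafka's actual partitioner: Kafka's default client hashes keys with murmur2, and CRC32 stands in here only because it ships with Python. The function names and sample keys are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Assumption: CRC32 is a stand-in for the hash a real Kafka client uses.
    """
    return zlib.crc32(key) % num_partitions


def group_by_partition(records, num_partitions):
    """Distribute (key, value) records across partitions, preserving arrival order."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append(value)
    return partitions


records = [(b"account-1", "a"), (b"account-2", "b"), (b"account-1", "c")]
assignment = group_by_partition(records, num_partitions=4)
```

Because the mapping depends only on the key and the partition count, all records for one key land in the same partition and keep their relative order. Changing the partition count changes the mapping, which is one reason the number of partitions is worth choosing carefully up front.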
Pandora has written about their migration to Apache Airflow for workflow scheduling and how they operate it within the company. They provision Airflow instances on demand using Ansible, have integrated with supervisor, consul, Prometheus, and Grafana for monitoring, and make use of Airflow's LDAP authentication. The post also describes some of the issues that they have run into with it, some of which led to them maintaining their own fork.
https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee
Apache Flink made a lot of progress towards their next release in February, including changes to the deployment model (which will improve deployments on Docker/Kubernetes), a new SQL CLI, local task state recovery to speed up recovery, and job rescaling.
https://data-artisans.com/blog/apache-flink-master-branch-monthly-whats-new-flink-february-2018
A good intro to Spark checkpointing, how it compares to caching, the trade-offs of enabling checkpointing, and when you may want to use it.
https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371
This post provides an overview of the optimizations that Apache Kafka implements to achieve high throughput.
https://medium.com/@kharekartik/what-makes-apache-kafka-so-fast-a8d4f94ab145
A thorough overview of the options (and their tradeoffs) for streaming data from an RDBMS into Apache Kafka.
https://www.confluent.io/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc
DataXu has the latest in a series on lessons learned in data warehousing. In this post, they talk about how to enable various ACID semantics through deliberate usage of metadata stored in the Hive Metastore. For example, you can end a long, multi-table transaction by repointing the tables to new datasets with Hive's alter table ... set location command. The post also talks about best practices with Spark SQL and for schema evolution.
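To make the repointing idea concrete, here is a small sketch that generates the HiveQL for switching a set of tables to freshly written datasets. It only builds the statements and does not run them. The table names, bucket paths, and the helper itself are hypothetical and are not taken from the DataXu post.

```python
def repoint_statements(new_locations):
    """Build one ALTER TABLE ... SET LOCATION statement per table.

    new_locations maps a table name to the path of its newly written dataset.
    Statements are sorted by table name so the output is deterministic.
    """
    return [
        f"ALTER TABLE {table} SET LOCATION '{location}'"
        for table, location in sorted(new_locations.items())
    ]


# Hypothetical tables and paths, for illustration only.
statements = repoint_statements({
    "sales.orders": "s3://example-bucket/orders/v2",
    "sales.order_items": "s3://example-bucket/order_items/v2",
})
for statement in statements:
    print(statement + ";")
```

The appeal of this approach is that the expensive work (writing the new datasets) happens out of sight of readers, and the final switch is a cheap metadata update per table.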
The Landoop blog has a couple of guest posts from WalmartLabs on integrating Apache Kafka with Apache Cassandra. The series focuses on sending data from Cassandra to Kafka using Kafka Connect. The first part describes the basics of the connector and the second part describes optimizations.
http://www.landoop.com/blog/2018/03/cassandra-to-kafka/
http://www.landoop.com/blog/2018/03/cassandra-to-kafka-part-2/
There's a week left to apply to the Insight Data Engineering Fellows program. It's a 7-week training fellowship in Boston, New York, or Silicon Valley.
https://www.insightdataengineering.com/
Strata goes to NYC in September, and the Call for Proposals is open until end of day Tuesday.
https://conferences.oreilly.com/strata/strata-ny/
Apache Flink 1.3.3 is out. It includes a few bug fixes (including one for checkpoints) and an improvement.
http://flink.apache.org/news/2018/03/15/release-1.3.3.html
Airflow Plugins is maintaining a list of plugins related to Apache Airflow.
https://github.com/airflow-plugins
Curated by Datadog ( http://www.datadog.com )
Big Data, San Francisco v 4.0 (San Francisco) - Tuesday, March 20
https://www.meetup.com/BigDataSanFran/events/247510958/
Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Wednesday, March 21
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/248309045/
Data Operations: A New Discipline for Data in Motion (Sunnyvale) - Wednesday, March 21
https://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/248075737/
How to Use Java 8 Streams to Access Existing Data with Ultra-Low Latency (Santa Clara) - Wednesday, March 21
https://www.meetup.com/datariders/events/244709565/
Cleveland Big Data Meetup (Mayfield Village) - Monday, March 19
https://www.meetup.com/Cleveland-Hadoop/events/246719721/
How to Share State across Multiple Spark Jobs Using Apache Ignite (Roswell) - Tuesday, March 20
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/247807609/
High-Speed Data Visualization: Kafka Meets Elasticsearch (Ashburn) - Wednesday, March 21
https://www.meetup.com/Apache-Kafka-DC/events/247769441/
Kafka: Streaming Data Query Using KSQL (Richmond) - Thursday, March 22
https://www.meetup.com/rva-new-tech/events/247727677/
Microservices and Kafka with Node.js (Medellin) - Thursday, March 22
https://www.meetup.com/node_co/events/248803942/
Node.js with Kafka and GraphQL (London) - Tuesday, March 20
https://www.meetup.com/LNM-London-Node-JS-Meetup/events/247995267/
Schibsted Data Journey (Barcelona) - Thursday, March 22
https://www.meetup.com/Spark-Barcelona/events/248665472/
Building Event-Driven Microservices with Apache Kafka and Kafka Streams (Paris) - Wednesday, March 21
https://www.meetup.com/Microservices-Paris/events/248054405/
Apache Kafka by Alex Piermatteo & KSQL by Kai Waehner (Munich) - Wednesday, March 21
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/248452086/
"Classic" ETL Approach in Hadoop Ecosystem (Prague) - Thursday, March 22
https://www.meetup.com/CS-HUG/events/248656015/
Elasticsearch Best Practises for Performance and Scale (Oslo) - Thursday, March 22
https://www.meetup.com/Data-Engineering-Oslo/events/248537590/
The Move from Micro-Service to Event-Sourcing Architecture (Ramat Gan) - Wednesday, March 21
https://www.meetup.com/ApacheKafkaTLV/events/247948810/