Data Eng Weekly


Data Eng Weekly Issue #256

18 March 2018

Lots of technical posts on Apache Kafka and stream processing this week. There are also great articles on Apache Airflow at Pandora, Apache Spark, Presto, and Apache Hive for data warehousing. Apache Flink had a release, and the call for proposals for Strata NY ends in a couple of days.

Sponsor

Cut HBase storage needs by 50%; elevate job throughput by 260%. ScaleFlux Computational Storage cohesively optimizes your Hadoop HBase infrastructure. Significant performance leaps on Spark, RocksDB, and Aerospike as well. Now shipping - see more:

http://bit.ly/scaleflux-computational-storage-hbase-260percent-faster

Technical

Configuring nodes in a Hadoop/Spark cluster with non-JVM dependencies has always required jumping through some hoops. If you're trying to figure out the right hoops to jump through for running PySpark with pandas and other Python libraries, this post outlines a solution and provides some sample code.

http://www.florianwilhelm.info/2018/03/isolated_environments_with_pyspark/

Streaming ETL provides a number of advantages over batch—and the tools are getting better and better every month. This post describes many of the advantages, including improved latency and the ability to more easily send data to multiple downstream systems.

https://rmoff.net/2018/03/06/why-do-we-need-streaming-etl/

This article summarizes a number of design decisions that Netflix has made for their Keystone stream processing system. It's full of tips for operating Kafka at scale, including the practice of having a separate "fronting" Kafka cluster to scale producers and consumers separately, how Netflix sizes clusters and topics, and how they handle cluster upgrades.

https://quickbooks-engineering.intuit.com/lessons-learnt-from-netflix-keystone-pipeline-with-trillions-of-daily-messages-64cc91b3c8ea

GO-JEK, an Indonesian tech startup, has written about the tech stack they use to process 6 billion events/day. Built on Apache Kafka, Apache Flink, Kubernetes, Chef, and InfluxDB (to name a few), the infrastructure supports 18 product lines. The post also talks about how they've implemented monitoring and resiliency tests.

http://beckje01.com/talks/greach-2018-zipkin-light.html

An initial look at migrating from Apache Impala to Presto, this post describes some of the architectural differences (e.g. metadata caching, statistics) and Presto's support for Apache Parquet files.

https://medium.com/@adirmashiach/facebook-prestodb-full-review-4ba59720a92

The monitoring system Prometheus has a Java Agent to expose JMX metrics over HTTP. Banzai has a post that details how to configure it with Apache Kafka. They've also updated their post on integrating Spark and Prometheus: there's now a Spark package to pull in the Prometheus metrics sink.

https://banzaicloud.com/blog/monitoring-kafka-prometheus/
https://banzaicloud.com/blog/spark-prometheus-sink/

A good overview of the key components of Apache Kafka security: TLS, certificates, SASL, and ACLs. There are also a number of links to dig deeper into each of the topics.

https://medium.com/@stephane.maarek/introduction-to-apache-kafka-security-c8951d410adf

The Databricks blog has an overview of stream-stream joins, a new feature of Apache Spark 2.3. They demonstrate the various concepts using the example of joining streams of ad impressions and clicks.

https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
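
To make the join condition in the impressions/clicks example concrete, here is a minimal plain-Python sketch. It is not Spark code and does not model watermarks or state cleanup; it only shows an inner join on an ad id where the click must arrive within a bounded delay after the impression. The function name, the (ad_id, timestamp) tuple layout, and the one-hour default are assumptions made for illustration.

```python
from collections import defaultdict


def join_clicks_to_impressions(impressions, clicks, max_delay_s=3600):
    """Inner-join clicks to impressions on ad_id with a time-range condition.

    impressions and clicks are iterables of (ad_id, timestamp_seconds).
    A click matches an impression when it has the same ad_id and happens
    no earlier than the impression and at most max_delay_s seconds later.
    """
    # Index impression timestamps by ad id (hypothetical in-memory state).
    impressions_by_ad = defaultdict(list)
    for ad_id, impression_ts in impressions:
        impressions_by_ad[ad_id].append(impression_ts)

    joined = []
    for ad_id, click_ts in clicks:
        for impression_ts in impressions_by_ad.get(ad_id, []):
            if impression_ts <= click_ts <= impression_ts + max_delay_s:
                joined.append((ad_id, impression_ts, click_ts))
    return joined
```

In a real streaming engine the buffered impressions cannot grow forever, which is why a time bound like max_delay_s matters: it tells the engine when old state can no longer match and may be dropped.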

New Relic has posted the first two parts in their series on Apache Kafka. Part one covers how they use Kafka for streaming applications, to snapshot state of microservices, and for distributing work using the disruptor pattern. Part two has a good overview of Kafka partitioning and describes how New Relic orchestrates Kafka with Marathon.

https://blog.newrelic.com/2018/03/12/apache-kafka-event-processing/
https://blog.newrelic.com/2018/03/13/effective-strategies-kafka-topic-partitioning/
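
As a rough illustration of key-based partitioning, the sketch below maps each message key to a partition with a stable hash, so every message with the same key lands on the same partition. It is a simplification: CRC32 is used here as a stand-in hash and is not what Kafka's own partitioner uses, and the function names are hypothetical.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition for a key using a stable hash (CRC32 as a stand-in)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


def assign(keys, num_partitions):
    """Group string keys by the partition they hash to."""
    partitions = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[partition_for(key.encode("utf-8"), num_partitions)].append(key)
    return partitions
```

Note that the mapping depends on num_partitions, so changing the partition count of a topic generally changes where a given key lands.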

Pandora has written about their migration to Apache Airflow for workflow scheduling and how they operate it within the company. They provision Airflow instances on demand using Ansible, have integrated with supervisor, consul, Prometheus, and Grafana for monitoring, and make use of Airflow's LDAP authentication. The post also describes some of the issues that they have run into with it, some of which led to them maintaining their own fork.

https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee

Apache Flink made a lot of progress towards their next release in February, including changes to the deployment model (which will improve deployments on Docker/Kubernetes), a new SQL CLI, local task state recovery to speed up recovery, and job rescaling.

https://data-artisans.com/blog/apache-flink-master-branch-monthly-whats-new-flink-february-2018

A good intro to Spark checkpointing, how it compares to caching, the trade-offs of enabling checkpointing, and when you may want to use it.

https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371

This post provides an overview of the optimizations that Apache Kafka implements to achieve high throughput.

https://medium.com/@kharekartik/what-makes-apache-kafka-so-fast-a8d4f94ab145

A thorough overview of the options (and their tradeoffs) for streaming data from an RDBMS into Apache Kafka.

https://www.confluent.io/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc

DataXu has the latest in a series on lessons learned in data warehousing. In this post, they talk about how to enable various ACID semantics through deliberate usage of metadata stored in the Hive Metastore. For example, you can end a long, multi-table transaction by repointing the tables to new datasets with Hive's alter table ... set location command. The post also talks about best practices with Spark SQL and for schema evolution.

https://medium.com/dataxutech/separation-of-metadata-and-data-building-a-cloud-native-warehouse-part-3-72680d1d97f2
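
The repointing idea can be sketched with a toy in-memory model. This is not Hive or the Hive Metastore API; the class, the function, and the storage paths below are all hypothetical and only illustrate the pattern of writing new data to fresh locations and then switching the table pointers at the end.

```python
class ToyMetastore:
    """Toy stand-in for a metastore: maps table names to data locations."""

    def __init__(self):
        self._locations = {}

    def set_location(self, table, location):
        # Plays the role of repointing a table to a new dataset.
        self._locations[table] = location

    def location(self, table):
        return self._locations[table]


def commit(metastore, staged_locations):
    """Finish a multi-table job by repointing each table to its new dataset."""
    for table, new_location in staged_locations.items():
        metastore.set_location(table, new_location)


metastore = ToyMetastore()
metastore.set_location("orders", "/warehouse/orders/v1")        # hypothetical path
metastore.set_location("customers", "/warehouse/customers/v1")  # hypothetical path

# New data is written to fresh locations while readers still see v1.
staged = {
    "orders": "/warehouse/orders/v2",
    "customers": "/warehouse/customers/v2",
}
commit(metastore, staged)
```

Until commit runs, readers resolving a table through the metastore keep seeing the old location. In this toy version the tables are repointed one at a time, so it does not by itself make the multi-table switch atomic.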

The Landoop blog has a couple of guest posts from WalmartLabs on integrating Apache Kafka with Apache Cassandra. The series focuses on sending data from Cassandra to Kafka using Kafka Connect. The first part describes the basics of the connector and the second part describes optimizations.

http://www.landoop.com/blog/2018/03/cassandra-to-kafka/
http://www.landoop.com/blog/2018/03/cassandra-to-kafka-part-2/

News

There's a week left to apply to the Insight Data Engineering Fellows program. It's a 7-week training fellowship in Boston, New York, or Silicon Valley.

https://www.insightdataengineering.com/

Strata goes to NYC in September, and the Call for Proposals is open until end of day Tuesday.

https://conferences.oreilly.com/strata/strata-ny/

Releases

Apache Flink 1.3.3 is out. It includes a few bug fixes (including one for checkpoints) and an improvement.

http://flink.apache.org/news/2018/03/15/release-1.3.3.html

Airflow Plugins is maintaining a list of plugins related to Apache Airflow.

https://github.com/airflow-plugins

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Big Data, San Francisco v 4.0 (San Francisco) - Tuesday, March 20
https://www.meetup.com/BigDataSanFran/events/247510958/

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Wednesday, March 21
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/248309045/

Data Operations: A New Discipline for Data in Motion (Sunnyvale) - Wednesday, March 21
https://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/248075737/

How to Use Java 8 Streams to Access Existing Data with Ultra-Low Latency (Santa Clara) - Wednesday, March 21
https://www.meetup.com/datariders/events/244709565/

Ohio

Cleveland Big Data Meetup (Mayfield Village) - Monday, March 19
https://www.meetup.com/Cleveland-Hadoop/events/246719721/

Georgia

How to Share State across Multiple Spark Jobs Using Apache Ignite (Roswell) - Tuesday, March 20
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/247807609/

Virginia

High-Speed Data Visualization: Kafka Meets Elasticsearch (Ashburn) - Wednesday, March 21
https://www.meetup.com/Apache-Kafka-DC/events/247769441/

Kafka: Streaming Data Query Using KSQL (Richmond) - Thursday, March 22
https://www.meetup.com/rva-new-tech/events/247727677/

COLOMBIA

Microservices and Kafka with Node.js (Medellin) - Thursday, March 22
https://www.meetup.com/node_co/events/248803942/

UNITED KINGDOM

Node.js with Kafka and GraphQL (London) - Tuesday, March 20
https://www.meetup.com/LNM-London-Node-JS-Meetup/events/247995267/

SPAIN

Schibsted Data Journey (Barcelona) - Thursday, March 22
https://www.meetup.com/Spark-Barcelona/events/248665472/

FRANCE

Building Event-Driven Microservices with Apache Kafka and Kafka Streams (Paris) - Wednesday, March 21
https://www.meetup.com/Microservices-Paris/events/248054405/

GERMANY

Apache Kafka by Alex Piermatteo & KSQL by Kai Waehner (Munich) - Wednesday, March 21
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/248452086/

CZECH REPUBLIC

"Classic" ETL Approach in Hadoop Ecosystem (Prague) - Thursday, March 22
https://www.meetup.com/CS-HUG/events/248656015/

NORWAY

Elasticsearch Best Practises for Performance and Scale (Oslo) - Thursday, March 22
https://www.meetup.com/Data-Engineering-Oslo/events/248537590/

ISRAEL

The Move from Micro-Service to Event-Sourcing Architecture (Ramat Gan) - Wednesday, March 21
https://www.meetup.com/ApacheKafkaTLV/events/247948810/