Data Eng Weekly


Data Eng Weekly Issue #292

09 December 2018

Lots of great content this week covering a variety of topics, like Apache Pulsar, Amazon Redshift, Apache Spark, TimescaleDB, and distributed consensus in FaunaDB. There are several posts about Apache Kafka, covering its architecture, Kafka Streams, and Kafka at PayPal. In releases, there are new versions of several projects, and there's a new open-source project from LinkedIn that makes it easy to author User Defined Functions that execute efficiently in multiple execution engines.

Sponsor

Introducing the first AI / Machine Learning course with a job guarantee. Springboard's new AI / Machine Learning Career Track is an intensive program that will equip you to transition into a role as a machine learning engineer.

With support from their career services team, they're confident that you will find a job within six months of completing the course. But if you don’t, they'll refund your tuition.

Learn more today: http://bit.ly/springboard-mle

Technical

This post provides an introduction to horizontal scaling, and when it's worth doing (given the complexity it adds).

https://blog.wallaroolabs.com/2018/11/horizontal-scaling-reasons/

The Alibaba Cloud blog has a thorough overview of Apache Kafka, covering how it relates to traditional message queues, its scaling characteristics, and its core components like brokers, topics, and partitions. The post has a number of good diagrams and pictures to illustrate these concepts.

https://www.alibabacloud.com/blog/an-overview-of-kafka-distributed-message-system_594218

This post is a good introduction to the new security features of Apache Kafka 2.0. There are examples, with a description of tradeoffs, of a couple of ways to configure topic-level permissions. It also discusses how to do so with KSQL and Kafka Streams.

https://www.confluent.io/blog/kafka-streams-ksql-minimum-privileges

Netflix has a post describing their approach to replicating data between cache servers in multiple availability zones. Interestingly, their architecture started with Apache Kafka for replication but moved to an S3+SQS architecture for the reasons they enumerate.

https://medium.com/netflix-techblog/cache-warming-agility-for-a-stateful-service-2d3b1da82642

Udemy has a good post on their experience with scaling Amazon Redshift and improving its performance. They cover some of the key Redshift concepts, like key distribution (and some tips for it), how to monitor tables for best practices, and how they analyzed and resolved issues with query queuing.

https://medium.com/udemy-engineering/improving-amazon-redshift-performance-our-data-warehouse-story-5ec1282c13d8

TimescaleDB is an interesting extension to Postgres that adds optimized support for time series data. In this post, ShiftLeft writes about why they chose TimescaleDB, how they downsample and query downsampled data, how they monitor with Prometheus, and more.

https://blog.shiftleft.io/time-series-at-shiftleft-e1f98196909b
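
For readers new to the term, downsampling means rolling raw points up into coarser time buckets. The sketch below is a generic illustration in plain Python, not ShiftLeft's implementation and not TimescaleDB's API; the function name `downsample`, the sample data, and the choice of averaging are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) pairs into fixed-width time buckets.

    Hypothetical helper for illustration only. Timestamps are integer
    seconds; each bucket is labeled by its start time.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Round the timestamp down to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    # Emit one averaged point per bucket, in time order.
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Made-up sample data: four raw readings rolled up into one-minute buckets.
raw = [(0, 1.0), (20, 3.0), (65, 10.0), (110, 20.0)]
per_minute = downsample(raw, 60)
print(per_minute)
```

Averaging is only one possible rollup; min, max, or last-value aggregates follow the same bucketing pattern.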

A good introduction to the Calvin protocol, which is used by FaunaDB to provide distributed consistency with high availability. The post describes the CAP theorem, the evolution of distributed databases, Google Spanner, and a bit more.

https://www.infoq.com/articles/relational-nosql-fauna

If you're in an organization using Windows, this post covers centralizing Exchange Server logs using the Amazon Kinesis Agent for Windows.

https://aws.amazon.com/blogs/big-data/manage-centralized-microsoft-exchange-server-logs-using-amazon-kinesis-agent-for-windows/

Zhaopin, which operates a large job board in China, writes about how they rolled out Apache Pulsar to replace RabbitMQ and Apache Kafka. They describe the features of Pulsar and the reasoning that led to their choice.

https://medium.com/@codelipenghui/simplifying-zhaopins-event-center-with-apache-pulsar-9784b73bead1

Hortonworks recently added support for Kafka Streams to their platform. They've published a comparison between Kafka Streams and the other streaming platforms they support: Apache Storm and Apache Spark Structured Streaming.

https://hortonworks.com/blog/kafka-streams-is-it-the-right-stream-processing-engine-for-you/

QuickBooks has two posts about their data platform, in which they describe their tools for automating data discovery, detecting schema changes, and tracking data lineage. The implementation details are sparse, but these tools might act as good inspiration (there are some very compelling screenshots!) if you're building something similar.

https://quickbooks-engineering.intuit.com/automating-data-sources-discovery-governance-with-virtual-steward-91bbbea25bed
https://quickbooks-engineering.intuit.com/demystifying-complex-data-pipeline-lineage-with-superglue-d5b4014b1482

A good collection of hard-earned tips and tricks for Apache Spark. These include suggestions for monitoring with Graphite, analyzing GC logs with GCeasy, using JDBC sources and sinks, and some app-level optimizations.

https://medium.com/teads-engineering/spark-from-the-trenches-part-2-f2ff9ab67ea1

PayPal has over 7PB of data in Apache Kafka, so they've got some good experience with building fast data products. This presentation covers their architecture, like how they use change data capture and why you should use Apache Avro for your data in Kafka.

https://www.slideshare.net/r39132/big-data-fast-data-paypal

A collection of exercises to learn Apache Kafka Streams. There's a test suite to verify that your solution is correct.

https://github.com/ardlema/kafka-streams-workshop

The Hortonworks blog has a post describing how to build a Kafka Streams-based application to capture and analyze sensor data from a trucking fleet. The post describes the several Kafka Streams microservices (and provides working Java code) to analyze, join, and window the various data streams.

https://hortonworks.com/blog/building-secure-and-governed-microservices-with-kafka-streams/

News

The Calls for Papers for Kafka Summits New York and London in 2019 are now open. The deadline for both is December 20th.

https://www.confluent.io/blog/kafka-summit-2019-cfps-tracks-office-hours

This article has a pretty good comparison of the responsibility breakdown between data engineers and data scientists.

https://www.digitalsource.io/data-engineer-vs-data-scientist

The Call for Proposals for DataEngConf SF, which takes place in April, is also open through January 11th.

https://www.datacouncil.ai/data-science-engineering-call-for-proposals-decsf19

Jobs

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

Releases

Transport is a recently open-sourced (from LinkedIn) API and framework for writing User Defined Functions that work across a number of projects (e.g. Hive, Presto, and Spark). The introductory post from the authors describes how they've architected the project for high performance by automatically generating platform-specific wrappers.

https://engineering.linkedin.com/blog/2018/11/using-translatable-portable-UDFs
https://github.com/linkedin/transport

Wallaroo 0.6.0 is out with a new Wallaroo API. The release notes have a fantastic overview of what's changed in the new release of their streaming platform and how to upgrade.

https://github.com/WallarooLabs/wallaroo/releases/tag/0.6.0

DataStax Enterprise 6.7 was released. The release includes a new Kafka Connector, backups to blob storage, an enhanced Spark Connector for analytics, and more.

https://www.datastax.com/2018/12/announcing-datastax-enterprise-6-7-and-more

Apache HBase 2.0.3 is out with stability and bug fixes.

https://lists.apache.org/thread.html/8cee92d870d733d2b02ac29843f345ac9cbde5f4507253993a0d8208@%3Cannounce.apache.org%3E

Version 2.5.2 of Apache Kylin, the OLAP interface for the Hadoop ecosystem, is out.

https://lists.apache.org/thread.html/fa73e8a9e24ba47813bba7a76c08ab1bfabb3f672413c756cbafb030@%3Cannounce.apache.org%3E

Apache CouchDB released version 2.3. The new version has improved security, better performance, bug fixes, and several operational improvements.

http://docs.couchdb.org/en/stable/whatsnew/2.3.html

Apache Impala 3.1 is out. It includes some major new features, like support for ORC files.

https://lists.apache.org/thread.html/6006ea5ad583e95b7a9c7500690a0451b4085b744df1ffdec29ded1b@%3Cannounce.apache.org%3E

A calculator, in OpenDocument spreadsheet format, for modeling Kafka clusters (including instance types/costs for various cloud providers).

https://github.com/jkorab/kafka-cloud-calculator

Events

Curated by Datadog ( http://www.datadog.com )

California

Bay Area Flink Meetup @ Mix (Mountain View) - Tuesday, December 11
https://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/256576273/

Georgia

The Present State of Operational Analytics and Apache Druid (Atlanta) - Thursday, December 13
https://www.meetup.com/PyData-Atlanta/events/255458627/

UNITED KINGDOM

Scalability with Apache Kafka & Overcome the Challenges of Being a Woman in Tech (London) - Tuesday, December 11
https://www.meetup.com/betfair-women-in-tech/events/256648404/

SPAIN

Using Apache Kafka from Go by Javier Sanz (Madrid) - Tuesday, December 11
https://www.meetup.com/go-mad/events/256806284/

FRANCE

Apache Kafka (Brest) - Monday, December 10
https://www.meetup.com/FinistDevs/events/256912656/

First Apache Kafka Lyon Meetup! (Lyon) - Monday, December 10
https://www.meetup.com/Lyon-Kafka-meetup/events/256621331/

BELGIUM

Kafka at Klarrio (Antwerpen) - Tuesday, December 11
https://www.meetup.com/Belgium-Kafka/events/255741326/

GERMANY

ATM Fraud Detection with Apache Kafka and KSQL (Frankfurt) - Monday, December 10
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/256599175/

SINGAPORE

Event Driven Microservices with Kafka & Evolution of the Data Pipeline in Agoda (Singapore) - Tuesday, December 11
https://www.meetup.com/Singapore-Kafka-Meetup/events/256649129/

Good Company Series ft. Shopee: Data Engineering & Analytics Sharing (Singapore) - Thursday, December 13
https://www.meetup.com/meetup-group-iGvDQjKN/events/256935189/