Data Eng Weekly Issue #291

02 December 2018

Lots of coverage of Apache Kafka and change data capture this week, including Kafka at Pinterest, why Twitter is moving to Kafka, and change data capture for MySQL and MongoDB. There are also posts from Criteo on their real-time Apache Flink use case and AirBnB on their Spark Streaming and HBase data load system. In releases, it was a busy week with a few AWS re:Invent announcements, Dask and Apache Samza hitting version 1.0, and the first major release of FoundationDB since it was open sourced. With two weeks' worth of articles to sort through, it ended up being a pretty large issue (even after I cut several good posts).

Sponsor

Introducing the first AI / Machine Learning course with a job guarantee. Springboard's new AI / Machine Learning Career Track is an intensive program that will equip you to transition into a role as a machine learning engineer.

6 months, fully online, at your own pace course
400+ hours of coursework + deeply technical projects
1:1 mentoring from AI experts at companies like Google and Facebook

With support from their career services team, they are confident that you will find a job within six months of completing the course. But if you don’t, they'll refund your tuition.

Learn more today: http://bit.ly/springboard-mle

Technical

This presentation has a fantastic overview of Change Data Capture with MySQL. It motivates the use case, describes how Debezium streams changes to Kafka based on the MySQL binlog, and describes how to query the data that it creates. There's also a discussion of building the same for Apache Cassandra, which doesn't have a single binlog to plug into due to its replication model.

https://qconsf.com/system/files/presentation-slides/whys_and_hows_of_database_streaming_final.pdf
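To make the change data capture idea concrete, here is a minimal sketch of replaying change events into a table. It assumes a simplified version of the Debezium event envelope (an "op" code of c/r/u/d plus "before" and "after" row images); real events also carry schema and source metadata, and the "id" primary key, the sample rows, and the apply_change_events helper are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change_events(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    """Replay change events in order and return the resulting table state."""
    table: Dict[Any, Row] = {}
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # create, snapshot read, or update: "after" holds the new row image
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # delete: only "before" is populated
            table.pop(event["before"][key], None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return table


# Hypothetical sample events, in the order they would appear in the topic.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]
state = apply_change_events(events)
print(state)
```

Because each event carries the full row image, replaying the topic from the beginning is enough to rebuild the table, which is what makes the stream queryable downstream.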

The dataArtisans blog has a post on the four key steps to making an Apache Flink deployment production ready.

https://data-artisans.com/blog/4-steps-flink-application-production-ready

Criteo writes about how they use Apache Flink for real time analytics of data in their multi-data center Kafka cluster. There are some interesting details about how they use watermarks for event time as data is replicated/re-partitioned across clusters.

https://medium.com/criteo-labs/criteo-streaming-flink-31816c08da50
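As a rough illustration of why watermarks get tricky when data arrives from several clusters, here is a minimal sketch of one common approach: track the highest event time seen per source and take the minimum across sources. This is not Criteo's implementation and not the Flink API; the WatermarkTracker class, the source names, and the fixed out-of-orderness bound are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional


class WatermarkTracker:
    """Tracks event-time progress across several sources (hypothetical helper)."""

    def __init__(self, sources: Iterable[str], max_out_of_orderness_ms: int) -> None:
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        # None means the source has not produced any event yet.
        self._max_ts: Dict[str, Optional[int]] = {s: None for s in sources}

    def observe(self, source: str, event_time_ms: int) -> None:
        """Record an event timestamp for one source; late events never move time back."""
        prev = self._max_ts[source]
        if prev is None or event_time_ms > prev:
            self._max_ts[source] = event_time_ms

    def watermark(self) -> Optional[int]:
        """Return the combined watermark, or None until every source has reported."""
        seen = list(self._max_ts.values())
        if any(ts is None for ts in seen):
            return None
        return min(ts for ts in seen if ts is not None) - self.max_out_of_orderness_ms


tracker = WatermarkTracker(["dc1", "dc2"], max_out_of_orderness_ms=1000)
tracker.observe("dc1", 5000)
first = tracker.watermark()   # dc2 has not reported yet
tracker.observe("dc2", 3000)
second = tracker.watermark()  # held back by the slower source
print(first, second)
```

Taking the minimum means the slowest replicated stream holds back event time for everyone, which is the trade-off to keep in mind when one data center lags.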

For real-time event data, AirBnB uses Spark Streaming to send events from Apache Kafka to Apache HBase for deduplication. Typically, the parallelism of Kafka consumers is limited by the number of partitions in a topic, which doesn't scale in the way that AirBnB needs. Taking advantage of the fact that their application doesn't require processing events in order, they have built a new Spark-Kafka connector that can dynamically scale by splitting data inputs within a partition by offset ranges.

https://medium.com/airbnb-engineering/scaling-spark-streaming-for-logging-event-ingestion-4a03141d135d
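The idea of splitting a partition's input by offset ranges can be sketched in a few lines. This is only an illustration of the technique under simple assumptions, not AirBnB's connector: the OffsetRange type, the split_offset_range helper, the "events" topic name, and the max_events cap are hypothetical.

```python
from typing import List, NamedTuple


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    start: int  # inclusive
    end: int    # exclusive


def split_offset_range(r: OffsetRange, max_events: int) -> List[OffsetRange]:
    """Split one partition's offset range into chunks of at most max_events offsets."""
    if max_events <= 0:
        raise ValueError("max_events must be positive")
    return [
        OffsetRange(r.topic, r.partition, s, min(s + max_events, r.end))
        for s in range(r.start, r.end, max_events)
    ]


# A backlog of 10 offsets in one partition becomes three independent tasks.
chunks = split_offset_range(OffsetRange("events", 0, 0, 10), max_events=4)
for chunk in chunks:
    print(chunk)
```

Each chunk can be handed to a separate task, so the number of parallel readers follows the backlog size rather than the partition count. That only works because ordering within a partition is not required.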

This post on the Luigi workflow engine covers how the team at ZOLA Electric deploys Luigi via Docker (and configures the visualizer and task history), replicates data from RDS to Redshift, represents dependencies between SQL scripts via annotations, integrates with Slack for alerting, and more.

https://tech.offgrid-electric.com/luigi-data-pipelines-setup-7b9200727156

YugaByte DB is an interesting new database to watch—it supports the Cassandra, Redis, and Postgres (beta) APIs. In this post, they demonstrate the differences between a NoSQL and SQL database with the Cassandra and Postgres interfaces as an example. For extra comparison, they also do the same with MongoDB.

https://blog.yugabyte.com/data-modeling-basics-postgresql-vs-cassandra-vs-mongodb/

The Confluent blog describes the high level tasks needed to design, build, and test a multi-datacenter Apache Kafka deployment.

https://www.confluent.io/blog/3-ways-prepare-disaster-recovery-multi-datacenter-apache-kafka-deployments

The recent alpha release of Apache Hadoop Ozone, the blob store for Hadoop, adds S3 API compatibility in version 0.3.0-alpha. This post shows how to get started with it.

https://hortonworks.com/blog/an-s3-gateway-to-apache-hadoop-ozone/

Pinterest has written about their Kafka deployment on AWS. They describe their architecture and configuration: replicating between clusters in separate regions, deploying brokers within a cluster across availability zones, what types of instances they use (mostly d2.2xlarge's) and the config they use for those instances, and more.

https://medium.com/pinterest-engineering/how-pinterest-runs-kafka-at-scale-ff9c6f735be

This post describes the benefits of keeping track of SQL queries in git and creating a query library to share them across the members of a team.

https://caitlinhudon.com/2018/11/28/git-sql-together/

Twitter writes about why they're moving to Apache Kafka from their home-grown (based on Apache BookKeeper) system. The main reasons include: better latency and cost savings since Kafka uses the same servers for storage and compute, the Kafka community, and improvements in hardware (specifically faster network and cheaper SSDs).

https://blog.twitter.com/engineering/en_us/topics/insights/2018/twitters-kafka-adoption-story.html

Sonra has a good overview of the differences between AWS Athena and Snowflake. It covers things like performance (concurrency, caching, query optimizer), security, data loading, file type support, and more.

https://sonra.io/2018/11/27/comparing-snowflake-cloud-data-warehouse-to-aws-athena-query-service/

A good list of extensions, add-ons, and tools for PostgreSQL. These include PipelineDB for continuous SQL queries, Citus for high availability, Cstore for columnar storage based on ORC, and PostgREST for putting a RESTful API directly atop an existing database.

https://medium.com/datadriveninvestor/postgres-not-your-grandfathers-rdbms-6e3616e6c7ec

This article describes how to configure Debezium to build a change data capture pipeline atop MongoDB.

https://medium.com/@hpgrahsl/optimizing-read-access-to-sharded-mongodb-collections-utilizing-apache-kafka-connect-cdcd8ec6228

Databricks has an overview of the new Apache Avro features from Apache Spark 2.4, including native support via to_avro and from_avro, performance improvements, and support for Avro logical types (like decimal).

https://databricks.com/blog/2018/11/30/apache-avro-as-a-built-in-data-source-in-apache-spark-2-4.html

AWS re:Invent was this week, and there was a good presentation on the new features of Amazon EMR. The slides cover options for running a metastore (including Glue and a standalone RDS instance), Jupyter notebooks on EMR, cluster autoscaling, and more.

https://www.slideshare.net/AmazonWebServices/a-deep-dive-into-whats-new-with-amazon-emr-ant340r1-aws-reinvent-2018

TLDR Newsletter

Since before I started this newsletter, I've consumed a lot of news via weekly newsletters. I recently discovered TLDR, which is sent out daily. It's a high-quality list of concise summaries and links to stories about what's happened in tech in the past 24 hours. If you feel like you're missing out on news from sites like Twitter, Hacker News, and Reddit (or if you spend too much time checking them each day!), then the TLDR newsletter is a fantastic way to keep on top of things.

https://www.tldrnewsletter.com/

News

The team at Pachyderm raised a round of Series A funding, and Datanami has a good post on how the platform (which is meant as an alternative to Hadoop) differentiates itself. For instance, Pachyderm embraces Docker and Kubernetes and its filesystem includes version history for data and provenance.

https://www.datanami.com/2018/11/20/inside-pachyderm-a-containerized-alternative-to-hadoop/

HP Enterprise is acquiring BlueData, makers of the EPIC software for deploying big data services via Docker containers.

https://www.bluedata.com/blog/2018/11/hpe-and-bluedata-joining-forces-in-ai-ml-big-data/

Dremio's work to add LLVM execution to Apache Arrow (known as the Gandiva Initiative) has been contributed back to the Apache Arrow project.

https://www.datanami.com/2018/11/28/dremio-donates-fast-analytics-compiler-to-apache-foundation/

Amazon announced a public preview of Amazon Managed Streaming for Kafka (MSK). This post is one of the first reviews of the service, and it gives you a good idea of its features (like the version of Kafka) and pieces that are missing.

https://medium.com/@stephane.maarek/an-honest-review-of-aws-managed-apache-kafka-amazon-msk-94b1ff9459d8

Releases

Databricks Runtime 5.0, which is based on Spark 2.4, is out. There's also a new version of the streaming source for Apache Kafka and Azure Blob Storage file notification.

https://databricks.com/blog/2018/11/18/announcing-databricks-runtime-5-0.html

Esper 8.0.0 was released. Its Streaming SQL engine, which includes Complex Event Processing, now includes a SQL-to-JVM byte code compiler.

http://www.espertech.com/2018/11/21/compilingstreamingsqltobytecode/

Apache Kafka 2.1.0 was released. It adds Java 11 support and zstandard compression, and includes lots of improvements to the streaming API, admin script, and DNS handling. In total, there are 28 KIPs and 179 JIRAs in the release.

https://lists.apache.org/thread.html/973794a73323ca53dd5a71005e8765b780ffd621db63b6fb77ef1878@%3Cannounce.apache.org%3E

Qubole has announced release 54, which includes new versions of several open source projects (e.g. Apache Spark and Apache Airflow) and several new features for Microsoft Azure.

https://www.qubole.com/blog/release-54/

Apache Airflow 1.10.1-incubating has been released. New features include the ability to delete a DAG from the web UI, integration with AWS SageMaker, and operators for Google Cloud Functions.

https://github.com/apache/incubator-airflow/releases/tag/1.10.1

Version 1.0 of Apache Samza was announced. The LinkedIn blog has a post describing Samza's major features, along with what's new in version 1.0 (such as improved testability and the ability to run Samza outside of YARN). It also talks about some plans for the future.

https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale

Amazon Kinesis Data Analytics for Java is a new library for building streaming applications. It's based on Apache Flink.

https://aws.amazon.com/blogs/aws/new-amazon-kinesis-data-analytics-for-java/

Version 1.0.0 of Dask, the parallel computing and task scheduling system for Python, has been released.

https://blog.dask.org/2018/11/29/version-1.0

Apache Flink 1.7.0 adds Scala 2.12 support, exactly-once S3 StreamingFileSink, several new features for streaming SQL, the Kafka 2.0 connector, and much more.

https://flink.apache.org/news/2018/11/30/release-1.7.0.html

FoundationDB, the distributed database that was open sourced by Apple (after they acquired the company behind it), announced version 6.0.15. It's the first major release since open sourcing, and the major new feature is multi-region support with seamless failover. On the heels of the announcement, the FoundationDB team also announced a new "Document Layer" that provides an API that's compatible with the MongoDB protocol.

https://www.foundationdb.org/blog/foundationdb-6-0-15-released/
https://www.foundationdb.org/blog/announcing-document-layer/

Events

Curated by Datadog ( http://www.datadog.com )

California

Vespa + Bullet (Sunnyvale) - Wednesday, December 5
https://www.meetup.com/hadoop/events/256792568/

Centralized Metadata + Build Apps w/o Pipelines + Docker & Kubernetes (Sunnyvale) - Wednesday, December 5
https://www.meetup.com/BigDataApps/events/256358337/

Demystifying AWS Analytics + Building Streaming Analytics Using Kafka & Druid (Santa Clara) - Thursday, December 6
https://www.meetup.com/datariders/events/256026290/

Texas

Advanced Analytics with Apache Spark (Katy) - Saturday, December 8
https://www.meetup.com/Katy-Data-Analytics-Machine-Learning/events/256044143/

Missouri

Data Ingestion at Scale (Saint Louis) - Wednesday, December 5
https://www.meetup.com/St-Louis-Big-Data-IDEA/events/254448349/

Pennsylvania

PGH's 1st Kafka Meetup: Fotios Filiacouris from Confluent (Coraopolis) - Thursday, December 6
https://www.meetup.com/DICKS-Sporting-Goods-Tech-Talk-Thursdays/events/256776787/

Massachusetts

December Presentation Night (Cambridge) - Tuesday, December 4
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/255479516/

Learn How Voya Financial Uses StreamSets (Cambridge) - Thursday, December 6
https://www.meetup.com/Boston-StreamSets-User-Group-Meetup/events/255400639/

BRAZIL

Apache Kafka: From Messaging to a Streaming Data Platform (Sao Paulo) - Friday, December 7
https://www.meetup.com/Sao-Paulo-Apache-Kafka-Meetup-by-Confluent/events/256438308/

UNITED KINGDOM

Data Matters with Jess Anderson (London) - Wednesday, December 5
https://www.meetup.com/skillsmatter/events/256173898/

NETHERLANDS

Streams, Financial Alerts & Self-Service on Apache Kafka (Utrecht) - Thursday, December 6
https://www.meetup.com/Kafka-Meetup-Utrecht/events/256384621/

APIs, Q&A, and Lessons Learned after Using Databricks @ Dynniq (Amsterdam) - Thursday, December 6
https://www.meetup.com/Amsterdam-Spark/events/256625974/

POLAND

SysOps/DevOps Katowice MeetUp #3 (Katowice) - Wednesday, December 5
https://www.meetup.com/SysOpsPolska/events/256566122/

Data Meetup (Warszawa) - Wednesday, December 5
https://www.meetup.com/Data-Boardgames/events/256405987/

SERBIA

Let's Talk about Apache Kafka and What's New in Java (Nis) - Thursday, December 6
https://www.meetup.com/NisJUG/events/256596099/

BOSNIA AND HERZEGOVINA

Apache Kafka: Central Data Pipeline (Sarajevo) - Wednesday, December 5
https://www.meetup.com/Sarajevo-Lets-talk-about-IT/events/256566813/

AUSTRALIA

December Sydney Data Engineering Meetup (Surry Hills) - Wednesday, December 5
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/255633946/

The Art and Science of Capacity Planning with Gwen Shapira (Melbourne) - Wednesday, December 5
https://www.meetup.com/KafkaMelbourne/events/256539202/

Software Engineering Sydney December Meetup (Sydney) - Thursday, December 6
https://www.meetup.com/Software-Engineering-Sydney/events/256741971/