Data Eng Weekly


Data Eng Weekly Issue #304

10 March 2019

This issue covers two weeks' worth of content, so there are a lot of great articles. Topics covered include Apache Flink, Presto, FaunaDB, and Kafka. In news, there's a new public roadmap for Apache Flink, and an article about the continued strength of the Data Engineering profession.

Technical

The Apache Flink blog has a post describing Flink's monitoring internals, job- and JVM-specific key metrics to monitor, alert conditions for those metrics, and what might trigger the alert conditions.

https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html

Teads writes about how they use Amazon Redshift to power internal analytics systems with relatively low latency (<500ms in most cases). They have a custom-built Analytics Service that pulls data from BigQuery, performs some enrichment, and publishes new data marts to Redshift. The post describes how they decided to go with Redshift for this use case over two other managed services (Bigtable and DynamoDB), and how they optimize Redshift for query latency and concurrency.

https://medium.com/teads-engineering/give-meaning-to-100-billion-events-a-day-part-ii-how-we-use-and-abuse-redshift-to-serve-our-data-bc23d2ed3e07

The Spring Framework has an integration with Apache Kafka that provides some good abstractions to eliminate boilerplate in a Java application. This post provides an overview of these features and how to handle errors, map records to listeners based on header values, and more.

https://www.confluent.io/blog/spring-for-apache-kafka-deep-dive-part-1-error-handling-message-conversion-transaction-support

Prefect shows how the primitives of their workflow engine provide a lot of features with a small amount of effort. Their post, which shows how to use Prefect to implement a standup bot, covers features of the upcoming system like its functional API, configuration (and first-class support for secrets), and execution logic.

https://medium.com/the-prefect-blog/prefect-a-first-look-e7f003277a9c

If you've worked with big data long enough, you've probably run into a slow query caused by data skew. This post describes how to improve performance when joining skewed datasets by precomputing a bin id, which is added to the join constraints. There's a new Python package that implements the algorithm for PySpark.

https://medium.com/@itzikjan/spark-join-optimization-on-skew-data-using-bin-packing-afae73f68662
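
To make the bin id idea from the skew join item above concrete, here is a minimal sketch in plain Python rather than PySpark. It is not the linked package's implementation: the function names (assign_bins, join_with_bins) are hypothetical, and the greedy "largest key into the lightest bin" heuristic is an assumption chosen for illustration.

```python
from collections import Counter


def assign_bins(key_counts, num_bins):
    """Greedy bin packing (illustrative heuristic, not the linked package's code).

    Keys are placed largest-first into whichever bin currently holds the
    fewest rows, so no single bin ends up with all of the heavy keys.
    """
    loads = [0] * num_bins
    bin_of = {}
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = min(range(num_bins), key=lambda b: loads[b])
        bin_of[key] = target
        loads[target] += count
    return bin_of, loads


def join_with_bins(left, right, bin_of):
    """Inner join on (bin_id, key); rows are (key, value) tuples."""
    right_index = {}
    for key, value in right:
        right_index.setdefault((bin_of.get(key), key), []).append(value)
    return [
        (key, left_value, right_value)
        for key, left_value in left
        for right_value in right_index.get((bin_of.get(key), key), [])
    ]


# Hypothetical data: key "a" is heavily skewed.
left = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 5), ("c", 6)]
right = [("a", "x"), ("b", "y"), ("c", "z")]
bin_of, loads = assign_bins(Counter(key for key, _ in left), num_bins=2)
joined = join_with_bins(left, right, bin_of)
```

Because the bin id is part of the join condition, each bin can be processed as an independent, similarly sized unit of work. In a distributed engine that is where the benefit would come from.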

This article describes how serverless/functions-as-a-service (FaaS) fit with an event stream architecture, and the types of stream processing applications that work well with FaaS.

https://www.confluent.io/blog/journey-to-event-driven-part-3-affinity-between-events-streams-serverless

This tutorial has an interesting solution for enriching and indexing streaming data with the ELK (Elasticsearch/Logstash/Kibana) stack. By using Logstash's JDBC input plugin, they stream results out of Apache Kafka using the Presto query engine. Presto supports lots of backends, so it's easy to join that streaming data to enrich it with data from other sources like a table in MySQL (as is demoed here).

https://medium.com/@ravishankar.nair/ultra-fast-indexing-profiling-and-exploration-of-unified-data-ab8af09e870e

AWS writes about the performance implications of the FileOutputCommitter (which renames files) for Amazon S3 (or other blob stores). They describe some of the performance improvements (which are built with the same strategy as the S3A file system committers) that they've built for Amazon EMRFS and Apache Spark with Apache Parquet. The post includes some benchmarks, which show significant speedups with these optimized committers.

https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/

The Kudu blog has a good post on building out a hybrid storage strategy with Apache Kudu and HDFS. The data can be queried as a unified data source in Apache Impala by creating a view that captures both data sources. The post describes a sliding window strategy for slowly moving data from Kudu to (cheaper) HDFS storage.

https://kudu.apache.org/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html

Jepsen writes about FaunaDB, which implements the Calvin protocol for distributed transactions. If you're not familiar with Calvin, there's a good introduction in the article. There are lots of details that are hard to summarize in a sentence or two, but I find it interesting to see what kinds of tests a distributed database can be taken through and what to learn from them.

https://jepsen.io/analyses/faunadb-2.5.4

Qubole describes the challenges of Spark's Structured Streaming checkpointing to an object store (like Amazon S3), which shares some technical similarities with the above post on output committers. Qubole has an implementation that leverages some improvements in Spark 2.4.0 to avoid rename operations for these checkpoints with a blob store backend.

https://www.qubole.com/blog/structured-streaming-with-direct-write-checkpointing/

This post introduces a pattern for validating/evolving schemas of data when loading into BigQuery. There are several code samples and examples in the post.

https://medium.com/@bravnic/dataflow-dealing-with-bigquery-schema-change-64936b44ef3

Qubole has an overview of resource groups in Presto, including an introduction to soft & hard limits and the key configuration parameters. The post also includes examples and scenarios.

https://www.qubole.com/blog/configure-leverage-resource-groups-in-presto/

Pravega has a deep dive into the implementation and architecture of the Segment Store, which is the data plane (used for appends, reads, etc.) in their streaming storage system.

http://blog.pravega.io/2019/03/07/segment-store-internals/

ActionIQ writes about how they've implemented auto scaling for Luigi workers to improve throughput of their data pipelines. They use a clever strategy that adds workers based on the number of pending tasks and uses instance protection to ensure that a task completes before instance shutdown.

https://medium.com/actioniq-tech/an-infinite-fleet-of-plumbers-a56611ec10fb
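
The scaling rule described in the ActionIQ item above can be sketched roughly as follows. This is a toy model in plain Python, not ActionIQ's code and not a Luigi or AWS API: the function names, the tasks_per_worker ratio, and the min/max bounds are all hypothetical, and "busy" stands in for a worker whose instance is protected from shutdown.

```python
import math


def desired_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Size the fleet from the backlog, clamped to [min_workers, max_workers]."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


def removable_workers(workers, target):
    """Pick workers that may be shut down when scaling in.

    `workers` maps a worker name to True if it is running a task. Busy
    workers are treated as protected and are never returned, so a running
    task always finishes before its worker goes away.
    """
    excess = max(0, len(workers) - target)
    idle = [name for name, busy in workers.items() if not busy]
    return idle[:excess]
```

For example, with 25 pending tasks and an assumed 10 tasks per worker, desired_workers(25, 10, 1, 5) asks for 3 workers. When scaling in, removable_workers only ever returns idle workers, even if that leaves the fleet above its target for a while.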

Version 4.0 of Apache Cassandra, which isn't yet released, has a new feature called virtual tables. Similar to the proc filesystem in Linux, it exposes system metrics through read-only tables that can be queried the same way as application data stored in Cassandra.

http://thelastpickle.com/blog/2019/03/08/virtual-tables-in-cassandra-4_0.html

Jobs

Senior Data Engineer (Spark), N26, Berlin https://jobs.dataengweekly.com/jobs/c202d274-c3df-4274-9ffb-77a749db5c3f

Software Engineer - Data Platform, Fitbit, San Francisco, CA https://jobs.dataengweekly.com/jobs/ab98f336-41d9-455f-b0a7-f9d5975e5975

News

The newly launched Data Council (formerly DataEngConf) takes place in just over a month in San Francisco (April 17-18th). They are offering subscribers a $200 discount with the code DataEngWeekly200.

https://www.datacouncil.ai/san-francisco

Spark+AI Summit, which takes place in April, has a new data engineering track. This post looks at some of the talks from that track.

https://databricks.com/blog/2019/02/25/a-guide-to-data-engineering-talks-at-spark-ai-summit-2019.html

The Apache Flink project has published a roadmap, which looks at the upcoming improvement proposals and key Jira issues that they're tracking across several different areas.

https://flink.apache.org/roadmap.html

Datanami notes that according to some job board and career site data, demand for Data Engineering continues to be strong. In fact, according to one metric, the rate of Data Engineering job titles is outpacing that of Data Scientist.

https://www.datanami.com/2019/03/05/data-engineering-continues-to-move-the-employment-needle/

Releases

Fluent Kafka Streams Test is a new library from bakdata for testing Apache Kafka Streams applications. It provides convenience functions/glue for inputs, processing, and outputs using a JUnit extension. Their introductory post provides a number of examples for different types of applications.

https://medium.com/bakdata/fluent-kafka-streams-tests-e641785171ec
https://github.com/bakdata/fluent-kafka-streams-tests

Version 1.6.4 of Apache Flink was released this week. The changes mostly fix bugs, but there are also some small improvements.

https://flink.apache.org/news/2019/02/25/release-1.6.4.html

The 0.9.2 release of Debezium, the change data capture tool, has been announced. It includes some fixes for Postgres, MySQL, and SQL Server connectors as well as a few new features and dependency updates.

https://debezium.io/blog/2019/02/25/debezium-0-9-2-final-released/

Apache Daffodil (incubating), which is a tool for converting between legacy binary / fixed width formats and JSON/XML, had its 2.3.0 release. Daffodil implements conversion using the Data Format Description Language, which has specifications for lots of legacy data formats for industries like health care and finance (e.g. point of sale systems).

https://daffodil.apache.org/releases/2.3.0/

Version 1.0 of the KSQL JDBC driver has been released. In this version, all KSQL commands are supported via SQL queries over JDBC.

https://github.com/mmolimar/ksql-jdbc-driver/releases/tag/v1.0

Events

Curated by Datadog ( http://www.datadog.com )

California

Alluxio 2.0 Deep Dive + A Case of Real-Time Processing with Spark (San Mateo) - Thursday, March 14
https://www.meetup.com/Alluxio/events/259107976/

Missouri

How Kafka Has Become the Nervous System of a Modern Data Architecture (Maryland Heights) - Thursday, March 14
https://www.meetup.com/GatewayJUG/events/256983624/

Illinois

Apache Kafka: Optimizing Your Deployment (Chicago) - Thursday, March 14
https://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/258868295/

Ohio

Cleveland Big Data Meetup (Mayfield Village) - Monday, March 11
https://www.meetup.com/Cleveland-Hadoop/events/257103076/

North Carolina

Utilizing Kafka to Create a Streaming ETL Platform (Charlotte) - Tuesday, March 12
https://www.meetup.com/ModernDevsCLT/events/259364077/

New York

Kafka at the New York Times and Datadog (New York) - Tuesday, March 12
https://www.meetup.com/Apache-Kafka-NYC/events/258355241/

How to Work with Kafka in Ruby (New York) - Tuesday, March 12
https://www.meetup.com/NYC-rb/events/254389937/

Massachusetts

Wayfair's Journey with Apache Kafka (Boston) - Tuesday, March 12
https://www.meetup.com/Boston-Apache-kafka-Meetup/events/258989900/

New Hampshire

Cloud Big Data, Data Science, and Machine Learning (Bedford) - Tuesday, March 12
https://www.meetup.com/CloudNH/events/257987353/

UNITED KINGDOM

The Blueprint Series: Principles of Modern Data Architecture (London) - Thursday, March 14
https://www.meetup.com/big-data-ldn/events/259267772/

NETHERLANDS

Discussions Around O16N & Data Ingestion Tooling (Amsterdam) - Thursday, March 14
https://www.meetup.com/Analytics-Data-Science-by-Dataiku-Amsterdam/events/258927356/

GERMANY

Our First Kafka Meetup in Nurnberg! (Nurnberg) - Wednesday, March 13
https://www.meetup.com/Nurnberg-Kafka/events/259392042/

Apache Kafka & the IoT (Frankfurt) - Wednesday, March 13
https://www.meetup.com/IoT-Hessen/events/256740814/

SINGAPORE

Apache Spark Test-Driven Development (Singapore) - Thursday, March 14
https://www.meetup.com/Spark-Singapore/events/259356318/

THAILAND

Evolution of the Data Pipeline in Agoda (Bangkok) - Thursday, March 14
https://www.meetup.com/Bangkok-Kafka/events/259540898/

NEW ZEALAND

Open Banking + Event-Driven Microservices Using Apache Kafka (Auckland) - Wednesday, March 13
https://www.meetup.com/Auckland-API-and-Microservices-Meetup/events/258686352/

Writing Big Data Pipelines: The Apache Beam Project (Wellington) - Thursday, March 14
https://www.meetup.com/Data-Driven-Wellington/events/258636496/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.