Data Eng Weekly

Data Eng Weekly Issue #302

18 February 2019

A bunch of posts this week on Spark and Kafka, and also some really interesting posts on speeding up database testing, AWS Glue for batch processing, defining scalability, and orchestration vs. choreography for a data pipeline. There are also several must-reads in news: on SQL, the best talks of 2018, and the evolution of data science.


With Docker and related tools, it's quite easy to run integration tests that talk to a database. This post describes how to speed up those tests by starting up your database with fsync disabled or by using a ramdisk. If you're running your DB as a Docker container, it's pretty easy to do either one (see examples in the post).
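
As a rough illustration of the two options (not the commands from the linked post), here is a small Python helper that builds a `docker run` command line. It assumes PostgreSQL and the official `postgres` image, whose default data directory is `/var/lib/postgresql/data`; the function name, image tag, port, and password are all made up for the example.

```python
def postgres_test_container_cmd(image="postgres:11", use_ramdisk=True, disable_fsync=True):
    """Build a `docker run` argument list for a throwaway test database.

    Assumes the official postgres image; names and defaults are illustrative.
    """
    cmd = [
        "docker", "run", "--rm", "-d",
        "-p", "5432:5432",
        "-e", "POSTGRES_PASSWORD=test",  # hypothetical throwaway password
    ]
    if use_ramdisk:
        # --tmpfs mounts an in-memory filesystem over the data directory,
        # so nothing is written to disk and the data vanishes with the container.
        cmd += ["--tmpfs", "/var/lib/postgresql/data"]
    cmd.append(image)
    if disable_fsync:
        # Arguments after the image name are passed to the postgres server.
        # fsync=off is only safe for disposable test data.
        cmd += ["-c", "fsync=off"]
    return cmd


if __name__ == "__main__":
    print(" ".join(postgres_test_container_cmd()))
```

Either option trades durability for speed, which is fine for test data that is recreated on every run.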

Smile is a binary format that is based on JSON, but is much more efficient. This post provides a quick introduction to the format, some example serialization sizes, and a Java code example.

Term Frequency-Inverse Document Frequency (TF-IDF) is the classic algorithm for scoring how well a document matches a set of search terms. It was historically used in a lot of tools, but Elasticsearch and Solr both offer a lot of newer strategies. Of those, this post covers several: Best Matching (BM25), Divergence from Randomness, Divergence from Independence, Information-Based, and Language Models.
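
For readers who haven't seen the classic scoring spelled out, here is a minimal Python sketch. It assumes one common textbook variant (term frequency normalized by document length, and idf = log(N / df)); Elasticsearch and Solr use their own formulas, and the function name and sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)      # how often the term occurs here
            idf = math.log(n_docs / df[term])    # how rare the term is overall
            score += tf * idf
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "kafka streams and kafka connect",
        "spark batch jobs",
        "flink and spark streaming",
    ]
    print(tf_idf_scores(["spark"], docs))
```

A term that is frequent in one document but rare across the collection scores highest, which is the intuition the newer similarity models refine.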

Qubole writes about their dynamic filtering feature for their distribution of Spark, which improves performance of join queries by (semantically) adding additional filter predicates to the query to reduce the amount of data in the join.
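
The general idea can be sketched in a few lines of plain Python. This is only an illustration of the technique on in-memory lists of dicts, not Qubole's implementation; the function name, table shapes, and column names are hypothetical.

```python
def join_with_dynamic_filter(fact_rows, dim_rows, key, dim_predicate):
    """Join two row lists, pruning the large side with keys from the small side.

    Returns the joined rows and the number of fact rows that survived pruning.
    """
    # 1. Apply the user's filter to the small (dimension) side.
    selected_dim = [row for row in dim_rows if dim_predicate(row)]

    # 2. Collect the join keys that survived the filter.
    allowed_keys = {row[key] for row in selected_dim}

    # 3. The "dynamic filter": drop fact rows that cannot match anything,
    #    before they reach the join.
    pruned_fact = [row for row in fact_rows if row[key] in allowed_keys]

    # 4. Ordinary hash join on what is left.
    dim_by_key = {}
    for row in selected_dim:
        dim_by_key.setdefault(row[key], []).append(row)
    joined = [{**f, **d} for f in pruned_fact for d in dim_by_key[f[key]]]
    return joined, len(pruned_fact)


if __name__ == "__main__":
    sales = [
        {"store_id": 1, "amount": 10},
        {"store_id": 2, "amount": 20},
        {"store_id": 1, "amount": 5},
        {"store_id": 3, "amount": 7},
    ]
    stores = [
        {"store_id": 1, "region": "EU"},
        {"store_id": 2, "region": "US"},
        {"store_id": 3, "region": "US"},
    ]
    rows, scanned = join_with_dynamic_filter(
        sales, stores, "store_id", lambda s: s["region"] == "EU"
    )
    print(rows, scanned)
```

In a distributed engine the payoff comes from step 3 happening at scan time, so less data is read and shuffled into the join.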

A two-part comparison of implementing common algorithms (finding primes and matrix multiplication) using MPI (a two-decade-old HPC framework) and Apache Spark. MPI wins in performance, and Spark wins in ease of algorithm implementation.

This post covers the redBus Data Platform, which uses AWS Glue for coordinating batch processing & crawling datasets in S3 and Apache Drill for querying data.

Lots of details of the JDBC connector for Apache Kafka Connect. The article covers things like incremental ingestion, debugging common issues, custom data types, and importing from multiple tables.

This Flink blog post describes the improvements that are being merged into the Flink codebase to improve performance and features for batch processing. By taking advantage of some tradeoffs of batch processing, they can get better performance, fault tolerance semantics, scheduling, and SQL support. The post has lots more details about these changes.

Scalability is sometimes a hand-wavy term that means different things to different people. This article provides a good overview of the types of scalability (size, geographical, and administrative), and it provides several easy-to-follow examples.

An interesting take on orchestration vs. choreography for a data pipeline, and how the tradeoffs are similar to those of a monolith vs. a microservices architecture.

This post describes the fundamentals of an event streaming architecture, including how it compares to reactive event-driven architectures (e.g. actor frameworks), how it relates to microservices, and more. This article is post 2 of a four-part series.

A look at how one company uses the ELK stack to analyze usage of big data systems (both ad hoc and scheduled) to identify resource-intensive queries.

ThousandEyes writes about some lessons learned working with Apache Kafka. Specifically, they cover scenarios in which Kafka topics experienced spiky disk usage and older-than-expected data was showing up in consumers (based on the delete or compact,delete policy set for a particular topic). They describe each of these in detail and suggest some configuration changes/lessons learned.

State Synchronizer is a component of Pravega, the streaming storage system, that provides a shared object to Pravega clients. This post describes how it compares to systems like Apache ZooKeeper, how it's used in Pravega to implement ReaderGroups, and the API/semantics it exposes to clients.

A look at how Apache Pulsar implements tiered storage, which can provide significant cost savings by leveraging an object store.

For Apache Spark applications, it's often important to understand and optimize memory usage. This post looks at the main classes used by the Spark executors to dynamically allocate memory, which are important parts of understanding the overall memory model.


A curated list of great talks (videos and/or slides) from 2018. Lots of topics covered, and there are many on distributed systems, networking, and other relevant topics.

ThoughtWorks describes how data lakes are susceptible to a "build it and they will come" mentality, and they recommend instead using a bottom-up, product/use-case driven approach.

An argument that SQL is the most valuable skill that a developer can have for a few reasons, like how it's valuable across many different roles and disciplines.

Lots of great advice for folks considering a career in data science: how the job has evolved, and how a lot of the skills needed to be successful as a data scientist involve data engineering and software engineering. Great recommendations to share with anyone you might know considering a job in data.


Streams Messaging Manager 1.2 was released with topic lifecycle management features, alerting, and schema registry integration.

Apache HBase 2.1.3 was released. It includes a number of bug fixes and improvements, most notably an upgrade to the Apache Thrift dependency to resolve a security vulnerability.

Apache Beam 2.10.0 has been announced. The release includes updates to lots of dependencies (including Apache Flink 1.6) and new IOs (for Kafka, Hadoop, Mongo, Cassandra, and more). The Beam blog has more details on these features.

PostgreSQL has announced a number of new releases for the 9, 10, and 11 lines. The releases are notable in that PostgreSQL has changed its implementation of fsync() for the first time in many years to resolve potential data consistency issues.

kafka-topic-manager is a new open source project providing a REST service to delete topics from a Kafka cluster. It helps to work around a bug in version 1.1.1 of Kafka by serializing the deletion of topics.


Curated by Datadog


Kafka Is More ACID Than Your Database (Hollywood) - Wednesday, February 20


Why Data-as-a-Service Is the Next-Generation Data Platform (Bellevue) - Wednesday, February 20

Apache Flink with Hive, Tensorflow, Beam, and AthenaX (Seattle) - Thursday, February 21


Big Data Transformation: Moving from Hadoop and Data Streaming to Micro-Batch (Boulder) - Thursday, February 21


Modern Data Warehouses in the Cloud: Use Cases + Live Demo (Plano) - Tuesday, February 19


Event-Driven Architecture w/ Apache Kafka and Spring Cloud Stream (Indianapolis) - Wednesday, February 20


Big Data Architectures and the Data Lake (Linthicum Heights) - Wednesday, February 20


Streaming Data Pipeline with Kafka and Druid (Boston) - Tuesday, February 19

IRELAND

Spark and Kafka for Data Analysis and Machine Learning (Cork) - Tuesday, February 19


Apache Kafka and KSQL in Action! (Bristol) - Tuesday, February 19

Updates on Stateful Stream Processing with Apache Flink 1.7+ (London) - Friday, February 22


Profiling and Caching Spark Applications with Qubole OSS (Barcelona) - Tuesday, February 19


A Tour of the Kafka Environnement (Montpellier) - Wednesday, February 20


Go Meets Messaging Systems (Sofia) - Tuesday, February 19

HONG KONG

Productionizing Apache Spark for ETL (Wan Chai) - Wednesday, February 20


Apache Druid & Big Data Journey at Uber (Taguig) - Wednesday, February 20