Data Eng Weekly

Data Eng Weekly Issue #296

06 January 2019

Another short issue this week, but some good variety of content covering Spark, Redshift, Druid, MapR DB + Apache Drill, and more. In news, the Cloudera+Hortonworks merger has closed, and in releases, there are a couple of new open source projects to check out.


This post gives an example of using the builder design pattern for defining Spark Schemas. The descriptive nature of the builder provides some nice readability improvements over the verbose definition using Spark's StructType.
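As a rough illustration of the idea (not the code from the linked post), here is a minimal, dependency-free sketch of a fluent schema builder. The `SchemaBuilder` class and its method names are hypothetical. Instead of constructing Spark objects directly, it emits a plain dict in the JSON layout that Spark uses to serialize a StructType, so it runs without Spark installed.

```python
import json


class SchemaBuilder:
    """Hypothetical fluent builder; not part of Spark or the linked post."""

    def __init__(self):
        self._fields = []

    def _add(self, name, type_name, nullable):
        # Each entry mirrors one field in Spark's JSON schema layout.
        self._fields.append(
            {"name": name, "type": type_name, "nullable": nullable, "metadata": {}}
        )
        return self  # returning self is what allows the calls to be chained

    def string(self, name, nullable=True):
        return self._add(name, "string", nullable)

    def long(self, name, nullable=True):
        return self._add(name, "long", nullable)

    def timestamp(self, name, nullable=True):
        return self._add(name, "timestamp", nullable)

    def build(self):
        # Copy the list so later builder calls do not mutate a built schema.
        return {"type": "struct", "fields": list(self._fields)}


# Example column names are made up for illustration.
schema = (
    SchemaBuilder()
    .long("user_id", nullable=False)
    .string("email")
    .timestamp("created_at")
    .build()
)
print(json.dumps(schema, indent=2))
```

With PySpark available, a dict in this shape should be loadable through `StructType.fromJson(schema)`; that step is left out here so the sketch stays self-contained.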

The Amazon Payments team has written about their data warehouse, which is built on Amazon Redshift. They have three clusters for staging data, production ETL with tight SLAs, and user/BI queries. They share a lot of details about the architecture of the system (e.g. how they shuffle data between systems, tiered data storage using S3) and the size of the compute and storage powering their cluster (there are visualizations of the size distribution of their tables and query load), along with some of the best practices that they've implemented.

An example of using PySpark to analyze httpd access logs and visualize the results with matplotlib.

This tutorial shows how to use the Kafka Connect Datagen connector to generate test data from an Apache Avro schema definition. There are several schemas (e.g. users, clickstream, and purchases) bundled with the tool.

This walkthrough covers rebuilding the Druid hadoop-index utility to ingest data from Amazon S3 using the S3A file system.

The MapR blog has a post describing how the combination of MapR DB and Apache Drill efficiently executes SQL queries by taking advantage of secondary indexes. The post describes the Drill query planner and provides several examples.

This post describes several strategies for optimizing Apache Spark based on a presentation by Facebook at Spark + AI Summit. Spark components covered include the driver, executor, and the external shuffle service.

This tutorial describes a strategy for building up a complex SQL query by starting with something simple and checking your results before adding more complexity.
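A small sketch of that workflow, using Python's built-in sqlite3 module so it runs anywhere. The `orders` table and its rows are made up for illustration and are not taken from the tutorial. Each step adds one clause and checks the result against something already verified before moving on.

```python
import sqlite3

# Hypothetical sample data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT,
        amount REAL,
        status TEXT
    );
    INSERT INTO orders (customer, amount, status) VALUES
        ('alice', 30.0, 'paid'),
        ('alice', 20.0, 'refunded'),
        ('bob',   50.0, 'paid'),
        ('bob',   25.0, 'paid'),
        ('carol', 10.0, 'paid');
    """
)

# Step 1: start with the simplest possible query and confirm the row count.
step1 = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Step 2: add the filter and check how many rows survive it.
step2 = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'paid'"
).fetchone()[0]

# Step 3: add the aggregation on top of the filter that was just checked.
step3 = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

# Sanity check: per-customer totals must add up to the overall paid total.
paid_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
).fetchone()[0]
assert sum(total for _, total in step3) == paid_total

# Step 4: only now add the final layer (HAVING plus the desired ordering).
step4 = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer
    HAVING SUM(amount) > 20
    ORDER BY total DESC
    """
).fetchall()

print(step1, step2, step3, step4)
```

If the sanity check in step 3 fails, the problem is isolated to the clause that was just added, which is the main benefit of building the query up in stages.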

This article describes how Buffer loads data from MongoDB into BigQuery in real time. Since a MongoDB collection doesn't have an explicit schema, there's a bit of discussion of how they've designed the pipeline to add to and update the BigQuery schema.

A good overview of the monitoring tools available for Google Cloud Bigtable, including for identifying hot nodes and visualizing usage patterns by key.

This post describes how one organization uses Apache Hive with Amazon Redshift—doing some munging in Hive and performing aggregations in Amazon Redshift. It's a process that they call ETLT.

Pretty neat visualization of the Raft distributed consensus algorithm in the browser. The code is open source, and this blog post walks through it.


The Cloudera and Hortonworks merger closed earlier this week. ZDNet has a good recap of the history and the road ahead for the new combined company. There were also posts directly from the new joint team on the Cloudera and Hortonworks blogs.

A great recap of how the role of the Data Engineer has changed thanks to new tools, and where data engineering is now adding value at a startup. Examples include monitoring jobs, tuning table schemas, and other maintenance tasks. The article has lots of good ideas based on the well-informed perspective of the CEO of a company that works on data engineering problems with other startups.


Version 0.6.1 of Wallaroo has been released. New features include a windowing API for count-based and range-based windows, aggregations, and more.

Apache Drill 1.15.0 was released, with new index support, security improvements, and more.

Apache HBase version 2.0.4 was announced. It resolves a critical issue and includes over two dozen additional bug fixes and improvements over the 2.0.3 release.

Furnace is a new stream processing system built on Serverless/FaaS, which aims to be cloud agnostic. The first release is built to support AWS and Node.js.

Kafkawize is an open source web management platform for Apache Kafka. It provides tools for requesting new topics and ACL changes, and a queue for a team to review those requests (and apply them). Kafkawize uses Apache Cassandra for persistence.


Curated by Datadog


What's New in Apache Hive 3.1? (Milwaukee) - Thursday, January 10


Spark Scala (Annapolis Junction) - Thursday, January 10


Running a Cluster in the Cloud and ATM Fraud Detection with Kafka and KSQL (Leeds) - Thursday, January 10


Apache Flink Meetup Munich @ Wayra (Munich) - Wednesday, January 9

Building Microservices with Kafka Streams: Beyond Kafka-For-Messaging (Munich) - Wednesday, January 9

Streams, Tables, and Time in KSQL (Berlin) - Thursday, January 10

Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL (Frankfurt) - Thursday, January 10


Kafka Streams and Scala: Developing Stream Processing Applications (Warsaw) - Tuesday, January 8


Data Engineering Demystified + Using AI to Optimize BigQuery (Tel Aviv-Yafo) - Wednesday, January 9


Building Streaming Data Pipelines with Kafka, Elastic, and KSQL (Bangalore) - Wednesday, January 9