Data Eng Weekly

Data Eng Weekly Issue #257

25 March 2018

This week's issue is quite large, with great technical content covering BigGraphite (a Cassandra backend for Graphite), Apache Spark Streaming, Postgres on Kubernetes, KSQL, and a DynamoDB/API Gateway solution for serving ML data. In releases, there are a bunch, including Apache Trafodion, Apache Kudu, Apache Drill, and Apache HAWQ. There is hopefully something for everyone, but if you're looking for even more, then check out @DataEngWeekly on Twitter throughout the week.


Cut HBase storage needs by 50%; elevate job throughput by 260%. ScaleFlux Computational Storage cohesively optimizes your Hadoop HBase infrastructure. Significant performance leaps on Spark, RocksDB, and Aerospike as well. Now shipping - see more:


Last week, I included the wrong link to the GO-JEK article—here's the right one. In addition to that post on their data infrastructure (built on Apache Kafka, Apache Flink, Kubernetes, Chef, and InfluxDB), there's another post from their blog. This second article has a great overview of some resilience strategies—timeouts, idempotency, and the circuit breaker pattern—in distributed systems.
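As a rough illustration of the circuit breaker pattern mentioned above, here is a minimal sketch in Python. It is not taken from the GO-JEK post; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative, not from the article)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hitting a dependency that is known to be down.
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder, and treat `CircuitOpenError` as a signal to fall back or retry later. Timeouts and idempotency keys would be layered on separately.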

The OpenMessaging Project is a new initiative to 1) provide unifying APIs across messaging systems and 2) provide standard tools for benchmarking those systems. It is able to generate synthetic workloads that are tunable on several axes and comes with tools to provision Apache Kafka and Apache Pulsar on AWS.

Criteo has open sourced a new Apache Cassandra backend for Graphite called BigGraphite. With it, they store over 1 million metrics per second, and they take advantage of Cassandra's secondary indexes to support the different types of Graphite queries. This presentation goes into further detail on the new backend and also links out to the design document and code on GitHub.

Apache Spark 2.3.0 adds experimental support for continuous mode in stream processing. Compared to microbatch latency of 100s of milliseconds, continuous mode uses long-running processes that can take a record end to end in 10s of milliseconds.

The Pivotal blog has a post from Crunchy Data, makers of an open source tool for running Postgres on Kubernetes. The article provides a brief introduction to the tool, a tutorial for getting a DB up and running, and a demo of scaling up the number of read replicas.

A tutorial on the Confluent Blog covers loading data into Kafka using Kafka Connect for MySQL (via Debezium for change data capture) and from CSV files, querying and joining those two streams, and writing the results out to Amazon S3 using the Kafka Connect S3 sink.

This article enumerates several of the mechanisms that can be used to access data in Apache Kudu (such as Impala and Spark), and it provides some performance comparisons to HDFS using the TPC-H synthetic dataset. While it's always important to test with your own data set, the experiment shows that Impala query performance is similar for Parquet data in HDFS and for Kudu (with some exceptions). Kudu, additionally, has the benefit of efficiently supporting CRUD and random access via a primary key.

Great overview of Impala best practices, such as using Parquet, optimizing file size (in this case, they saw a 2,000x speedup by combining small files), configuring separate coordinators and executors (a new feature as of Impala 2.9), and computing statistics. The post also describes a strategy for automatically detecting and canceling long-running queries.

Azure Event Hubs is a managed service for storing data for streaming applications (with similarities to Apache Kafka or Google Cloud PubSub). This post describes using Event Hubs with Apache Spark (via Azure Databricks) to stream data from the Twitter API into Event Hubs and then to consume the tweet data from Event Hubs.

This post has five tips for working with data. If you haven't yet learned these the hard way, then they're good advice.

Rue La La is using AWS DynamoDB and API Gateway for serving data generated by their data science team via Spark jobs. This post provides an overview and links out to two other posts: the first on how to store data in DynamoDB to efficiently answer API queries (including over a 20B element sparse matrix), and the second on using AWS Batch and Lambda to load data from S3 to DynamoDB (there are some interesting details on long-running S3 reads and efficiently loading data).
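To give a feel for why a sparse matrix fits a key-value store, here is a small Python sketch. It is an assumption-laden illustration rather than Rue La La's actual layout: a plain dict stands in for a DynamoDB table, and the `row_id` and `cols` attribute names are invented for the example. Only non-zero entries are stored, grouped into one item per row, so a lookup is a single key fetch.

```python
from collections import defaultdict


def to_row_items(entries):
    """Group (row, col, value) triples into one item per row, dropping zeros.

    The returned dict stands in for a key-value table keyed by row id
    (a hypothetical layout, not the one described in the linked posts).
    """
    rows = defaultdict(dict)
    for row, col, value in entries:
        if value != 0:  # zeros are implied by absence, which keeps the table sparse
            rows[row][col] = value
    return {row: {"row_id": row, "cols": cols} for row, cols in rows.items()}


def lookup(table, row, col, default=0.0):
    """Fetch one cell: a single key lookup, then a map access inside the item."""
    item = table.get(row)
    if item is None:
        return default
    return item["cols"].get(col, default)
```

With a real table, `table.get(row)` would become a get-item call on the row key, and very wide rows would probably need to be split across several items.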

This article describes how one org does non-destructive schema changes using Rails' ActiveRecord migration utilities. They've codified a number of these best practices, to the point where exceptions are raised if you try to drop a table or column.
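The idea of raising an exception on destructive changes can be sketched outside of Rails, too. The Python below is a hypothetical guard over raw SQL statements, not the article's ActiveRecord code; the `check_migration` function, the error class, and the regular expression are all assumptions for illustration, and the pattern only covers the two simplest destructive forms.

```python
import re


class DestructiveMigrationError(Exception):
    """Raised when a migration contains a statement that would drop data."""


# Matches "DROP TABLE ..." and "ALTER TABLE <name> DROP COLUMN ..." (simple forms only).
DESTRUCTIVE = re.compile(
    r"^\s*(DROP\s+TABLE|ALTER\s+TABLE\s+\S+\s+DROP\s+COLUMN)\b",
    re.IGNORECASE,
)


def check_migration(statements):
    """Return the statements unchanged, or raise if any of them is destructive."""
    statements = list(statements)
    for sql in statements:
        if DESTRUCTIVE.match(sql):
            raise DestructiveMigrationError(
                f"destructive statement blocked: {sql.strip()}"
            )
    return statements
```

A migration runner could call `check_migration` before executing anything, so that dropping a table or column requires a deliberate, separately reviewed step.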


The Data Engineering Podcast had its 23rd episode this week—it covers Elasticsearch and the Elastic stack.

Apache Airflow has gained so much popularity that there's now a podcast from the folks at Astronomer. This episode describes Airflow best practices.

This post defines several job areas related to data—Data Analytics, Data Science, Machine Learning Engineering, Data Engineering, Backend Engineering, and Devops. It also describes how they overlap with a 6-circle Venn diagram. Whether or not these are the exact definitions you use at your company, it's useful to have these clear definitions to help identify gaps or blind spots.

PhoenixCon, the conference on Apache Phoenix, has announced that they're holding the next instance in San Jose on June 18th. The CFP is open through April 16th.


Astronomer recently announced Astronomer Cloud Edition—a managed Apache Airflow service.

Spring Cloud Data Flow 1.4 is out. The new version is mostly a UI/UX refresh, which includes a new builder for configuring streaming applications. There are also some security feature enhancements in the release.

Kafka Streams Viz is a new tool for visualizing streaming topologies. It's a single page webapp that takes a plain-text definition of a streaming topology and outputs a user-friendly diagram.

The first of four SQL/data query engine Apache releases this week, Apache Trafodion 2.2.0 is also the inaugural release since the project graduated from the Apache incubator. This is a bug fix release with a few enhancements.

Apache Drill 1.13.0 is out with YARN support for Drill, support for JDK 8, improvements to security and memory usage, and more.

Apache HAWQ was announced. This was a bug fix and feature enhancement release that adds support for HDFS transparent data encryption and improved Ranger support (Kerberos authentication and RPS HA).

Apache Kudu 1.7.0 was released. New features include support for the decimal data type, improvements to self-healing through re-replication, and a new READ_YOUR_WRITES scan mode. There are a bunch of improvements to the Java client, performance during inserts and updates, and more. See the release notes for full details.

The 0.3.0 release of Banzai Pipeline, the PaaS system for deploying Spark, Kafka, and others on Kubernetes, is out. This version adds support for Google Kubernetes Engine, strengthens security, and includes several bug fixes and improvements.

Databricks and Microsoft announced that Azure Databricks is now generally available.


Interested in sponsoring? See: for details and email


Curated by Datadog



WePay Engineering Meetup: Managing Data at Scale (Redwood City) - Wednesday, March 28

Bay Area Apache Spark & Women in Big Data @ Databricks HQ (San Francisco) - Thursday, March 29


An Introduction & Dive into Apache Kafka and the Power of Messaging Systems (Phoenix) - Thursday, March 29


Stealing the Jewels from Python: Putting PySpark Code into Production (Chicago) - Monday, March 26


Event Sourcing from Scratch with Apache Kafka and Spring (Pittsburgh) - Tuesday, March 27


March Presentation Night: Spark 2.3, Cloud, and Use Cases (Cambridge) - Wednesday, March 28


Reactive Applications with the SMACK Stack, Led by Duncan DeVore! (Kanata) - Thursday, March 29


Mesosphere and Confluent at Our First Kafka Meetup (Manchester) - Wednesday, March 28


Apache Kafka March Meetup (Paris) - Monday, March 26


IoT Stream Processing and Analytics in Real-time (Frankfurt) - Wednesday, March 28


Let's Talk about KSQL for Kafka (Zurich) - Tuesday, March 27


Get Started with Data Integration Using Apache NiFi (Vienna) - Tuesday, March 27


Big Data: How to Deal with Sensorics Data Easily (Prague) - Tuesday, March 27


Big Data on Kubernetes (Budapest) - Tuesday, March 27