Data Eng Weekly

Data Eng Weekly Issue #252

18 February 2018

Lots this week on stream processing, including coverage of the Pravega streams system, exactly-once in Apache Flink, new features in Hortonworks Data Flow, getting started with Spring Cloud Data Flow on Pivotal Cloud Foundry, and building an application with Confluent KSQL. Qubole also has a great post on some optimizations they've made to query performance in Presto. In releases, Apache Oozie, Apache Storm, and Apache Flink all have new versions out this week.


The Pravega project came across my radar for the first time. Open-source and from Dell EMC, it's a distributed system implementing streams with similarities to Kafka and Apache Pulsar. Key differentiators are automatic movement of cold data to HDFS or other tier two storage and auto scaling of segments. That auto scaling functionality is one of the topics of the following post, which also looks at the API for sending to and consuming from Pravega.

This is a good article on how to pick the right tech stack to quickly stand up a data warehouse. AWS provides the plumbing with Kinesis, Redshift, Glue, Lambda, and more. Lots of good tips if you go down a similar route or are using these (or related) technologies in AWS.

Qubole has implemented two optimizations for Presto—join reordering and dynamic filtering. The post describes how these improvements are implemented and how they improve performance in certain situations. The article also details performance results from an analysis with TPC-DS queries (2.8-14x speedup and several more queries run to completion than before). While these optimizations are available in Qubole Presto, they're also working with the community to get them into the main branch.
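For readers new to dynamic filtering, here is a minimal Python sketch of the general idea: the join keys found on the small (build) side at run time are used to prune the large (probe) side while it is being scanned. It is not Qubole's or Presto's implementation, and the names (`scan`, `join_with_dynamic_filter`, the sample rows) are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple

FactRow = Tuple[int, float]  # (store_id, amount), hypothetical large table
DimRow = Tuple[int, str]     # (store_id, city), hypothetical small table


def scan(rows: Iterable[FactRow],
         key_filter: Optional[Set[int]] = None) -> Iterator[FactRow]:
    """Simulated table scan that skips rows failing a pushed-down key filter."""
    for row in rows:
        if key_filter is None or row[0] in key_filter:
            yield row


def join_with_dynamic_filter(facts: Iterable[FactRow],
                             dims: Iterable[DimRow]) -> List[Tuple[int, float, str]]:
    # Build phase: hash the small side by join key.
    build: Dict[int, List[str]] = {}
    for store_id, city in dims:
        build.setdefault(store_id, []).append(city)

    # The dynamic filter is the set of keys seen on the build side,
    # which is only known once the build phase has run.
    dynamic_filter = set(build)

    # Probe phase: the filter is handed to the scan, so non-matching
    # fact rows are dropped before they ever reach the join.
    joined = []
    for store_id, amount in scan(facts, dynamic_filter):
        for city in build[store_id]:
            joined.append((store_id, amount, city))
    return joined


sales = [(1, 9.5), (2, 3.0), (3, 7.25), (1, 1.0)]
stores = [(1, "Paris"), (3, "Berlin")]
result = join_with_dynamic_filter(sales, stores)
```

In this toy the saving is a single skipped row; in a real engine the payoff comes from skipping I/O and network transfer for the filtered rows.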

If you're doing any SQL database programming from Scala, Doobie looks like a useful library for writing your JDBC queries. It lets you write raw SQL queries but adds a bunch of functionality, including converting results to case classes (with type checks) and executing prepared statements. This post has a good overview of how to get started.

Another open-source project that's new to me, ArangoDB, is a NoSQL database that supports different types of data, including graph, key/value, and document storage. This post describes the results of some recent benchmarking against PostgreSQL, MongoDB, Neo4j, and OrientDB. While there are good disclaimers in the post, it's always important to benchmark with your own use cases and data. With that said, for a single-node use case the results are impressive.

If you're using Pivotal Cloud Foundry or Spring, Spring Cloud Data Flow might be a great way to get started with stream processing. This post gives a brief tour of how to get set up (installing the various components with the cf tools) and build a simple log parsing application with the Data Flow shell.

Apache Flink's checkpointing has provided exactly-once semantics within a Flink application for some time now. With the 1.4.0 release, they've also added the ability to ensure exactly-once delivery to an Apache Kafka or Pravega data sink. This post details the two-phase commit implementation that powers the exactly-once delivery.
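As a rough illustration of the two-phase commit idea, here is a toy Python sink in which writes stay invisible until the checkpoint that covers them is confirmed. This is a simplified sketch and not Flink's code: the class and method names are hypothetical, and it ignores recovery of transactions that were pre-committed in a checkpoint that had already completed.

```python
class TransactionalSink:
    """Toy sink: records become visible only after their checkpoint commits."""

    def __init__(self):
        self.committed = []  # records visible to downstream readers
        self.open_txn = []   # records written since the last checkpoint
        self.pending = {}    # checkpoint id -> pre-committed records

    def write(self, record):
        # Writes go into the open transaction, not straight to readers.
        self.open_txn.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: at the checkpoint, seal the open transaction and start a new one.
        self.pending[checkpoint_id] = self.open_txn
        self.open_txn = []

    def commit(self, checkpoint_id):
        # Phase 2: once the checkpoint is confirmed complete, publish its records.
        self.committed.extend(self.pending.pop(checkpoint_id))

    def abort(self):
        # Failure before a checkpoint completes: drop uncommitted writes.
        # The job restarts from the last checkpoint and replays them.
        self.open_txn = []
        self.pending.clear()


sink = TransactionalSink()
sink.write("a")
sink.write("b")
sink.pre_commit(1)
sink.commit(1)

sink.write("c")
sink.abort()      # simulated crash before checkpoint 2
sink.write("c")   # the record is replayed after restart
sink.pre_commit(2)
sink.commit(2)
```

Even though "c" was written twice, readers see it once, because the first attempt was never committed.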

Hortonworks Data Flow (HDF) includes the open-source Streaming Analytics Manager for UI-driven definition of streaming applications. HDF 3.1 added a test mode (with fake data) and the ability to unit test these applications. This post describes how to do both.

Confluent has a post that describes (complete with lots of code examples) consuming change data capture events from an Oracle database, applying a number of transformations and aggregations with KSQL, and storing the resulting data in Elasticsearch for analysis with Kibana. There's lots of great stuff in here, but it does a particularly good job of demonstrating the differences between streams and tables as well as between event and processing time.
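If those two distinctions are new to you, this small Python sketch (not KSQL, and not taken from the post) shows them on made-up data: a stream keeps every event while a table keeps the latest value per key, and counting per minute gives different answers depending on whether you bucket by event time or by arrival (processing) time. All names and values are hypothetical.

```python
from collections import Counter

# Each event: (event_time_s, arrival_time_s, customer, status)
events = [
    (5, 6, "alice", "ordered"),
    (50, 52, "bob", "ordered"),
    (58, 65, "alice", "shipped"),  # happened in minute 0, arrived in minute 1
    (70, 71, "bob", "shipped"),
]

# Stream view: an append-only sequence that keeps every event.
stream = [(customer, status) for _, _, customer, status in events]

# Table view: the latest value per key, updated as events arrive.
table = {}
for customer, status in stream:
    table[customer] = status


def count_per_minute(rows, time_index):
    """Count events per one-minute window using the chosen timestamp column."""
    return dict(Counter(row[time_index] // 60 for row in rows))


by_event_time = count_per_minute(events, 0)       # when the event happened
by_processing_time = count_per_minute(events, 1)  # when the event arrived
```

The late-arriving "shipped" event for alice lands in minute 0 by event time but in minute 1 by processing time, which is why the two counts differ.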


Hey, Data Eng readers: Which big data company’s location tech is embedded in 125K services and apps, from Apple to Uber? The answer—Foursquare. We have a 16 PB cluster that runs 10k jobs a day on Spark, Scalding and Presto. And we’re looking for engineers.


dotScale, a conference on distributed systems and scalability, takes place in Paris from May 31 to June 1. Readers of Data Eng Weekly can get 20% off a ticket with promo code DATAENGWEEKLY.

Datanami predicts that 2018 will be the year of the data engineer (I guess I was right to rename the newsletter!). It notes some relevant stats, such as job postings for data engineers outnumbering those for data scientists by 4-5x.


Hello Fresh: Change the way people eat forever. Work with our data technology to deliver healthy meals to millions of customers, with a cutting-edge tech stack (Hadoop, Kafka, Impala, PySpark, AWS, Airflow) and time for personal and engineering development. Click the link for more info on becoming a Data Engineer at Hello Fresh in Berlin!


Apache Oozie, the workflow engine, disclosed a vulnerability this week. There's a new 4.3.1 release out with a mitigation along with several other bug fixes and minor improvements.

Qubole and Snowflake have a new integration for using Qubole to query data from and write data to Snowflake. It leverages the Snowflake Spark connector and is integrated into the Qubole console.

Version 1.4.1 of Apache Flink, the big data stream processing system, was released. It contains 60 fixes/minor improvements.

Apache Storm 1.2.0 was released with improved Kafka integration and a new metrics system built on Dropwizard Metrics. It also adds a new HBase state backend.

Apache Storm 1.0.6 and 1.1.2 were also released. Both releases include improvements to the Kafka integration and several other bug fixes.


Curated by Datadog



Morning Workshop with Confluent and IBM (San Francisco) - Tuesday, February 20

Monitoring Apache Kafka with Gwen Shapira (Foster City) - Tuesday, February 20

Engineering Real-Time Event-Driven Processing (Palo Alto) - Thursday, February 22

Kubernetes Day: Running Apache Spark, Apache Pulsar & Heron (Santa Clara) - Saturday, February 24


Messaging, Storage, or Both: The Real-Time Story of Pulsar & Apache DistributedLog (Boulder) - Thursday, February 22


Big Data Monthly Meetup (Minnetonka) - Wednesday, February 21


Ebb & Flow: Data in Motion (Milwaukee) - Thursday, February 22


Macy’s Omni-Catalog: A Real-Time Fast Data Story Using Spark & Cassandra (Johns Creek) - Wednesday, February 21

Intro to Building a Distributed Pipeline for Real-Time Analysis of Uber's Data (Atlanta) - Thursday, February 22


Microservices and End-To-End Topologies with Kafka (London) - Wednesday, February 21

Panta Rhei: Designing Applications with Distributed Streams (London) - Wednesday, February 21


Let's Start the Season with the Spark Frameless Library (Barcelona) - Thursday, February 22


From a Little Spark May Burst a Flame (Amsterdam) - Tuesday, February 20

Data Processing Mayhem (Amsterdam) - Thursday, February 22


Discuss EU GDPR and Spark on AWS (Vilnius) - Thursday, February 22


Create Data Pipeline Demo: MySQL-Kafka-Elastic Stack (Istanbul) - Tuesday, February 20


Manila Big Data Tech Meetup (Manila) - Wednesday, February 21


Enterprise Grade Hadoop & High-Performance Applications with HPE and Aerospike (Melbourne) - Wednesday, February 21


Easy and Fast Stream Processing (Christchurch) - Monday, February 19