Data Eng Weekly

Hadoop Weekly Issue #233

17 September 2017

Lots of great technical content this week, including posts on Kafka, SparkR, and Amazon EMR. And if you're looking for more, the Kafka Summit videos and slides are online. In releases, Kudu, Impala, Kafka, and Storm all have new versions out this week.


While most of the articles I highlight target distributed systems, this one covers some Python tools for training a model and serving results. It focuses on the data engineering aspects of that work, from getting started with a simple example to scaling up in a production environment.

Confluent has written about the testing of Apache Kafka. They highlight that before any code is written, the design is "tested" through an open Kafka Improvement Process. From there, code undergoes unit tests, integration tests in a single process, and system tests that involve performance and correctness testing across multiple instances and with injected faults and more.

This post provides an interesting overview of the types of IO supported by the Linux kernel (vectored IO, memory mapping, and async IO) and the trade-offs between them. While not necessarily useful every day, many of the data systems covered in this newsletter make use of advanced features like mmap and fadvise.
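For a feel of what two of those IO styles look like from user space, here is a minimal Python sketch (not taken from the post). It assumes a Linux or other Unix host, and the helper names (write_vectored, read_vectored, read_mapped, demo) are made up for illustration. Async IO is left out.

```python
import mmap
import os
import tempfile


def write_vectored(fd, chunks):
    """Write several buffers with a single writev() system call."""
    return os.writev(fd, chunks)


def read_vectored(fd, sizes):
    """Read from the start of the file into several buffers with one readv()."""
    buffers = [bytearray(n) for n in sizes]
    os.lseek(fd, 0, os.SEEK_SET)
    os.readv(fd, buffers)
    return [bytes(b) for b in buffers]


def read_mapped(fd, start, end):
    """Map the file into memory and slice it like a bytes object."""
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
        return view[start:end]


def demo():
    fd, path = tempfile.mkstemp()
    try:
        written = write_vectored(fd, [b"header", b"|", b"payload"])
        # Advisory hint only: tell the kernel we expect sequential reads.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        parts = read_vectored(fd, [6, 1, 7])
        mapped = read_mapped(fd, 7, 14)
        return written, parts, mapped
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling demo() writes three buffers in one system call, reads them back into three separate buffers, and then reads the last field again through a memory map. For a file this small neither call should come up short, but real code would need to handle partial reads and writes.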

Good tutorial walking through how to use R's dplyr library to analyze data about bike trips in NYC. For scalability, the application makes use of Amazon EMR and Spark.

Hortonworks has done a performance analysis of Apache HBase and Apache Cassandra using the Yahoo Cloud Serving Benchmark. For the testing, the services are configured to read and write data on AWS' attached SSD storage. Unsurprisingly, HBase was faster for reads and Cassandra performed better on write-heavy workloads.

Confluent and Kafka co-founder Jay Kreps writes about several use-cases in which Apache Kafka is a good choice for your data's source of truth. These include a centralized log of changes, powering in-memory caches for online systems, kappa architecture use-cases, and change data capture. Jay argues that Kafka is the commit log for the datacenter, but at the same time it won't replace traditional databases, since Kafka doesn't plan to support arbitrary queries.
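The in-memory cache use-case comes down to replaying a log of changes in order. The sketch below illustrates that idea in plain Python with no Kafka client involved. ChangeEvent, rebuild_cache, and the sample CHANGE_LOG are hypothetical names invented for this example, and treating a None value as a delete is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    offset: int            # position of the event in the log
    key: str
    value: Optional[str]   # None marks a delete (assumption of this sketch)


def rebuild_cache(log: Iterable[ChangeEvent]) -> Dict[str, str]:
    """Replay events in log order; the latest event for each key wins."""
    cache: Dict[str, str] = {}
    for event in log:
        if event.value is None:
            cache.pop(event.key, None)
        else:
            cache[event.key] = event.value
    return cache


# Hypothetical sample log: two inserts, one update, one delete.
CHANGE_LOG = [
    ChangeEvent(0, "user:1", "alice"),
    ChangeEvent(1, "user:2", "bob"),
    ChangeEvent(2, "user:1", "alice@example.com"),
    ChangeEvent(3, "user:2", None),
]
```

Replaying the full log leaves only the updated user:1 entry, while replaying a prefix reproduces the cache as it stood at that offset. That ability to rebuild state from the log is what makes it usable as a source of truth.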

Amazon EMR now supports building a cluster from a custom-built AMI. This post walks through what's needed to build and use a custom AMI for EMR, as well as some of the benefits.


Spark Summit Europe takes place October 24-26 in Dublin. The schedule is available on the conference website. It features speakers from a number of startups and an even greater number of larger companies.

Slides and videos from the recent Kafka Summit have been posted on the Confluent website. Access is behind an email/phone form.


Qubole has announced general availability of their AIR (Alerts, Insights, Recommendations) service. AIR offers usage-based ranking and context-aware suggestions when writing a query, and enables search across column and table names. It also provides usage reports, statistics, data previews, and actionable recommendations for improving data models.

IBM has released a sandbox for trying out its Big SQL platform, which is based on HDP 2.6.2, using a single docker image.

A new release of Apache Kafka is out, containing a number of bug fixes and minor improvements.

Version 1.5.0 of Apache Kudu has been released. While it's a minor release, there are new features like the ability to tolerate disk failures at startup and improvements to client tools (e.g. exporting CSV files and a tablet move operation). There are also a number of optimizations and bug fixes. The release notes contain some details for anyone considering the upgrade.

Apache Impala (incubating) has released version 2.10.0. The release contains over 250 tickets for new features, improvements, bug fixes, and more.

Apache Storm 1.0.5 was released with seven bug fixes.


Curated by Datadog



BDAM: Rules Engine, Apache Airflow & Exactly-Once Processing with Apache Kafka! (Palo Alto) - Wednesday, September 20

Apache Spark, Apache Flink, and Apache Ignite: Where Fast Data Meets the IoT (San Francisco) - Wednesday, September 20


Spark Structured Streaming: Introduction and Internals (Bellevue) - Wednesday, September 20


Spark ML with Holden Karau + Building a Recommendation Engine with (Chicago) - Thursday, September 21


Intro to Alluxio and Spark (Philadelphia) - Thursday, September 21

New York

Best Practices Building Enterprise Data Infrastructure with WeWork (New York) - Tuesday, September 19


Streaming Data Pipelines and Kafka as a Message Queue (London) - Wednesday, September 20


Workshop: Linking Hadoop with Classic DBMS (Madrid) - Monday, September 18


Our First Kafka Meetup with 2 Amazing Speakers from Confluent (Zurich) - Tuesday, September 19


Back2Spark Meetup: AgileRai and SparkSearch! (Milan) - Wednesday, September 20


Open Source Tools for Big Data (Helsinki) - Tuesday, September 19


Deep Dive into Apache Ranger and Atlas (Budapest) - Tuesday, September 19


September Meetup: Real-Time Transaction Streaming and Big Data in Bioinformatics (Bucharest) - Tuesday, September 19


Riding the Streaming Wave with Kafka (Athens) - Tuesday, September 19


Why Apache Kafka? (Ankara) - Wednesday, September 20


Spark Meetup @ DataWorks Summit (Sydney) - Tuesday, September 19

Apache NiFi and MiNiFi: Edge to Core (Sydney) - Tuesday, September 19