Data Eng Weekly

Hadoop Weekly Issue #235

09 October 2017

Short and sweet issue this week with coverage of Apache Apex, a new serverless ETL library from Nextdoor, Kafka Streams at Pinterest, a Jepsen post on Hazelcast, and more.


Since its announcement, Apache Kudu has promised to power a number of interesting use cases. This article looks at one of those: using it with the Apache Apex stream processing system. The presentation covers the high-level architecture of both projects, how Apex supports effectively-once processing, and more.

Nextdoor has written about how they're using AWS Lambda for serverless ETL. Originally, it powered two use cases: streaming logs into Elasticsearch and writing data to S3. For these and other use cases, they've built and open-sourced a new tool called Bender. Bender is written in Java and is driven by JSON and YAML config files.

This post describes how a unified analytics platform, such as Databricks, can power multiple use cases and developer personas. Using the Amazon public product ratings dataset, it shows how an analyst can build reports and a data scientist can build machine learning prediction algorithms using the notebook features. It also describes using the Databricks dbml-local library for serving model data and using Spark for stream processing. There are a few Databricks-specific parts to the post (such as the notebook API and scheduling), but in general the post is broadly applicable.

Kinesis Analytics can run SQL queries over streams if the data is in a suitable format with a clear schema. Some data, such as Apache HTTPD access logs and certain compressed data files, isn't in that format and needs to be translated. This post shows how to use AWS Lambda to do that transformation, with an example in Node.js.
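To make the idea concrete, here is a minimal sketch of such a preprocessing function. It is written in Python rather than the Node.js used in the post, and it is not taken from the post. It assumes the record contract in which each incoming record carries a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and re-encoded `data`. The regular expression and the names `LOG_PATTERN` and `transform_record` are illustrative assumptions, and the pattern only handles the Apache common log format.

```python
import base64
import json
import re

# Assumed input: Apache common log format
#   host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def transform_record(record):
    """Turn one base64-encoded log line into a base64-encoded JSON document."""
    raw = base64.b64decode(record["data"]).decode("utf-8").strip()
    match = LOG_PATTERN.match(raw)
    if match is None:
        # Unparseable line: report failure and hand back the original payload.
        return {
            "recordId": record["recordId"],
            "result": "ProcessingFailed",
            "data": record["data"],
        }

    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no response body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])

    payload = json.dumps(fields).encode("utf-8")
    return {
        "recordId": record["recordId"],
        "result": "Ok",
        "data": base64.b64encode(payload).decode("ascii"),
    }


def lambda_handler(event, context):
    """Lambda entry point: transform every record in the incoming batch."""
    return {"records": [transform_record(r) for r in event["records"]]}
```

Each output record reuses the `recordId` of its input, so the service can match results to the original records. Lines that do not parse are flagged rather than silently dropped.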

The Pinterest Engineering blog has a post on how they're using Kafka Streams for real-time predictive stream processing. First, the post describes the problem: overdelivery of online ads. Next, it walks through how they use Kafka Streams to measure in-flight spend, which provides a better heuristic for deciding whether an ad should be shown to end users as part of an ad inventory system. The post also includes a few tricks they employed to increase the throughput and efficiency of their application.

Another great post in the Jepsen series looks at Hazelcast. It tests the behavior of AtomicLong, distributed maps, ID generators, and other data types. Due to design flaws in the system, these types all have correctness issues in the face of network partitions.

A presentation from the LA Spark User Group covers everything SparkML: how to use the various libraries, the SparklingML library (which adds new ML algorithms), and how to build your own estimator.


Pivotal has a post celebrating the graduation of Apache MADlib to a top-level ASF project. The article covers the origins and journey of the in-database analytics library.


Version 4.1 of HUE was released. It includes over 250 bug fixes.

Mocked Streams is a library for testing Apache Kafka Streams. This week, version 1.4.0 was released.

Apache Flume 1.8.0 was released. It includes a number of bug fixes as well as several minor new features and improvements.

Version 1.4.0 of Apache NiFi is out. Per the release notes, it improves stability and adds new features, including support for Apache Knox, LDAP-based authorization, and several new processors.

Luigi 2.7.1 was released with updates for Redshift, BigQuery, ECS, and more.


Curated by Datadog



Autopiloting Real-Time Stream Processing (San Ramon) - Wednesday, October 11

Introduction to Hadoop and MapReduce! (Sunnyvale) - Sunday, October 15


Kafka for Java Developers & Java Puzzlers (Pittsburgh) - Thursday, October 12

New York

Streaming Data at Scale (New York) - Wednesday, October 11


Open Blueprint for In-Stream Processing (Boston) - Thursday, October 12

IRELAND

Apache Kafka with Confluent's Tim Berglund (Dublin) - Thursday, October 12


Kafka Streams API (Hinxton) - Tuesday, October 10

Processing Streaming Data with KSQL & Full-Text Search and Machine Learning (London) - Tuesday, October 10


Inaugurational Meetup! (Porto) - Thursday, October 12


Hadoop v3 and NiFi (Villeurbanne) - Wednesday, October 11


October Kafka Meetup (Utrecht) - Monday, October 9


Migrating Towards Stream Processing and Microservices (Berlin) - Wednesday, October 11


The Elastic Stack in the Logistics Industry and as an Alternative to Hadoop (Lausanne) - Wednesday, October 11


Sparkathon: Developing Extensions for Spark Structured Streaming in Scala (Warsaw) - Tuesday, October 10


Spark & Flink Meetup 5 (Hangzhou) - Saturday, October 14


Google Cloud Fundamentals: Big Data & Machine Learning (Brisbane) - Tuesday, October 10


Deep Dive with Azure Data Factory (Wellington) - Thursday, October 12