Data Eng Weekly

Hadoop Weekly Issue #238

29 October 2017

Great technical articles on testing at Hortonworks and Cask, Spark and SparkSQL, AWS Glue, Flink at ING, and service discovery in distributed systems. There are also a bunch of relevant releases, especially for folks using Python: new versions of Pandas and the Anaconda distribution are out.


Hortonworks has written about LogAI, their tool for analyzing logs produced by runs of the HDP test suite. The system uses frequency, co-occurrence, and other correlation models to highlight errors, stack traces, and other items of interest. There's a web UI for exploring the interesting parts of the logs.

The Cask blog describes how they provision clusters for testing the Cask Data Application Platform across various Hadoop distributions. Using Coopr, their cloud provisioning tool, and Atlassian Bamboo for CI, they run tests on an average of 25 new clusters (provisioned on Google Cloud Platform) each day.

The StreamSets blog has a good overview of the motivations behind the Confluent Schema Registry for storing Avro schema versions. Using the StreamSets Data Collector, it walks through how schema-aware producers and consumers serialize and deserialize data.
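The key to schema-aware producers and consumers is how each Kafka message points back at its schema. In the standard Confluent wire format, every message starts with a magic byte (0) and the registry's 4-byte big-endian schema ID, followed by the Avro-encoded payload. A minimal sketch of that framing in plain Python (the Avro encoding itself is elided; the payload is passed through as opaque bytes):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format magic byte

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the Confluent wire-format header: magic byte 0,
    then the registry schema ID as a 4-byte big-endian int."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, avro_payload).
    A consumer uses schema_id to fetch the writer's schema from the
    registry before decoding the Avro payload."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte: %d" % magic)
    return schema_id, message[5:]

# Round trip: the 5-byte header carries everything a consumer needs
# to look up the writer schema.
schema_id, payload = unframe(frame(42, b"\x06foo"))
```

Because the schema ID travels with every message, a consumer can correctly decode data written under any registered schema version, which is what makes schema evolution workable.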

This tutorial describes how to use Spark to read data from a CSV file, convert it to a well-defined schema (in this case a Scala case class), and query the data using SparkSQL. There's also sample code to store the data in MapR-DB and read it back out.
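The core pattern, independent of Spark, is parsing untyped CSV rows into a typed record and then querying the typed collection. A plain-Python analogue (a dataclass standing in for the Scala case class; the field names and data here are invented for illustration):

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Trade:  # stands in for the tutorial's Scala case class
    symbol: str
    price: float
    qty: int

raw = io.StringIO("symbol,price,qty\nAAPL,150.25,100\nGOOG,980.10,5\n")

# Parse each CSV row into the well-defined schema, converting types
# explicitly rather than leaving everything as strings.
trades = [Trade(r["symbol"], float(r["price"]), int(r["qty"]))
          for r in csv.DictReader(raw)]

# The SparkSQL query in the tutorial becomes an ordinary filter here.
big = [t.symbol for t in trades if t.price * t.qty > 10_000]
```

In Spark the same idea shows up as `spark.read.csv(...).as[Trade]` followed by a SQL query; the typed schema is what lets the engine catch malformed rows and optimize the query.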

StreamSets Data Collector can act as a nearly drop-in replacement for Apache Sqoop. This post describes how to migrate as well as some of the considerations.

The AWS big data blog has a tutorial for loading data into S3 using AWS Glue. Once there, the data can be queried using Amazon Athena.

The more I read and hear about the Pony language, the more intrigued I am. This post from the authors of the Wallaroo stream processing framework talks about why they chose it for their system. Among the reasons are an actor model with a strong type system, a garbage collector with no stop-the-world pauses (heaps are per-actor), and overall performance. They estimate that the language's features have saved them a year and a half of development.

This post describes how ING is using Apache Flink for their risk engine. In an interesting twist, they use Apache Spark, KNIME, and Apache Zeppelin for training models in batch but Flink for the real-time components. Trained models are sent as PMML via Kafka to update the running Flink application, which lets them deploy a new ML algorithm with zero downtime. There are lots more details in the post, along with a link to a video of the presentation on which the post was based.
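The zero-downtime trick is treating model updates as just another stream the job consumes: data events are scored with the current model, and when an update arrives on the control stream the model reference is swapped in place. A minimal sketch of that pattern in plain Python (not the Flink API; the `Model` class and event shapes are hypothetical):

```python
# Sketch of the pattern ING describes: score events with the current
# model, and atomically swap the model whenever an update arrives on
# a control stream (PMML over Kafka, in their case).

class Model:
    def __init__(self, version, threshold):
        self.version = version
        self.threshold = threshold

    def score(self, amount):
        return "flag" if amount > self.threshold else "ok"

class RiskEngine:
    def __init__(self, model):
        self.model = model  # swapped in place, so no restart is needed

    def on_model_update(self, update):
        # In Flink this would arrive via a connected/broadcast stream.
        self.model = Model(update["version"], update["threshold"])

    def on_event(self, event):
        return (self.model.version, self.model.score(event["amount"]))

engine = RiskEngine(Model(version=1, threshold=1000))
first = engine.on_event({"amount": 1500})   # scored by model v1
engine.on_model_update({"version": 2, "threshold": 2000})
second = engine.on_event({"amount": 1500})  # scored by model v2
```

In Flink terms, connecting the data stream with a broadcast model stream gives every parallel instance the new model without stopping the job, which is what makes the deployment instantaneous.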

Conflict-free Replicated Data Types (CRDTs) are one of many mechanisms for service discovery in a distributed system. This post compares several of the common mechanisms: DNS, ZooKeeper, etcd, Consul, and CRDTs. For an Akka-based architecture, there are a number of benefits to CRDTs, which the article covers.
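The defining property of a CRDT is a merge operation that is commutative, associative, and idempotent, so replicas converge no matter how or when they exchange state. The classic small example (not from the article) is the grow-only counter, where each node increments its own slot and merge is an element-wise max:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge = element-wise max."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node_id, n=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, idempotent: replicas converge no
        # matter in what order (or how many times) states are exchanged.
        merged = dict(self.counts)
        for node, count in other.counts.items():
            merged[node] = max(merged.get(node, 0), count)
        return GCounter(merged)

a, b = GCounter(), GCounter()
a.increment("node-a", 3)
b.increment("node-b", 2)
converged = a.merge(b).value()
```

For service discovery, the same idea applies to a replicated set of live service endpoints: nodes gossip their local view and merges converge without any coordinator, which is why this fits Akka's peer-to-peer cluster model well.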


MapR has announced a new product called Data Science Refinery. It's based on Apache Zeppelin and provides an integrated solution for notebook-based development. It's not yet shipping, but the website has details on what it will support.

Databricks Delta is a new system that provides ACID semantics with support for both streaming and batch processing systems. It's built on Apache Parquet and Amazon S3, and it supports automatic data indexing. Delta is in technical preview with GA targeted for next year.


TileDB is a new database, currently in beta, designed to support dense and sparse multi-dimensional arrays. It's integrated with HDFS, with other backends and features under development.

Spark 2.1.2 was recently released. There are over 100 bug fixes and improvements in the new version.

Anaconda is a popular distribution of Python packages for folks doing analytics and data engineering. This week, version 5.0 of the Anaconda distribution was announced.

Version 0.6.1 of Debezium, which is a tool for implementing change data capture from a relational database, has been released. It supports capturing changes from MySQL, PostgreSQL, and MongoDB and publishing to Apache Kafka using Kafka Connect.
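Debezium publishes each row change as an event whose envelope carries the operation type ("c"/"u"/"d" for create/update/delete) plus "before" and "after" row images, which is enough for a downstream consumer to replay changes. A sketch of applying one such event to an in-memory table (the envelope fields follow Debezium's documented shape, but this is a trimmed-down event and the record contents are invented):

```python
import json

# A trimmed-down Debezium-style change event.
event = json.loads("""
{
  "payload": {
    "op": "u",
    "before": {"id": 7, "email": "old@example.com"},
    "after":  {"id": 7, "email": "new@example.com"}
  }
}
""")

def apply_change(table, payload):
    """Apply one change event to an in-memory 'table' keyed by id."""
    op = payload["op"]
    if op in ("c", "u"):     # create/update: upsert the "after" image
        row = payload["after"]
        table[row["id"]] = row
    elif op == "d":          # delete: drop the row by primary key
        table.pop(payload["before"]["id"], None)
    return table

users = {7: {"id": 7, "email": "old@example.com"}}
users = apply_change(users, event["payload"])
```

Real Debezium events also carry a "source" block (binlog position, transaction ID, etc.) so consumers can deduplicate and resume, but the before/after/op triple is the core of the contract.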

Version 0.21.0 of Pandas is out. Among the highlights is new support for reading Apache Parquet files.


Curated by Datadog



Learn about Stream & Batch Processing with Apache Beam (San Francisco) - Wednesday, November 1


Building In-Stream Processing Engines for Real-Time Analytics with Apache Tools (San Antonio) - Thursday, November 2


Kafka Use Case: Monsanto (Saint Louis) - Wednesday, November 1


Introduction and Roadmap for Apache Arrow (New York) - Wednesday, November 1


Real-Time Analysis of Uber Data Using Apache APIs: Kafka, Spark, and HBase (Burlington) - Thursday, November 2


Impala, IA, Spark, TickSmith, Thales, Pizza and More! (Montreal) - Tuesday, October 31


Using Events, Streams, and Kafka to Support More Autonomy and Innovation (Stockholm) - Wednesday, November 1


MesosCon Recap with Flink, TensorFlow, and Vamp (Berlin) - Wednesday, November 1

Graph Processing with Apache Flink (Hamburg) - Thursday, November 2


Inaugural Data Engineering Meetup (Sydney) - Tuesday, October 31