Data Eng Weekly

Hadoop Weekly Issue #226

31 July 2017

More on data science than usual in this issue (some great articles!) as well as interesting posts on Flink, Kafka, and more. Also, Apache Fluo is now an Apache top-level project, Qubole has a new service for getting data from Kafka to Amazon S3, and Red Data Tools is a new project for bringing data libraries to the Ruby ecosystem using Apache Arrow.


This post gives a walkthrough of how to configure the Cloudera Data Science Workbench to use TensorFlow with Keras to build deep learning models using GPUs. There are many more details about how it works, including an example of predicting successive words.

VariantSpark RF is a library for speeding up random forests in Spark. It was built for bioinformatics use cases by the team at Australia's CSIRO, so the post includes a few genomics examples.

A good introduction to Flink's Complex Event Processing (CEP) APIs (there are both slides and a video). There are illustrated examples of CEP in Flink as well as a description of how Flink implements the state processing with Nondeterministic Finite Automata.
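To make the automaton idea concrete, here is a minimal sketch in plain Python (not Flink's API or its actual implementation). It assumes a pattern is an ordered list of stages and that unrelated events may occur between stages; the names Stage and match_pattern, and the temperature events, are hypothetical and chosen only for illustration. The nondeterminism shows up as branching: for each event, a partial match is both kept as-is and, if the event fits the next stage, advanced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Event = Dict[str, int]


@dataclass(frozen=True)
class Stage:
    """One state of the pattern: a name plus the condition to enter it."""
    name: str
    accepts: Callable[[Event], bool]


def match_pattern(stages: List[Stage], events: List[Event]) -> List[List[Event]]:
    """Return every complete match, each as one event per stage, in order."""
    partial: List[Tuple[Event, ...]] = []  # active runs of the automaton
    complete: List[List[Event]] = []
    for event in events:
        next_partial: List[Tuple[Event, ...]] = []
        for run in partial:
            # Branch 1: ignore the event and keep waiting.
            next_partial.append(run)
            # Branch 2: take the event if it satisfies the next stage.
            if stages[len(run)].accepts(event):
                advanced = run + (event,)
                if len(advanced) == len(stages):
                    complete.append(list(advanced))
                else:
                    next_partial.append(advanced)
        # Any event that satisfies the first stage starts a new run.
        if stages[0].accepts(event):
            if len(stages) == 1:
                complete.append([event])
            else:
                next_partial.append((event,))
        partial = next_partial
    return complete


# Hypothetical pattern: a hot reading followed later by another hot reading.
hot = lambda e: e["temp"] > 100
pattern = [Stage("first", hot), Stage("second", hot)]
readings = [
    {"t": 1, "temp": 90},
    {"t": 2, "temp": 105},
    {"t": 3, "temp": 50},
    {"t": 4, "temp": 110},
]
matches = match_pattern(pattern, readings)
print([[e["t"] for e in m] for m in matches])
```

Because runs are never discarded, memory grows with the stream; a real engine bounds this with time windows and state cleanup, which this sketch leaves out.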

Apache Kafka's new support for exactly-once semantics and transactions enables some interesting new use cases. The latest post in a series on using Kafka to enable event-based services looks at how these new features can simplify event-based systems. The built-in failure and retry handling provides a new level of abstraction that lets developers focus on the core business logic of the application.

Anaconda is a popular mechanism for setting up a Python data environment. This post describes how to build an Amazon EMR cluster with Anaconda and run a PySpark job using Oozie.

Data Artisans has written about the "benchmarks" they think are important, including fault tolerance, friendliness of APIs, support for SQL, and flexibility of deployment options.


Apache Fluo, the system for providing incremental processing of data sets stored in Apache Accumulo, has been promoted to a top-level project in the Apache Software Foundation. Fluo is based on Google's Percolator, which is used for keeping the Google Search Index up to date.

Red Data Tools is a new project that aims to bring data processing tools to the Ruby ecosystem. The tools are built using Apache Arrow.


Qubole has announced a private beta of StreamX, a new managed service for copying data out of an Apache Kafka cluster for persisting in Amazon S3.

Apache Ignite 2.1 was released. The major feature of this new version of the in-memory distributed data grid is persistence via a new Persistent Store that provides in-memory speeds but maintains durability on disk.


Curated by Datadog



AI for Physically Embodied Systems + Intro to Apache Quickstep (Mountain View) - Tuesday, August 1


Operationalizing Data Pipelines with StreamSets (Murray) - Thursday, August 3


Spark Talks: Overview of Apache Spark (Austin) - Wednesday, August 2


Building a Spark Application from Start to Finish (Saint Louis) - Wednesday, August 2


Big Data Solutions in Azure (Chicago) - Tuesday, August 1


August Presentation Night (Cambridge) - Thursday, August 3


Mesos, Vamp, Marathon, SMACK, and Beyond (Hamburg) - Tuesday, August 1


Automating Movement of Data with AWS Data Pipeline (Mumbai) - Friday, August 4


Architecting a Recommendation Engine with Scala & Spark (Singapore) - Thursday, August 3

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit