Data Eng Weekly

Hadoop Weekly Issue #240

12 November 2017

This week's issue covers a wide breadth of topics, including KSQL, LinkedIn's data abstraction layer (called Dali), Complex Event Processing with Flink, time series data with Accumulo, and more. There's also coverage of a couple of new tools: Pipewrench and Wobidisco (which is written in OCaml).


IBM has published a new tutorial for using Spark with Elasticsearch to build a recommender system. The code is on GitHub, and the README describes how to get it up and running. It includes an interactive Jupyter notebook for stepping through the process and a list of troubleshooting steps.

Cypher is a declarative graph query language. The Neo4j team has announced alpha support for Cypher with Apache Spark. There are more details and some examples (including with the Zeppelin integration to visualize a graph) in this post.

The Cask blog shows how to use CDAP with several components of the Google Cloud Platform including Google Cloud Storage, Google BigQuery, Google PubSub, and the Google Cloud Speech API.

This post demonstrates a non-trivial streaming program with KSQL. The input is a stream of digital sensor data from a gaming steering wheel, which makes for some interesting challenges (e.g. an event is only generated for the brake when its position changes, so one must calculate how long the brake has been applied). At the end of the processing, data is visualized with a Grafana dashboard pulling data from InfluxDB.
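The brake-duration challenge can be illustrated with a small sketch (plain Python rather than KSQL; the event format and function name here are illustrative, not from the post): since events arrive only on position changes, a press's duration is the gap between a non-zero position event and the next zero event.

```python
def brake_press_durations(events):
    """events: time-ordered (timestamp_ms, position) tuples; position 0 = released."""
    durations = []
    pressed_at = None
    for ts, pos in events:
        if pos > 0 and pressed_at is None:
            pressed_at = ts                    # brake just applied
        elif pos == 0 and pressed_at is not None:
            durations.append(ts - pressed_at)  # brake released
            pressed_at = None
    return durations

events = [(0, 0), (100, 40), (250, 80), (600, 0), (900, 55), (1200, 0)]
print(brake_press_durations(events))  # [500, 300]
```

Note that intermediate position changes (e.g. 40 then 80) don't restart the press; only the transition to zero closes it out.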

Flink supports Complex Event Processing by providing a high-level API to detect patterns of events. The post has an overview of the API and a motivating example from shipment tracking of an online retailer.
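The core CEP idea can be sketched in a few lines (plain Python, not Flink's actual Pattern API; the event shapes and names are made up for illustration): flag any shipment that is not followed by a delivery for the same order within a time window.

```python
from collections import OrderedDict

def late_shipments(events, timeout_ms):
    """events: time-ordered (timestamp_ms, order_id, kind) tuples."""
    pending = OrderedDict()   # order_id -> timestamp of its "shipped" event
    alerts = []
    for ts, order_id, kind in events:
        # expire any shipment whose window elapsed before this event arrived
        for oid, shipped_ts in list(pending.items()):
            if ts - shipped_ts > timeout_ms:
                alerts.append(oid)
                del pending[oid]
        if kind == "shipped":
            pending[order_id] = ts
        elif kind == "delivered":
            pending.pop(order_id, None)
    return alerts

events = [(0, "A", "shipped"), (10, "B", "shipped"),
          (50, "A", "delivered"), (500, "C", "shipped")]
print(late_shipments(events, timeout_ms=100))  # ['B']
```

Flink's API expresses the same thing declaratively (a pattern of events plus a within-window) and handles state, watermarks, and out-of-order events for you.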

Data Access at LinkedIn, or Dali, is an abstraction layer used by both batch and stream processing frameworks. This post gives a brief intro to the framework then does a deep dive into Dali views. Defined as SQL queries, the views are incorporated into a versioned library to be consumed by Hive, Spark SQL, Pig, and other jobs (Dali implements a Record Reader API). This greatly simplifies queries on certain tables (e.g. a view can project away columns or run data through a UDF) and provides a mechanism to change data formats in a backwards-compatible way.
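As a rough illustration of what reading through a view buys consumers (hypothetical names and fields, not Dali's actual API or schema), jobs see a stable projected shape while the view owns the column selection and UDF logic:

```python
def mask_member_id(member_id):
    """Example UDF applied by the view."""
    return "xxxx" + member_id[-4:]

def page_views_v2(raw_rows):
    """A 'view' over raw records: project two columns and apply a UDF,
    so consumers never see internal columns or the raw member id."""
    for row in raw_rows:
        yield {"member": mask_member_id(row["member_id"]),
               "page": row["page_key"]}

raw = [{"member_id": "00123456", "page_key": "home", "internal_flag": 1}]
print(list(page_views_v2(raw)))
# [{'member': 'xxxx3456', 'page': 'home'}]
```

Because the view is versioned and distributed as a library, the raw table's format can change underneath it without breaking the Hive, Spark SQL, or Pig jobs that consume it.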

This post describes how the US National Data Center has proven out a new system built on HDFS, Apache Accumulo, Apache NiFi, and Apache Zeppelin for ingesting time series data. They are currently ingesting nearly 500,000 events/second from 200 sensors.

The Hortonworks blog has a tutorial for setting up a Kerberos-enabled Hadoop and HBase cluster using Ambari.

This post provides a good overview of building a synchronous, transactional system atop of Kafka Streams. The motivating example is a purchasing API in which orders come in via one topic and are validated (i.e. that inventory exists) and reserved (added to a reservation stream) in a single transaction. The post also mentions the notion of using KSQL to implement this pattern for non-JVM languages via the so-called sidecar pattern.
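The validate-and-reserve step can be sketched as follows (plain Python, not the actual Kafka Streams API; the record shapes are invented): the validation result and the inventory decrement are produced together, as they would be inside one Kafka transaction.

```python
def process_order(order, inventory):
    """Returns (validation_event, reservation_event_or_None).
    Mutates inventory only when the order is valid, so the decision
    and the reservation can't diverge."""
    item, qty = order["item"], order["qty"]
    if inventory.get(item, 0) >= qty:
        inventory[item] -= qty              # reserve the stock
        return ({"order": order["id"], "status": "VALIDATED"},
                {"item": item, "reserved": qty})
    return ({"order": order["id"], "status": "REJECTED"}, None)

inventory = {"widget": 5}
print(process_order({"id": 1, "item": "widget", "qty": 3}, inventory))
print(inventory)  # {'widget': 2}
print(process_order({"id": 2, "item": "widget", "qty": 4}, inventory))
```

In the real system, Kafka's transactional producer ensures the validation output and the reservation record are committed atomically or not at all.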

This paper describes tools for Bioinformatics workflow engines written in OCaml. The projects, Ketrew, BioKepi, and Coclobas, are among a suite of open-source tools (collectively referred to as Wobidisco) to interact with YARN and other HPC systems. The libraries are on GitHub.


The Databricks blog has a recap of a few talks from the recent Spark Summit EU. The slides and videos for many of the talks have also been posted (make sure to change the filter at the top to see all of the videos).

Hortonworks beat quarterly earnings estimates and is close to breaking even.

AWS recently announced per-second billing. This post from Databricks talks about why it's a game changer for big data workflows running on AWS: wall clock time can be optimized without incurring extra cost.


Apache Kylin 2.2.0 is a new major release of the OLAP interface to Apache Hadoop. The new version gains ACL management at the table level through Apache Ranger. There are also a number of improvements and bug fixes.

Pipewrench is a new tool from Cargill for automating data pipelines. It's a bit different than other workflow tools, as its main functions are to (1) generate scripts (sqoop scripts, Impala/Kudu DDLs, etc.) based on table definitions and (2) generate and execute bash scripts via Make to build the dependency graph. There is a thorough README in the project's GitHub repo.
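The table-definition-to-script idea looks roughly like this sketch (the config fields and rendered commands are illustrative, not Pipewrench's actual schema or templates): one small table definition drives multiple generated artifacts.

```python
TABLE = {"name": "orders", "source_db": "erp",
         "columns": [("id", "BIGINT"), ("total", "DOUBLE")]}

def sqoop_script(t):
    """Render a sqoop import command for the table."""
    return (f"sqoop import --connect jdbc:mysql://db/{t['source_db']} "
            f"--table {t['name']} --target-dir /raw/{t['name']}")

def impala_ddl(t):
    """Render an Impala CREATE TABLE from the column definitions."""
    cols = ", ".join(f"{n} {ty}" for n, ty in t["columns"])
    return f"CREATE TABLE {t['name']} ({cols}) STORED AS PARQUET;"

print(sqoop_script(TABLE))
print(impala_ddl(TABLE))
```

Keeping the table definition as the single source of truth means adding a column updates the sqoop import, the DDL, and the Make dependency graph together.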


Curated by Datadog



Lambda vs Kappa Architecture and Apache Flink (Salt Lake City) - Tuesday, November 14


Intro to Building a Distributed Pipeline for Real-Time Analysis of Uber (Franklin) - Thursday, November 16


1st Ever SMACK Atlanta Meetup! (Atlanta) - Wednesday, November 15

New York

Turning Hadoop into a High-Performance Enterprise Data Warehouse (New York) - Wednesday, November 15


Sanjay Radia, Original Hadoop Team Member & Hear about Manulife's Hadoop Journey (Toronto) - Monday, November 13

Real-Time Event Communication with Apache Kafka (Waterloo) - Tuesday, November 14

Processing Fast Data with Apache Spark: The Tale of Two APIs (Vancouver) - Thursday, November 16

Ireland

Google BigQuery for Data Warehousing + Raw to Valuable Data Using Spark (Dublin) - Monday, November 13


November Hadoop Users Group (London) - Wednesday, November 15

Kafka for a Banking System and the Polyglot Programmer (London) - Wednesday, November 15


SMACK & Spark (Levallois Perret) - Monday, November 13


Kafka, Stream Processor, and a DB… + Starting Your Engineering Company Is Easy (Dusseldorf) - Thursday, November 16


Workshop: Let's Break Apache Spark (Krakow) - Monday, November 13


First Kafka Meetup: KSQL (Singapore) - Thursday, November 16

Apache Beam and Google Cloud Dataflow (Singapore) - Thursday, November 16


Kafka, KSQL (Sydney) - Monday, November 13