Data Eng Weekly

Hadoop Weekly Issue #195

04 December 2016

After a short week due to Thanksgiving in the US, this week's issue has a bit more content than usual. Highlights include a peek inside the data platform at Spotify, the curriculum of an intro to distributed systems course, the announcement of two Kafka Summits for 2017, and a new serverless SQL engine on Amazon Web Services based on Presto.


This presentation describes how and why Spotify has moved their data platform to the Google Cloud, replacing Hadoop/Hive with BigQuery, Kafka with Cloud Pub/Sub, and Storm/MapReduce with Dataflow.

This tutorial shows how to use Apache Ambari and Apache NiFi to configure data ingestion via the light-weight MiNiFi process.

Stripe uses Jupyter Notebooks to make analysis reproducible across environments and teammates. This post describes how the setup works and some of the git-based tooling under the hood to implement notebook sharing.

Amazon EMR has supported Apache HBase for some time. Recently, though, it added support for running HBase with storage (HFiles) on Amazon S3. Decoupling storage and compute has a number of advantages for cost, elasticity, and more. This post describes FINRA's transition to running HBase with S3.

The data Artisans blog has a post debunking common stream processing myths, including "Latency and Throughput: Choose One," "Micro-batching means better throughput," and "Exactly once? Completely impossible."

Apache NiFi's UI-based configuration makes getting started really easy, but it can feel like a limitation when it comes to versioning flows across environments. To address that and other issues, NiFi supports templates, configurations, and an expression language. There are more details about how to take advantage of these features in a post on the Hortonworks blog.

This presentation has the ambitious goal of making the "case for stateful stream processing as a general framework for building data-centric systems." There are some clever insights in the post, such as the fact that stateful stream processing "is about creating materialised views." In addition to that topic, there's a thorough overview of Apache Kafka.

The MapR blog has a tutorial on running Spark's k-means clustering from Apache Zeppelin to cluster Uber customer trip data.

This post describes how to set up Amazon EMR with Apache Ranger for role-based access control. Ranger uses the Amazon Directory Service for user and group information, and it enforces authorization and records audit logs in S3.

Kyle Kingsbury, who is the author of the Jepsen software library and blog series describing the safety of various distributed systems, teaches an introduction to distributed systems course. The class curriculum is on GitHub. The content covers lots of ground, like why to use TCP, ACID isolation levels, consensus, backpressure, various production concerns, and more.


The Ampool blog has been posting about "Emerging Data Architectures." The latest post introduces the primitives of the "butterfly architecture," which is an alternative to the Kappa and Lambda architectures.

Confluent has announced two Kafka Summits for 2017. The first takes place in New York City in May and the second takes place in San Francisco in August. The call for papers for NYC ends on January 16th.


Trapezium is an open-source Spark/Akka-based framework for batch and streaming from Verizon.

Amazon Web Services has announced an S3-based data lake solution for their cloud platform. This post describes how to ingest and analyze data in an AWS data lake, which is built on Cognito, API Gateway, Lambda, DynamoDB, Elasticsearch, and more.

Qubole has announced support for heterogeneous clusters in its AWS big-data-as-a-service platform.

Splice Machine has released version 2.5 of their RDBMS built on Hadoop and Spark. The release includes support for Columnar External Tables, in-memory caching, sketch-based statistics, and cost-optimized storage.

AWS has announced Amazon Athena, which is a serverless service for querying data in Amazon S3. Athena is based on Presto, and supports columnar storage formats like Apache Parquet. This post has a walkthrough for getting started with Athena.

StreamSets Data Collector adds support for Azure Data Lake Store, Google Bigtable, Salesforce, change data capture for MySQL, and Kudu. There's also a new event framework, support for running Spark jobs during a pipeline, and more.

Version 1.9.0 of Apache Drill, the SQL engine for Hadoop and more, was released. It includes a new asynchronous Parquet reader, Parquet filter pushdown, and more.

Apache Kylin, which is an OLAP engine for Hadoop and other big data systems, released version 1.6.0 this week. The release improves support for Apache Kafka, adds support for Hive's beeline, and includes dozens of bug fixes and improvements.

Apache ORC 1.2.2 was released with new support for LZO and LZ4 compression, a new Java tool module, the ability to evolve schemas based on field name, and more.

Version 2.4.0 of Luigi, the big data workflow engine, has been released. It includes, among other things, improvements to Luigi's BigQuery integration.


Curated by Datadog



Get Started with Spark & Hive on the AWS Cloud (Santa Clara) - Wednesday, December 7

Meetup @ SpliceMachine (San Francisco) - Thursday, December 8


Scalable Data Science in R and Spark Streaming (Bellevue) - Wednesday, December 7

R+ at Scale, Google & Apache Beam (Seattle) - Wednesday, December 7


Confluent Platform: Imagine Streaming Data Made Easy (Tempe) - Wednesday, December 7


Introduction to Apache Spark with Databricks (Houston) - Thursday, December 8


Hadoop in the Cloud (Saint Louis) - Wednesday, December 7


Winter Scala @ Home Depot (Atlanta) - Wednesday, December 7

Building Event Data Pipelines with Kafka and Hadoop (Atlanta) - Thursday, December 8

New York

Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi (Bethpage) - Monday, December 5

Performance Improvements to Spark 2.0: 20x Sql Speedup Techniques (New York) - Wednesday, December 7

Streaming Analytics in a Flash, Presented by Cask Data (New York) - Thursday, December 8


Connecting All Things with Apache Kafka (Vancouver) - Thursday, December 8


Meetup #1 (Villeneuve d’Ascq) - Thursday, December 8


Making the Elephant Fly: Using Hadoop in the Azure Cloud (Amsterdam) - Tuesday, December 6

Kafka Streaming (Utrecht) - Thursday, December 8


"Big Data with Big Hearts" Lightning Talks (Warsaw) - Monday, December 5

Data Ingestion (Lodz) - Tuesday, December 6


Introduction to Hadoop and Spark (Kochi) - Saturday, December 10

Interactive Data Analysis with Spark Streaming (Bangalore) - Saturday, December 10


PyData SG @ Strata + Hadoop World 2016! (Singapore) - Tuesday, December 6

ODPi Meetup at Strata Singapore (Singapore) - Wednesday, December 7

BigDataSG at Strata/Hadoop Singapore (Singapore) - Thursday, December 8