Data Eng Weekly

Hadoop Weekly Issue #181

31 July 2016

Several milestones were reached this week: Apache Kudu and Apache Twill both graduated from the Apache Incubator, and Apache Spark and Apache Mesos both had major releases (2.0 and 1.0 respectively). In technical posts, there's an overview of Uber's tech stack, a comparison of Apache Kafka and Google Cloud Pub/Sub, a guide for running Apache Kafka in AWS, and more.


Uber has published a detailed overview of their technology stack. The article has several sections on infrastructure (e.g. Kafka, ELK stack, Mesos) and data systems (Cassandra and Riak complement MySQL for online systems, Storm is used for real-time processing, and Spark, Hive, & Hadoop are used for batch).

Google Cloud Pub/Sub is a large-scale messaging service that's part of Google Cloud Platform. This post provides a quick comparison of Pub/Sub and Kafka across the topics of operations (e.g. data retention, backups), application development (e.g. APIs, consumption model), cost, and architecture (e.g. ordering guarantees).

MapR has two whiteboard walkthroughs about streaming systems and microservices. The series describes why a high-performance message queue that is omnipresent and easy to use is important for decoupling microservices. With a system like this (Kafka or MapR Streams), it's possible to decouple the backend data systems to power stand-alone services.

There's been a lot of innovation in open-source stream processing frameworks over the past year or so. Apache Spark 2.0 (more about the release below) introduces an alpha implementation of Structured Streaming, which aims to move Spark to the forefront of stream processing. Structured Streaming adds new features such as event-time processing and a simplified API that aligns with DataFrames. The Databricks blog has two posts about the new framework. The first motivates Structured Streaming and gives an overview of its architectural guarantees. The second describes the API and data model in more depth.
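To make "event-time processing" concrete, here is a minimal sketch in plain Python. It is not the Structured Streaming API and is not taken from the Databricks posts; the function name `tumbling_window_counts` and the sample events are hypothetical. It only illustrates the idea of grouping records by the timestamp they carry rather than by the order in which they arrive.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using each event's own timestamp.

    `events` is an iterable of (event_time_in_seconds, key) pairs.
    Arrival order does not matter: the window is derived from event time.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Align the event to the start of its fixed-size (tumbling) window.
        window_start = event_time - (event_time % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample data; note the events are deliberately out of order.
events = [(5, "a"), (12, "a"), (3, "b"), (7, "a"), (14, "b")]
result = tumbling_window_counts(events, window_seconds=10)
```

A real streaming engine also has to decide how long to wait for late events before finalizing a window, which this sketch leaves out.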

Confluent has posted an overview of running Apache Kafka in AWS. As someone who searched long and hard for similar advice in the past, I find this guide very welcome. It covers the major decisions across storage, instance types, network configuration, availability zones, and more.

The AWS Big Data blog has a tutorial for configuring Amazon EMR with the Spark JobServer, which allows for REST-style Spark job submission and monitoring.

The Cloudera blog has a tutorial that demonstrates email archival to HDFS. The article describes building a pipeline with Apache James to act as an email journal recipient that writes mail out to disk, Apache Flume for sending files to Apache Kafka, and Apache Spark for combining and archiving files to HDFS.

Netflix has written about their use of Apache Mesos for services and data processing. Their Mesos cluster uses Mantis for stream processing, Titus for Docker, and Meson for workflow orchestration and scheduling.


Two Hadoop-ecosystem projects graduated from the Apache Incubator this week. Apache Kudu, the data storage system optimized for both scans and random access, is close to a 1.0 release. Apache Twill, which is an abstraction layer for Apache Hadoop YARN, released version 0.7.0-incubating in January.


Apache Spark announced the milestone 2.0.0 release this week. There are a number of changes in the release. Highlights include unification of the DataFrame and Dataset APIs, support for SQL:2003, substantial speedups, and Structured Streaming (alpha).

Apache Mesos also hit a major milestone with the project's 1.0 release. Highlights of the release include a new HTTP API, support for Docker containers without the Docker daemon, GPU support, and fine-grained authorization.

Apache Kylin 1.5.3 was released this week. Kylin is an OLAP system for big data. This release adds support for Amazon EMR, improves support for MapReduce job queues, and fixes a number of bugs.

Google Cloud Dataflow has announced beta support for Python via version 0.4.0 of the Cloud Dataflow SDK for Python.


Curated by Datadog



Building a Real-Time Streaming Platform Using Kafka Streams and Kafka Connect (Mountain View) - Wednesday, August 3

Apache Spark Streaming with Apache NiFi and Apache Kafka (Santa Clara) - Wednesday, August 3

Better Together: Fast Data with Apache Spark and Apache Ignite (San Francisco) - Thursday, August 4

Technical Workshop on Spark SQL (Palo Alto) - Friday, August 5


Building a Data Pipeline for Ad Tech Using Spark, Impala, and Zoomdata (Tempe) - Wednesday, August 3


Apache Zeppelin: Meet the Creator (Addison) - Monday, August 1


Hadoop in High-Tech Manufacturing (Saint Paul) - Thursday, August 4


August Orlando Devs Meetup (Orlando) - Thursday, August 4


Solving the Big Data Security Puzzle (Atlanta) - Wednesday, August 3


A View on the World of Streaming: Use Cases, Challenges and Solutions (Munich) - Wednesday, August 3


Scio: Scala API for Apache Beam (Warsaw) - Wednesday, August 3


Apache Spark Meetup (Auckland) - Tuesday, August 2