Data Eng Weekly

Hadoop Weekly Issue #194

20 November 2016

Thanksgiving is this week in the US, so I've collected quite a few technical and news articles to read if you have downtime. Hortonworks also announced their "HDCloud" offering for AWS this week, and there's a great presentation detailing the work that was done to make Hadoop work better with S3.


The PayPal engineering blog has a two-part post describing their adoption of Apache Kafka and Apache Spark. The first part describes how they got from their old architecture to a new, Kafka-based one. The second part describes their lambda architecture based on Spark Streaming on YARN.

This post is a follow-up to recent benchmarks that Yahoo and others performed to compare throughput and latency of several stream processing tools. In it, the DataTorrent team compares Apache Flink and Apache Apex. It shows Apex to have lower latency than Flink, but the main takeaway is that we need better benchmarks: there are many more use cases to consider, and it seems possible to push these systems even harder than previous tests have.

This presentation from ApacheCon describes the work done to make the Hortonworks Data Platform faster on AWS (with data in the S3 object store) than Amazon EMR. The slides include an introduction to object store semantics, some of the improvements that make queries to S3 faster, benchmarks showing a 2.5x speedup over a version without the enhancements, future plans for S3Guard to provide consistent metadata access to data in S3, and more.

Databricks has a post describing work to win the CloudSort benchmark, which is a measure of the cost required to sort 100 TB of data. The post has an overview of the evolution of sorting benchmarks over the last several years (from DataMation Sort to TeraSort to GraySort) and a description of the record-setting setup.

MapR is hosting a free (behind an email-wall) eBook version of "Introduction to Apache Flink" by Ellen Friedman (MapR) and Kostas Tzoumas (data Artisans). The book has six chapters covering streaming, stateful computation, batch and more.

This post provides an overview (by way of example) of 30 command-line tools from Apache Hadoop HDFS.

In addition to Spark Streaming, PayPal is using Apache Storm for real-time computation. This post describes how the carrier payments team uses Storm to read data from Kafka and write it out to Apache Hive.

Amazon Kinesis, which is often compared to Apache Kafka, has strict write limits per shard (at most 1,000 records or 1 MB per second). For cases where a stream exceeds these limits, Amazon now offers an UpdateShardCount API call. This post describes how to trigger an update from CloudWatch using a Lambda function.
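As a rough illustration of the idea (not the code from the linked post), here is a minimal Python sketch that turns the per-shard limits quoted above into a target shard count and passes it to UpdateShardCount. The function names, the headroom factor, and the reading of 1 MB as 1 MiB are assumptions made for this example; `update_shard_count` is the boto3 method for the API call, and the client is passed in so the sketch carries no AWS dependency of its own.

```python
import math

# Per-shard write limits quoted in the item above.
MAX_RECORDS_PER_SEC = 1000
MAX_BYTES_PER_SEC = 1024 * 1024  # assumption: "1 MB" read as 1 MiB


def shards_needed(records_per_sec, bytes_per_sec, headroom=1.25):
    """Smallest shard count that fits the observed load plus headroom.

    `headroom` is a hypothetical safety factor, not a Kinesis setting.
    """
    by_records = records_per_sec * headroom / MAX_RECORDS_PER_SEC
    by_bytes = bytes_per_sec * headroom / MAX_BYTES_PER_SEC
    return max(1, math.ceil(max(by_records, by_bytes)))


def resize_stream(kinesis_client, stream_name, records_per_sec, bytes_per_sec):
    """Ask Kinesis to resize the stream via a boto3-style client."""
    target = shards_needed(records_per_sec, bytes_per_sec)
    kinesis_client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",
    )
    return target
```

In a Lambda function, the two rates would typically come from the CloudWatch alarm or metric that triggered it, and `kinesis_client` would be `boto3.client("kinesis")`. The API places its own limits on how far a single call can scale a stream, so a real handler should check the current documentation and handle rejected requests.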

This tutorial, aimed at data analysts who don't yet have exposure to HDFS, describes how to use Apache Sqoop to export data from a PostgreSQL database to HDFS.

The Databricks blog has a tutorial that shows how to use Amazon Kinesis, Amazon Redshift, and Databricks (Spark) to model asset levels. Among other things, the post demonstrates Spark's linear regression library to fit a model to the sample data.

Amazon EMR now supports auto scaling to dynamically grow or shrink a cluster based on CloudWatch metrics. This post has more details about how it works and how to set up auto scaling rules.
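To give a feel for what such a rule looks like, here is a hedged Python sketch that builds a policy document in the shape the EMR auto scaling API expects, as best I understand it. The helper name, the rule name, the capacity bounds, and the 15 percent threshold are illustrative assumptions rather than values from the linked post; check the field names against the EMR documentation before relying on them.

```python
def scale_out_rule(name, metric, threshold, add_instances=1, cooldown=300):
    """Build one scale-out rule: add instances when `metric` drops below `threshold`.

    Hypothetical helper; the dict layout mirrors EMR's auto scaling policy JSON.
    """
    return {
        "Name": name,
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": add_instances,
                "CoolDown": cooldown,  # seconds to wait before the next scaling action
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "EvaluationPeriods": 1,
                "MetricName": metric,
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": "PERCENT",
            }
        },
    }


# Example policy: keep the instance group between 2 and 10 nodes and add one
# node when available YARN memory averages below 15 percent.
policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [
        scale_out_rule("ScaleOutOnLowMemory", "YARNMemoryAvailablePercentage", 15.0)
    ],
}
```

With boto3, a policy like this would be attached to an instance group through the EMR client's `put_auto_scaling_policy` call. A matching scale-in rule (negative adjustment, `GREATER_THAN` comparison) is usually added so the cluster shrinks again when load drops.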

This post, which summarizes a recent paper "A Critique of the CAP Theorem" by Martin Kleppmann, provides an excellent introduction to the CAP theorem and linearizability. There are also several links for further reading.

This presentation from the recent Apache Kafka Bay Area meetup has a good overview of Kafka Streams. In particular, there are some good slides on KTable and the new interactive query support recently added to Kafka.


At ApacheCon this week, the ODPi announced version 2.0 of the runtime specification (based on Apache Hadoop 2.7) and the first version of the operations specification. For those who aren't familiar, the ODPi aims to define specifications that provide compatibility and compliance guidelines for Hadoop distributions (see the press release for more details).

MapR has a multi-part blog series with an overview and excerpts from "BI & Analytics on a Data Lake" by Sameer Nori and Jim Scott. Part two covers the differences between data warehouses, MPP data warehouses, and Hadoop.

This post has a good introduction to AtScale, a middleware layer that sits between BI tools and Hadoop. This week they announced support for Teradata DW, Google Dataproc, and BigQuery.

Qubole announced this week that you can now get their Big Data-as-a-Service system through AWS Marketplace SaaS subscriptions. This makes their service available to folks with an AWS account without requiring a separate bill from Qubole.

CIOReview has a post by Hortonworks Founder and VP of Engineering Arun C Murthy. In it, he talks about how open-source is changing the governance and security model, machine/deep learning, and cloud. He also has some advice for a company working with open-source.

In this post, the author argues that Google BigQuery and other cloud services will do to big data what Gmail did to email (i.e., "unlimited" storage and a very easy-to-use system).


Apache Spark 2.0.2, a maintenance release for the 2.0 branch, came out this week. Notably, support for Apache Kafka 0.10 was added to Structured Streaming.

The Hortonworks Data Cloud is now available on AWS Marketplace.


Curated by Datadog


New York

Two Talks: DSE Graph & Event Streaming with Kinesis (New York) - Tuesday, November 22


Integrating Real-Time Data Streams with Spark and Kafka (Montreal) - Tuesday, November 22


Hadoop, Big Data, and Watson (Buenos Aires) - Monday, November 21


Spark in Production at IBM (London) - Wednesday, November 23


Hands on Hadoop (Oslo) - Wednesday, November 23


Stockholm Erlang Meetup (Stockholm) - Wednesday, November 23

Big Data and Docker (Malmo) - Thursday, November 24


WebTechNight Karlsruhe (Karlsruhe) - Tuesday, November 22


HBase and Elasticsearch at Socialbakers (Prague) - Thursday, November 24


Big Data Ecosystem and Apache Spark Architecture (Ankara) - Thursday, November 24


Elastic Is in Town Again! (Tel Aviv-Yafo) - Thursday, November 24


Spark Meetup November (Sydney) - Tuesday, November 22

Rethink SQL for Big Data with Apache Drill (Pyrmont) - Thursday, November 24