Data Eng Weekly

Hadoop Weekly Issue #199

08 January 2017

This week's edition is relatively short, but contains some great posts on Hadoop+S3, Hadoop+Python, first-class asynchronous processing with Samza, and two new tools for Kafka and ZooKeeper. There are also a couple of year end/ahead posts from Datanami and Databricks plus a great interview with one of the authors of Cassandra, the Definitive Guide.


This post has a list of important settings and practices for using Amazon S3 with Apache Hadoop and Apache Spark. Following these tips should improve performance and work around some of the problems associated with S3's blob store semantics.

Python data tool support for interacting with Hadoop and the larger ecosystem improved drastically in 2016. The first post describes the strides made (and plans for 2017) with Apache Arrow, Apache Parquet, the Feather file format, PySpark, and Ibis. The second post looks at the performance and maturity of several Python libraries for reading data from HDFS.

MapR has the second part of a blog post that walks through using Spark's k-means machine learning algorithm to do real-time clustering of Uber data. The first post focussed on model creation, and this post adds a Spark Streaming job to apply the classifications and then a second job to produce a dataframe for analysis using Spark SQL.

While much of the recent stream processing news has focussed on Spark, Flink, and Kafka, the Apache Samza project continues to be used by LinkedIn and other companies. Whereas those other systems simplify the programming model to be synchronous and stream/event-based, Samza is experimenting with a different, asynchronous model in which callbacks are used to efficiently support RPCs and other asynchronous operations. To support these semantics, Samza has implemented an event loop, which is described in this post on LinkedIn's engineering blog.
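
As a rough illustration of the callback-driven style (this is not Samza's API; it is a minimal sketch using Python's asyncio, and the names fake_rpc, process_stream, and max_in_flight are hypothetical), an event loop can keep several remote calls in flight at once and run a callback as each one completes:

```python
import asyncio


async def fake_rpc(message, delay=0.01):
    """Hypothetical stand-in for a remote call made while processing a message."""
    await asyncio.sleep(delay)
    return message.upper()


async def process_stream(messages, max_in_flight=2):
    """Process messages on an event loop with at most max_in_flight pending calls."""
    semaphore = asyncio.Semaphore(max_in_flight)
    results = []

    def on_complete(task):
        # Callback invoked by the event loop when a message finishes processing.
        results.append(task.result())

    async def handle(message):
        # The semaphore bounds how many remote calls are outstanding at once.
        async with semaphore:
            return await fake_rpc(message)

    tasks = []
    for message in messages:
        task = asyncio.ensure_future(handle(message))
        task.add_done_callback(on_complete)
        tasks.append(task)

    # Wait for every message; the loop interleaves the pending calls meanwhile.
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(process_stream(["a", "b", "c"])))
```

Results arrive in completion order rather than input order, which is one of the trade-offs an asynchronous model has to manage.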

MapR has a tutorial that describes using Spark to run images through the Tesseract open-source OCR engine and storing the parsed text in an Elasticsearch index.

The morning paper is going to be covering some great distributed systems papers (including Apache Hadoop YARN) this week. In preparation, this post has links to several pieces of background reading from previous posts.


Datanami has an article containing 2017 outlooks from a number of big data industry executives. There is quite a variety of opinions, including that Hadoop will take off (and die off) and that 2017 is the year that BI analytics will finally deliver.

The Databricks blog summarizes some of the major accomplishments and milestones that Spark and Databricks hit in 2016. These include support for SQL-2003, the CloudSort Record, and Structured Streaming.

Confluent has their monthly Log Compaction newsletter that includes coverage of current Kafka Improvement Proposals (including proposals for global tables and single message transformations in Kafka Connect) and several hand-picked articles and presentations.

This post describes how Google Cloud Platform's per-minute billing and fast boot times allow you to build a job-first data pipeline, rather than a cluster-first one. While other cloud vendors offer similar setups (Amazon EMR is the most notable one), this article highlights some of the competitive advantages (e.g. fast SSDs, cheap preemptible VMs) that Google offers.

After six years, there's a new edition of Cassandra, the Definitive Guide. InfoQ has an interview with the book's co-author Jeff Carpenter about what's new in the book (it covers up through Cassandra 3.0), some of the new features in recent Cassandra releases, Cassandra's multi-datacenter support, integration with Spark and other ecosystem projects, and more.

The Call For Papers for Kafka Summit New York, which takes place in May, closes in just over a week. The conference tracks are Systems, Streaming Data Pipelines, and Stream Processing.


For Apache Kafka cluster operations, this project provides a script to analyze cluster state to determine which brokers may be responsible for under-replicated partitions.
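
The general idea can be sketched in a few lines of Python. This is not the project's code; the function name suspect_brokers and the input layout are hypothetical, modelled on the replica and in-sync replica (ISR) lists that Kafka reports for each partition. A broker that is assigned a partition but absent from its ISR is counted as a suspect:

```python
from collections import Counter


def suspect_brokers(partitions):
    """Rank brokers by how many partitions they are assigned but not in sync for.

    partitions: iterable of dicts with "replicas" and "isr" lists of broker ids
    (a hypothetical layout for this sketch).
    """
    missing = Counter()
    for partition in partitions:
        # Brokers assigned to the partition that have dropped out of the ISR.
        out_of_sync = set(partition["replicas"]) - set(partition["isr"])
        missing.update(out_of_sync)
    # Most frequently missing brokers first.
    return missing.most_common()


if __name__ == "__main__":
    sample = [
        {"topic": "a", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2]},
        {"topic": "a", "partition": 1, "replicas": [2, 3, 1], "isr": [2]},
        {"topic": "a", "partition": 2, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    ]
    print(suspect_brokers(sample))
```

A broker that tops this list across many partitions is a reasonable first place to look, though a slow leader or a network issue can produce the same symptom.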

Burry is a new tool for performing backups (and restores) of Apache ZooKeeper, etcd, and Consul to local storage, blob storage (such as Amazon S3), and more.


Curated by Datadog



Spark SQL: 10 Things You Need to Know (San Diego) - Tuesday, January 10

Apache Spark Meetup @ Workday (San Francisco) - Tuesday, January 10

DevOps for Data Science: Lifecycle of Big Data Analytics Services (San Francisco) - Wednesday, January 11

Airflow Meetup 1Q17 (San Francisco) - Wednesday, January 11

2017 Kickoff: Cloudera Lightning Talks (Palo Alto) - Wednesday, January 11

Tech Talk: Processing IoT Data with Apache Kafka (Mountain View) - Thursday, January 12


The Apache Solr Smart Data Ecosystem (Plano) - Monday, January 9

A Brief Introduction to Scala (San Antonio) - Tuesday, January 10


Join Doug Cutting, the Creator of Hadoop, for Apache Hadoop: The Next 10 Years (Saint Paul) - Monday, January 9


Lambda Architecture and Data Mining! (Grand Rapids) - Wednesday, January 11


Introduction to Kafka Streams with a Real-Life Example (Tysons) - Wednesday, January 11


Data as a Log + Asana Live Demo (Zapopan) - Wednesday, January 11


Splice Machine: Architecture of an Open Source RDBMS Powered by HBase and Spark (Barcelona) - Thursday, January 12