Data Eng Weekly

Hadoop Weekly Issue #193

13 November 2016

Welcome to a double-issue of Hadoop Weekly. There's lots of breadth in this week's issue—from Apache Avro to Apache Spark and everything in between.


The Cloudera blog has a post describing how the new Apache Oozie database migration tool works to maintain job configuration and history during Oozie upgrades.

MapR's latest whiteboard walkthrough covers Apache Drill's query optimizer. Built on Apache Calcite, the optimizer implements both rule-based optimizations (e.g. projection push-down, partition pruning) and cost-based optimizations (e.g. reordering joins).

The morning paper has a recap of a 2015 paper from Databricks about some of the changes they implemented in Spark based on their customers' experience. While some things have been covered elsewhere (e.g. the optimizations), there's also discussion of internals that I hadn't come across before, like their switch to Netty and assumptions about HDFS block sizes.

For the distributed systems folks, this is an interesting presentation on Flexible Paxos—i.e. the ability to reach consensus without majorities.
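As a rough illustration of the idea (not code from the presentation): Flexible Paxos observes that it is enough for every leader-election (phase 1) quorum to intersect every replication (phase 2) quorum, which for simple size-based quorums means the two sizes must add up to more than the number of nodes. The function names below are hypothetical, and the sketch only checks quorum sizes; it does not implement the protocol.

```python
from itertools import combinations


def quorums_intersect(n: int, q1: int, q2: int) -> bool:
    """Size-based check: any phase-1 quorum of q1 nodes overlaps any
    phase-2 quorum of q2 nodes exactly when q1 + q2 > n."""
    return q1 + q2 > n


def brute_force_intersect(n: int, q1: int, q2: int) -> bool:
    """Verify the same property by enumerating every pair of quorums.
    Only practical for small n; included to sanity-check the formula."""
    nodes = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(nodes, q1)
        for b in combinations(nodes, q2)
    )


if __name__ == "__main__":
    # Classic Paxos on 5 nodes: both quorums are majorities.
    print(quorums_intersect(5, 3, 3))
    # Flexible Paxos on 5 nodes: replicate to only 2 nodes,
    # at the cost of needing 4 nodes for leader election.
    print(quorums_intersect(5, 4, 2))
    # Unsafe: the two quorums can miss each other entirely.
    print(quorums_intersect(5, 3, 2))
```

The trade-off is that a smaller replication quorum makes the common path cheaper, while the larger election quorum makes recovery after a leader failure harder to complete.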

This month's Log Compaction post, which covers news in the Apache Kafka community, describes several Kafka improvements that are underway (including improvements for multi-tenancy) and links to posts on Kafka at Walmart, Unit Testing Kafka, and a great explanation of encryption for Kafka messages.

The IBM Hadoop Dev blog has a post highlighting several presentations from the recent World of Watson conference. The speakers covered various themes in healthcare, fraud detection, and marketing.

This presentation gives an introduction to Hivemall, which is a new Apache incubator project for machine learning on Apache Spark, Apache Hive, and Apache Pig. It's been around outside of the ASF for quite some time, though, and it has a fairly impressive feature set. The presentation describes use cases and example syntax for training and prediction.

Apache Avro is a well-supported file format throughout the Hadoop ecosystem due to its compact encoding and support for schema evolution. This post describes how it can be used with Hive, including how to add or remove columns from the Hive definition in a backwards/forwards-compatible way.
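To make the schema evolution point concrete, here is a simplified model of how an Avro reader schema is applied to a record written with an older schema: fields the reader doesn't know about are dropped, and fields missing from the data are filled from the reader's declared default. This is a plain-Python sketch, not the Avro library or the linked post's code; the `resolve` helper and the example field names are hypothetical, and it ignores type promotion and aliases.

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Project a decoded record onto a reader schema's field list.

    reader_fields follows the shape of the "fields" array in an Avro
    record schema: dicts with "name", "type", and optionally "default".
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep its value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added after the data was written: use the default.
            resolved[name] = field["default"]
        else:
            # A new field without a default cannot read old data.
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from the reader are dropped.
    return resolved


# Data written with an older schema that had "id" and "name".
old_record = {"id": 1, "name": "alice"}

# Newer reader schema: "name" removed, nullable "email" added with a default.
new_fields = [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": None},
]

print(resolve(old_record, new_fields))
```

This is why added columns generally need a default to stay compatible with existing files: without one, older data has nothing to supply for the new field.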

Databricks has announced a new documentation resource for their own product as well as for Apache Spark. The Spark materials include tutorials, a SQL language manual, training materials, and more.

The Hortonworks blog has a post that describes Apache MiNiFi and outlines several use cases. MiNiFi aims to run where data is collected and can be either a C++ or Java agent.

Big Data Labs has a number of interesting Spark tutorials and use cases. This week, there's a new walkthrough on analyzing Capital Bikeshare historical trip data using a number of Spark's machine learning libraries.


This post has a look at the trade-offs and SLAs for Google's various storage and blob storage tiers (such as regional, nearline, and coldline). The author pulls together public details about Google's infrastructure and adds a bit of speculation to talk about how the various tiers are likely implemented.

SearchDataManagement has an article about how several companies are using Apache Spark for use cases ranging from web site personalization to bank analytics. Even in its adolescent state, Spark is gaining pretty wide adoption.

DataStax made some news in the open-source community last week by saying that many of their developers will be focussing on DataStax Enterprise rather than Apache Cassandra.

The Confluent blog has a post describing the history of non-JVM clients for Apache Kafka, the work that was done for simplifying the client protocol (so that clients don't depend on ZooKeeper), and Confluent's progress towards using the C-based client to power other non-JVM languages (like Python and Go).

Hortonworks reported earnings for Q3. They lost $64.7 million on $47.5 million in revenue.

Qubole and Oracle have announced that the Qubole Data Service is now generally available on the Oracle Bare Metal Cloud Service.

Flink Forward is taking place in San Francisco in April. Call for papers opens soon.


Amazon EMR 5.1.0 was recently released, and it's the first version in which Apache Flink is natively supported.

Altiscale has announced that they're supporting ACID transactions for Apache Hive on their Hadoop-as-a-Service platform.

Apache Fluo (incubating) is a system based on Google's Percolator for performing incremental updates on data stored in Apache Accumulo. Version 1.0.0-incubating was recently released.

Version 0.1.0-incubating of Apache S2Graph was released this week. S2Graph is a distributed graph processing system with a REST API, bulk loader, and more. It uses Apache HBase for storage.

Cloudera Labs has announced support for version 0.10.0 of YCSB, the benchmarking tool for NoSQL databases. There are a number of changes including support for Apache Solr, Google Cloud Datastore and Bigtable, and more.

Version 0.10.0 of Apache Knox, a REST API gateway for the Hadoop ecosystem, was released this week. The release includes improvements to LDAP, PAM support, and WebSocket support.

Apache Spark 1.6.3, the latest maintenance release in the 1.x family, was announced this week. It contains over 35 bug fixes and a number of improvements.


Curated by Datadog



Big Data Science Meetup (Mountain View) - Monday, November 14

Stream Computing: The Engineer’s Perspective (San Francisco) - Tuesday, November 15

#OCBigData Meetup #20 (Irvine) - Wednesday, November 16

Architecture of an Open Source RDBMS Powered by HBase and Spark (Mountain View) - Wednesday, November 16

Airflow Meetup (Redwood City) - Wednesday, November 16

Pulsar: Distributed Pub-Sub Messaging & Apache NiFi in Action (Mountain View) - Thursday, November 17


Building Recommendation Systems in Python Using Apache Spark (Seattle) - Tuesday, November 15

Security and Machine Learning with Apache Spark (Seattle) - Wednesday, November 16


H2O Sparkling Water on Azure Using HDInsight Spark (Dallas) - Wednesday, November 16

HA Spark Streaming with DataStax Enterprise and Confluent (Houston) - Wednesday, November 16


Apache Kudu with Kudu Founder Todd Lipcon (Saint Paul) - Thursday, November 17


Harnessing Data Within Hadoop in the Connected World (Cincinnati) - Tuesday, November 15

Future of Data: Cincinnati (Cincinnati) - Thursday, November 17


How a Streams-First Architecture Enables Real-Time Big Data (Atlanta) - Wednesday, November 16

North Carolina

November CHUG: Igniting Audience Measurement at Charter (Charlotte) - Wednesday, November 16


Introduction to HDInsight (Vancouver) - Wednesday, November 16


Big Data in AWS (Madrid) - Wednesday, November 16

Beyond Shuffling & Streaming Preview, by Holden Karau (Barcelona) - Thursday, November 17


Typescript & Flow + Apache Spark + Jigsaw (Kiel) - Thursday, November 17


PyData Amsterdam: The H2O Edition (Amsterdam) - Wednesday, November 16


Big Data Meetup: November 2016 (Budapest) - Tuesday, November 15


Operational Analytics Using Spark and Storm (Zagreb) - Tuesday, November 15


The Best of Hadoop Summit 2016 + Screening of Doctor Strange (Tel Aviv-Yafo) - Tuesday, November 15


November Meetup (Brisbane) - Tuesday, November 15