Data Eng Weekly

Hadoop Weekly Issue #234

01 October 2017

Strata Data Conference was this week, so a lot of announcements and releases were scheduled around it. Among them, two new open-source projects are worth checking out: Vespa from Yahoo/Oath and Wallaroo from Wallaroo Labs. In addition to a bunch of other releases, there are great technical articles on Hortonworks' testing strategy, Spark GraphFrames, and MapR-DB architecture.


These are the first two posts in a series on how Hortonworks builds and tests their distribution. Validating a single change set is a large undertaking, as there can be as many as 30 downstream applications that need to be tested. Running all those unit tests in serial takes 6 hours, so the Hortonworks team is using YARN to run them in a distributed manner (including running YARN in YARN, which they call yinception).

Hortonworks has posted a benchmark comparing performance of Hive on HDP 2.5 vs 2.6. The new version is faster due to improvements in the optimizer, vectorization, and more. The post also includes a contentious comparison to Impala (see the comments for more details). As always, be sure to consider your use case rather than relying on benchmarks from a vendor.

The Bay Area Apache Spark Meetup had a meeting in early September with presentations from Aruba on data correlation using PySpark (including lessons learned in joining data) and Databricks on GraphFrames. The slides and videos from both presentations are on the Databricks website.

If you've been meaning to try out Amazon Athena (which is powered by Presto), there are two tutorials this week that make it easy to get going. The first shows how to analyze AWS cost and usage data, and the second pulls in data from the CDC's Behavioral Risk Factor Surveillance System for analysis. Both tutorials make use of other AWS services such as Lambda and Glue.

The IBM blog has a tale of debugging performance problems that appeared when growing an HBase cluster. The post includes an analysis of the symptoms and the root cause.

MapR has a great technical and architectural comparison of MapR-DB with Apache HBase and Apache Cassandra. The article spends some time describing the trade-offs of the Log Structured Merge (LSM) trees that power HBase and Cassandra, including read and (async) write amplification. MapR-DB leverages the random read/write semantics supported by its file system to implement a hybrid LSM/B-tree indexing strategy. Overall, there are lots of interesting details in the post.
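To make the amplification trade-off concrete, here is a minimal toy sketch in Python. It is not how HBase, Cassandra, or MapR-DB are implemented; the `ToyLSM` class, its method names, and the counters are hypothetical and exist only for illustration. It shows that a lookup may have to search several sorted runs (read amplification), and that compaction reduces that cost by rewriting data that was already written once (write amplification).

```python
import bisect


class ToyLSM:
    """Toy LSM store (illustrative only): a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                # newest run first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit
        self.entries_written = 0      # entries written out by flushes and compactions

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new immutable sorted run."""
        run = sorted(self.memtable.items())
        self.runs.insert(0, run)
        self.entries_written += len(run)
        self.memtable = {}

    def get(self, key):
        """Return (value, runs_searched); more runs searched means more read amplification."""
        if key in self.memtable:
            return self.memtable[key], 0
        for searched, run in enumerate(self.runs, start=1):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1], searched
        return None, len(self.runs)

    def compact(self):
        """Merge all runs into one; every surviving entry is rewritten."""
        merged = {}
        for run in reversed(self.runs):   # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())]
        self.entries_written += len(merged)


db = ToyLSM(memtable_limit=2)
for i, key in enumerate(["a", "b", "c", "d", "e", "f"]):
    db.put(key, i)

before = db.get("a")                  # the oldest key lives in the oldest run
db.compact()
after = db.get("a")
write_amplification = db.entries_written / 6
print(before, after, write_amplification)
```

In this toy, finding the oldest key means searching all three runs before compaction and only one afterwards, at the price of writing every entry a second time. Real systems reduce the read cost further with Bloom filters and block indexes, which this sketch leaves out.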

The Confluent blog has a post describing how to use Kafka to power a machine learning application. Kafka is used to store feature data, model params, training data, and more. The pieces of the model building and evaluation pipeline are built with Kafka Connect, KSQL, and Kafka Streams.


Strange Loop was this week in St. Louis. Videos of the presentations, many of which cover topics in distributed systems and data engineering, have been posted on YouTube.

Strata Data Conference was this week in New York. Datanami has a post about how other technologies, along with Hadoop's own complexity, are drowning out Hadoop.

Apache Impala (incubating) has announced a CVE for an information disclosure vulnerability. The announcement includes a mitigation strategy until the fix comes out in the next release.


Version 2.7.0 of Luigi was released. It includes fixes and improvements.

Apache Storm 1.0.5 was released with seven bug fixes.

Version 0.7.3 of Apache Zeppelin was released. It contains a number of bug fixes and minor improvements.

Wallaroo Labs has open-sourced their Wallaroo streaming data processing engine. Wallaroo is elastic and scalable, with state management and failure recovery built in. It is developed in an actor-based, non-JVM language called Pony. Wallaroo source is licensed partially under the Apache License and partially under the Wallaroo Community License, which has some restrictions.

Hortonworks has announced a new Hortonworks Dataplane Service aimed at hybrid cloud as well as non-Hadoop workloads. It provides a data services catalog, security controls, and integration with external sources. While it's built on Apache Ranger and Apache Knox, the technical and integration details are still forthcoming. Datanami has some details about the offering based on the press release and interviews.

Oath, the parent company of Yahoo, has open-sourced their big data serving system, Vespa. It powers many websites, including Yahoo News and Flickr. Unlike some solutions that build complete pre-materialized views for the serving layer, Vespa is a hybrid solution in which the serving layer does distributed calculations over data feeds.

Version 0.4.2 of Scio, the Scala library for Apache Beam, has been released. There are a bunch of bug fixes, library upgrades, and new features.

The fall release of BlueData EPIC includes support for host tags, support for GPU acceleration, additional security features, and support for the Google Cloud Platform and Microsoft Azure. The first post has high-level details on the new features of the big data as a service provisioning engine. The second takes a deeper look at using the GPU acceleration with the BigDL library for deep learning.

MapR-DB 6.0 was released. New features include native secondary indexes, optimized Apache Drill integration, native Spark/Hive connectors, and change data capture.

Cloudera announced that their big data platform as a service offering, Altus, has added support for Microsoft Azure. They have a tutorial for getting started.

Qubole has announced support for the Microsoft Azure Data Lake Store in their Qubole Data Service.


Curated by Datadog



Reference Architecture for In-Stream Processing Service Using Spark Streaming (Houston) - Tuesday, October 3


Making Big Data Easy: Building a Self-Service Data Platform in the Cloud (Oklahoma City) - Thursday, October 5


Kafka Overview (Saint Louis) - Wednesday, October 4


Apache Beam Meetup: Introduction + Use Case + State & Timers (London) - Tuesday, October 3


Big Data & Data Science (Paris) - Monday, October 2

Spark Meetup @ Criteo (Paris) - Tuesday, October 3


Cloudera in the Cloud (Kontich) - Wednesday, October 4


Apache Spark & Co. (Berlin) - Thursday, October 5


Modern Data Lake: MySQL, Mongo, HBase, Solr, Hive (Prague) - Tuesday, October 3

Red Hat & Hortonworks: Follow the Open Source Leaders (Brno) - Tuesday, October 3

Open Source Innovations in Hadoop and the Cloud (Prague) - Wednesday, October 4


Big Data and Databases (Szeged) - Tuesday, October 3

What's New in Hadoop 3.0? Meetup with Daniel Templeton (Budapest) - Tuesday, October 3


Introduction to Structured Streaming (Bangalore) - Saturday, October 7