Data Eng Weekly

Hadoop Weekly Issue #214

30 April 2017

While cloud vendors have been adding impressive features to their SaaS SQL-on-big-data engines, other vendors are trying to demonstrate the benefits of their non-hosted products. A couple of weeks ago, Hortonworks was blogging about Hive LLAP, and this week Cloudera is touting Impala. Regardless of which camp you're backing, the competition is good for end users.


Amazon Redshift has added new mechanisms to optimize a cluster by performing certain tasks when a rule (written in their new engine) is violated. For example, they show how to terminate a long-running query that is generating too many rows or using too much CPU and not completing after a certain amount of time. The system also has the ability to log details or "hop" a query to a different queue.
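To make the rule idea concrete, here is a minimal Python sketch of how such a rule could be evaluated. It is not Redshift's implementation or configuration format: the Predicate and Rule classes and the actions_for helper are hypothetical, and the metric names are only modeled on the kind of per-query metrics the post describes. The sketch assumes a rule fires only when all of its predicates hold, and that the resulting action is one of log, hop, or abort.

```python
import operator
from dataclasses import dataclass

# Comparison operators a predicate may use (illustrative subset).
_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate: compare one query metric against a threshold."""
    metric: str
    op: str
    value: float

    def holds(self, metrics):
        # A metric that was not reported is treated as 0.
        return _OPS[self.op](metrics.get(self.metric, 0), self.value)


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: fires only when every predicate holds."""
    name: str
    predicates: tuple
    action: str  # assumed to be "log", "hop", or "abort"

    def triggered(self, metrics):
        return bool(self.predicates) and all(p.holds(metrics) for p in self.predicates)


def actions_for(rules, metrics):
    """Return (rule name, action) pairs for every rule the query violates."""
    return [(r.name, r.action) for r in rules if r.triggered(metrics)]


# Example rules; names and thresholds are made up for illustration.
RULES = [
    Rule(
        "too_many_rows",
        (
            Predicate("return_row_count", ">", 1_000_000),
            Predicate("query_execution_time", ">", 120),
        ),
        "abort",
    ),
    Rule("cpu_heavy", (Predicate("query_cpu_time", ">", 600),), "hop"),
]
```

Calling actions_for(RULES, metrics) with the metrics of a running query returns the actions that would apply. In a real cluster the rules are defined in the workload management configuration rather than in application code.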

Apache Ambari, which is used for managing Hadoop clusters, has a new feature for interacting with Apache Hive in version 2.5. Dubbed "Hive View 2.0," the feature lets Ambari perform most operations against the Hive metastore, including viewing and computing table/column statistics (which are used by the optimizer). Additionally, the new release includes a visualizer for the Hive optimizer (to visualize EXPLAIN output).

With the caveat that it's important to cast a critical eye on vendor benchmarks (and try with your own data), Cloudera has posted impressive benchmark numbers for Impala. They've compared it to Greenplum and several Hadoop-ecosystem engines (including Spark SQL, Hive with LLAP, and Presto) using data from the TPC-DS kit in both multi-user and single-user scenarios.

The Hortonworks blog has an overview of Livy, which is a server that exposes HTTP APIs for interacting with a Spark cluster. The post looks at both the programmatic API and the RESTful endpoints, which can be used to run both a batch job and an interactive session. The post also describes Livy's security and high availability features.
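As a rough illustration of the RESTful side, the sketch below builds the JSON requests for Livy's /batches, /sessions, and /sessions/{id}/statements endpoints using only the Python standard library. The host and port, the helper names, and the example file path and class name are assumptions made for illustration, and the requests are only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a Livy server reachable on localhost at port 8998.
LIVY_URL = "http://localhost:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def batch_request(app_file, class_name, args=()):
    """POST /batches: submit a packaged application as a batch job."""
    payload = {"file": app_file, "className": class_name, "args": list(args)}
    return livy_request("/batches", payload)


def session_request(kind="pyspark"):
    """POST /sessions: start an interactive session of the given kind."""
    return livy_request("/sessions", {"kind": kind})


def statement_request(session_id, code):
    """POST /sessions/{id}/statements: run a code snippet in a session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})
```

Passing one of these objects to urllib.request.urlopen would send it to a running server. A real client would also poll the returned batch or statement for its state and handle whatever authentication the cluster requires.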

The Databricks blog has a tutorial that describes how to use Spark's structured streaming with Apache Kafka as the source (and sink) of data. The post includes a real-world example of processing JSON event data from a Nest device.

Cloudera has two posts on their new Cloudera Data Science Workbench. In the first, they describe how to write a PySpark job that uses a Python library (Python packaging is different from the JVM's JAR files, which is often a pain point in these types of applications). In the second, they show how to use BigDL (a deep learning library for Apache Spark) with the workbench.

The LinkedIn blog has a post about how they've transformed testing for their Content Analytics system. By building out a number of mock functions and libraries, they're able to write unit tests for the Kafka/Samza jobs that execute in a few minutes. Previously, testing was UI-driven and took days or weeks to complete.

The StreamSets Data Collector uses the same expression language (EL) that JSP supports, and it's easy to write a custom function, as described in this post.


Apache Metron, which is a security-focused analytics system built on the Apache Hadoop ecosystem, has graduated from the Apache incubator.

Videos of presentations for Flink Forward San Francisco, which took place earlier this month, have been posted.

It's been just over a year since Apache Apex became a top-level project. The post celebrates community growth (number of contributors, pull requests, and others), adoption (including a few examples), the major features released over the past year, and the roadmap for the project looking forward.

Cloudera's IPO was Friday. Shares went for $15 and jumped 20% the first day, bringing Cloudera's market cap to $2.3 billion.


Version 0.11.0-incubating of Apache PredictionIO was released. PredictionIO is a framework for building predictive services, and it is built on Apache Spark, Apache HBase, Spray, and Elasticsearch.

Version 0.3.0 of Scio, the Scala API for Apache Beam, was released. This is the first non-beta release that is built on the Apache Beam SDK rather than Google Cloud Dataflow SDK.

Apache Kafka was released to fix a number of bugs in the release.

Apache Gearpump (incubating) has released version 0.8.3-incubating. Gearpump is a streaming engine built on the Actor model.


Curated by Datadog



Interacting with Spark: Beyond the Basics (San Diego) - Wednesday, May 3

Hands-On Workshop in Real-Time Analytics, Stream Processing with Twitter Heron (Santa Clara) - Saturday, May 6


Near Real-Time Ingest with StreamSets Data Collector (St. Louis) - Wednesday, May 3


Meetup #9 (Vilnius) - Wednesday, May 3


Scaling Data Pipelines + Master BigQuery and Redshift (Athens) - Wednesday, May 3


An Informal Session with Cloudera and Doug Cutting, the Father of Hadoop (Auckland) - Tuesday, May 2
