Data Eng Weekly

Hadoop Weekly Issue #192

30 October 2016

This week's issue has posts on new features in several recently released projects—Apache Kafka, Apache Ambari, and Apache NiFi. In news, there's a recap of the recent Spark Summit EU and a new partnership between Qubole and IBM.


The Confluent blog has an overview of interactive queries, a new feature in Apache Kafka 0.10.1. By exposing a new set of APIs, servers participating in a Kafka Streams application can now serve interactive queries based on their local state. This feature can greatly simplify the process of serving up the results of a streaming application, since an intermediate data store like HBase, Redis, or Cassandra is no longer needed.
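To make the idea concrete, here's a minimal sketch in plain Python. It does not use the Kafka Streams API: the `StreamInstance` class and the `owner` routing rule are hypothetical stand-ins, meant only to show each instance answering queries from its own local state rather than from a separate store.

```python
from collections import defaultdict


class StreamInstance:
    """Toy stand-in for one instance of a streaming app with local state."""

    def __init__(self):
        # Local state, updated as events are processed.
        self.counts = defaultdict(int)

    def process(self, key):
        self.counts[key] += 1

    def query(self, key):
        # Answered directly from local state; no external store involved.
        return self.counts.get(key, 0)


def owner(key, instances):
    # Hypothetical routing rule: a stable byte-sum hash picks the owning instance.
    return instances[sum(key.encode("utf-8")) % len(instances)]


instances = [StreamInstance(), StreamInstance()]
for word in ["kafka", "nifi", "kafka", "ambari", "kafka"]:
    owner(word, instances).process(word)

# A query is routed to whichever instance owns the key.
kafka_count = owner("kafka", instances).query("kafka")
```

In a real deployment the routing step would rely on metadata provided by the framework rather than a hand-rolled hash.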

This presentation gives a thorough overview of Apache Ambari, a tool for managing Hadoop clusters (and much more). Topics covered include metrics, alerts, log search, security setup, and cluster upgrades.

This article on DZone describes several open-source Online Analytical Processing (OLAP) systems—Apache Kylin, Druid, and Apache Lens. There are a number of links to background reading on each.

Whenever I post a benchmark, I caveat that the results aren't generally applicable. In fact, two vendors can often paint their own system as the best under similar constraints. Amazon Redshift and Google BigQuery seem to be in one such horse race right now. In this benchmark, Redshift is shown to be 6x faster than BigQuery on a TPC-DS workload.

One of the biggest limitations in distributed databases is the reliance on a primary key to distribute data. While some systems have come up with solutions to handle queries on other data elements, Replex is a fresh take on how to efficiently query on other columns. As usual, the morning paper has a great overview of the highlights of this USENIX 2016 best paper awardee.
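As a rough illustration of the core idea only (not the paper's implementation), the sketch below keeps one replica per indexed column, each partitioned on that column, so a lookup on either column touches a single partition. The table contents, the toy hash, and all function names are made up for the example, and it leaves out the paper's handling of failures entirely.

```python
ROWS = [
    {"id": 1, "city": "Oslo", "name": "Ada"},
    {"id": 2, "city": "Paris", "name": "Bo"},
    {"id": 3, "city": "Oslo", "name": "Cy"},
]
NUM_PARTITIONS = 2


def partition_of(value):
    # Stable toy hash so the same value always maps to the same partition.
    return sum(str(value).encode("utf-8")) % NUM_PARTITIONS


def build_replica(rows, column):
    # One full copy of the data, partitioned on the given column.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[partition_of(row[column])].append(row)
    return partitions


# Each replica holds every row but is partitioned on a different column.
replicas = {column: build_replica(ROWS, column) for column in ("id", "city")}


def lookup(column, value):
    # Only the one partition that can contain the value is scanned.
    partition = replicas[column][partition_of(value)]
    return [row for row in partition if row[column] == value]


oslo_ids = sorted(row["id"] for row in lookup("city", "Oslo"))
```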

Hortonworks recently ran a webinar on new features in Hortonworks Dataflow 2.0 (built on Apache NiFi 1.0). They've posted a summary of major highlights, the slides from the webinar, and several answers to questions from the audience. The post is a great way to get familiar with the main features of NiFi.

pgpool is a reverse proxy for PostgreSQL that is compatible with Amazon Redshift. This post describes how to use its caching features to reduce latency of repeated queries to a Redshift cluster by taking advantage of AWS ElastiCache.
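The general pattern, a proxy that caches results keyed by the query text, can be sketched in a few lines of Python. This is not pgpool's code or configuration: `CachingProxy` and `fake_warehouse` are hypothetical, a plain dict stands in for the external cache, and cache invalidation and expiry are omitted.

```python
import hashlib


class CachingProxy:
    """Toy query proxy that caches results by a hash of the SQL text."""

    def __init__(self, run_query):
        self.run_query = run_query  # callable that executes SQL on the backend
        self.cache = {}             # stand-in for an external cache service
        self.misses = 0

    def execute(self, sql):
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key not in self.cache:
            # Cache miss: go to the backend and remember the result.
            self.misses += 1
            self.cache[key] = self.run_query(sql)
        return self.cache[key]


def fake_warehouse(sql):
    # Hypothetical backend; a real one would send the query to the cluster.
    return [("rows for", sql)]


proxy = CachingProxy(fake_warehouse)
first = proxy.execute("SELECT count(*) FROM events")
second = proxy.execute("SELECT count(*) FROM events")  # served from the cache
```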

MapR has a tutorial describing how to get started with the MapR Sandbox by launching it on a Microsoft Azure VM. It looks like a great way to get familiar with Azure, MapR, or both.


This post introduces the notion of "lean big data" as a way to invest smartly in big data technologies, and it describes five common pitfalls that can lead to a failed project. These include deploying big data tech when you don't have big data, separating application and platform roles too soon, and building without a use case in mind.

Spark Summit EU was this week in Brussels. The Databricks blog has a recap of keynotes and technical presentations from the Databricks team. The major themes seem to be real-time processing and advanced data science (such as deep learning). The post also mentions that videos from the conference are expected to be published online this week.

IBM and Qubole announced a partnership in which the Watson Data Platform can leverage the Qubole Data Service for public cloud compute and Apache Spark.


Databricks announced that they're adding support for machine learning with GPUs. Compared to pure Spark/Scala code, running TensorFlow on GPUs via Databricks can save time and money.

Qubole announced that its support for the Airflow workflow engine is now generally available.

Apache Bahir, which is a library providing extensions for Apache Spark, announced release 2.0.1. Built to run on Spark 2.0.1, it adds support for Akka, MQTT, Twitter, and ZeroMQ streaming.

Version 1.9.0 of Apache Parquet MR was released. It includes a number of bug fixes and improvements. There are also some small new features, such as support for delta encoding of 64-bit integers.
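For anyone unfamiliar with delta encoding, here's a simplified Python sketch of the basic idea: store the first value, then only the differences between neighbors. It is not Parquet's on-disk format, which also packs the deltas into compact bit fields; the function names and sample timestamps are made up for illustration.

```python
from itertools import accumulate


def delta_encode(values):
    # Keep the first value as-is, then store each difference from the previous value.
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    # A running sum over the deltas restores the original values.
    return list(accumulate(deltas))


# Millisecond timestamps: large values, small gaps between neighbors.
timestamps = [1477785600000, 1477785600250, 1477785600500, 1477785601000]
encoded = delta_encode(timestamps)
decoded = delta_decode(encoded)
```

The payoff is that sorted or slowly changing 64-bit columns turn into mostly small numbers, which need far fewer bits to store.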

Amazon EMR 5.0.3 was released with updates to Hadoop, Presto, and Spark.


Curated by Datadog



Stream Processing Meetup at LinkedIn (Sunnyvale) - Wednesday, November 2

Ingest at Intuit + An Intro to StreamSets Dataflow Performance Manager (Mountain View) - Thursday, November 3


Integrating Real-Time Data Streams with Spark and Kafka (Centennial) - Tuesday, November 1


Hands-On Intro to Apache Spark for Data Engineers, Data Scientist, and Developers (Farmers Branch) - Tuesday, November 1

SQL on Hadoop Meetup: Open Source Presto Query Engine (Austin) - Wednesday, November 2


Meet with Scott Gnau, CTO of Hortonworks, on the Future of Data (London) - Tuesday, November 1

November HUGUK Meetup (London) - Thursday, November 3

Apache Flink: State of the Union and What's Next (London) - Thursday, November 3

29th Big Data London Meet-Up at Big Data LDN (London) - Thursday, November 3


Big Data, No Fluff: Let’s Get Started with Hadoop #10 (Oslo) - Thursday, November 3


Spark Meetup with Databricks, Criteo, Qucit, Talend (Paris) - Wednesday, November 2


Spark Streaming + Testable ETL (Vilnius) - Wednesday, November 2


Cloudera Sessions (Dubai) - Tuesday, November 1


Flink Meetup (Hangzhou) - Saturday, November 5


Rethink SQL for Big Data with Apache Drill (Sydney) - Thursday, November 3