Data Eng Weekly

Hadoop Weekly Issue #236

15 October 2017

Technical highlights in this week's issue include a post on cross-data center replication with Apache Pulsar (incubating) and a new SQL stream processing tool from Uber called AthenaX. There are also a number of benchmarks in this week's issue and several conference-related announcements (videos posted from Big Data LA, CFP for Big Data Technology Warsaw, and a preview of Spark Summit EU).


Hortonworks uses 21,000 compute hours a day to run tests for Hadoop ecosystem projects. The majority of these are nightly automated tests, but there are four testing environments in total. This post describes more about the testing stages.

Fivetran, which builds a product for syncing data into a data warehouse, has written up a benchmark comparing Redshift, Snowflake, and BigQuery. It shows that for a typical client use case (lots of tables), all the systems behave similarly. While one should always be careful about reading too much into benchmarks, in this case they also comment on several previous benchmarks to explain why their analysis differs.

In another benchmark, the Cloudera blog considers the performance characteristics of the Azure Data Lake Store (ADLS). Ultimately, they find that ADLS is comparable to Azure disk storage when throughput is measured with the Teragen/Terasort/Teravalidate benchmark.

This post describes how to perform cross-data center replication with Apache Pulsar (incubating). It describes the commands needed to set up replication, how to override it on a per-application basis, how to monitor, how to limit replication bandwidth, and more.

After hello world, the "twitter data" tutorial seems to be the next example used for most streaming systems. In this case, Confluent has a demo of analyzing tweets in Kafka using KSQL.

One final benchmark for this week—Databricks has compared Spark streaming to Flink and Kafka Streams.

Qubole has a post that shows how to use PyHive from a Jupyter notebook to run queries against a Hive or Presto cluster. While parts of the post are Qubole specific, the instructions should work as long as HiveServer2 is enabled on your cluster.
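As a rough illustration of the kind of notebook cell this involves (not taken from the Qubole post), here is a minimal sketch. The helper names `connect_hive` and `run_query`, and any host, port, or username values passed to them, are assumptions for illustration; `hive.connect` and the cursor methods are PyHive's standard DB-API interface.

```python
# Sketch only: helper names, host, port, and username are placeholders, not values from the post.

def connect_hive(host, port=10000, username=None):
    """Open a HiveServer2 connection with PyHive (needs the pyhive package installed)."""
    from pyhive import hive  # imported lazily so run_query works without PyHive
    return hive.connect(host=host, port=port, username=username)


def run_query(conn, sql):
    """Run a query on any DB-API connection and return (column_names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # description holds one tuple per result column; the name comes first
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        cursor.close()
```

In a notebook, something like `run_query(connect_hive("hive.example.com"), "SHOW TABLES")` would return the column names and rows; for Presto, `pyhive.presto.connect` plays the same role as `hive.connect`.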

This tutorial is a tour of both tools and machine learning models for ultimately predicting whether a song will make the Billboard top 10. On the tools front, RStudio,, and Amazon Athena are used for analysis with GLM, GBM, and deep learning models. The walkthrough includes a bunch of code and example output.

Kontena has a post on using Kafka with microservices. It provides a good overview of how Kafka's distributed log semantics and data retention policies can power event sourcing microservices.
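To make the event sourcing idea concrete, here is a minimal in-memory sketch, not code from the Kontena post and not a Kafka client. `EventLog` is a hypothetical stand-in for a single topic partition, and the account events are made up for illustration. State is never stored directly; it is rebuilt by replaying the append-only log from the first offset.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for one Kafka topic partition."""
    events: list = field(default_factory=list)

    def append(self, event):
        """Append an event and return its offset."""
        self.events.append(event)
        return len(self.events) - 1

    def read_from(self, offset=0):
        """Return all events at or after the given offset."""
        return self.events[offset:]


def apply_event(balances, event):
    """Fold one event into the account-balance view."""
    delta = event["amount"] if event["type"] == "deposited" else -event["amount"]
    balances[event["account"]] = balances.get(event["account"], 0) + delta
    return balances


def rebuild(log):
    """Rebuild current state by replaying the whole log from offset 0."""
    balances = {}
    for event in log.read_from(0):
        apply_event(balances, event)
    return balances


log = EventLog()
log.append({"account": "a1", "type": "deposited", "amount": 100})
log.append({"account": "a1", "type": "withdrawn", "amount": 30})
log.append({"account": "b2", "type": "deposited", "amount": 5})
print(rebuild(log))  # {'a1': 70, 'b2': 5}
```

Replaying from offset 0 only works while the early events are still in the log, which is why the topic's retention settings matter for this pattern.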


Videos from Big Data LA, which took place in August, have been posted online.

The Roaring Elephant podcast has an episode this week that recaps a number of sessions from the recent DataWorks Summit in Sydney. Topics covered include Hadoop 3.0, SparkR, and Kerberos.

The Call for Papers for Big Data Technology Warsaw, which takes place in February, ends tomorrow. This post has more about what to expect from the conference.

Spark Summit EU takes place in just over a week in Dublin. The Databricks blog has more details and a discount code if you haven't yet signed up.

A CVE for Apache NiFi was announced. Some versions prior to 1.4.0 (released earlier this month) are subject to an XML External Entity attack.

A CVE for Apache ZooKeeper was also announced. In this case, certain commands can cause a denial of service. More details about affected versions and mitigation are in the announcement.


Uber has open-sourced their stream processing framework AthenaX, which allows users to write stream jobs in SQL. The announcement describes the motivation and the high-level architecture of AthenaX, which is built on Apache Flink and Apache Kafka. It also uses Apache YARN for deployment. Uber has over 220 applications in multiple data centers running via AthenaX.

Apache Phoenix 4.12 was released. It includes resolution of over 100 issues, a new approximate count distinct function, a new table sampling feature, improvements to secondary indexing, and more.

Cloudera Director 2.6 was released. Among the changes are new TLS capabilities, improvements to SSH key handling, and improvements to the Microsoft Azure plugin.

Gluon is a new machine learning library from AWS and Microsoft. It's available today as part of Apache MXNet.


Curated by Datadog



TensorFlow on Apache Hadoop YARN (San Francisco) - Wednesday, October 18

Apache Ignite: Building Consistent, HA Distributed Systems & Memory-Centric SQL (San Ramon) - Thursday, October 19

The Evolving Landscape of Data Engineering & How Systems Fail (San Francisco) - Thursday, October 19


Spark MLlib (Salt Lake City) - Monday, October 16


KSQL and Data Management with Kafka (Minneapolis) - Wednesday, October 18


Leveraging Messaging Platforms Such as Kafka for Real-Time Streaming Transaction (Atlanta) - Thursday, October 19

North Carolina

Real-World Deployments with Apache Kafka (Raleigh) - Tuesday, October 17


Spring for Apache Kafka (Richmond) - Wednesday, October 18

New York

Exactly-Once Semantics in Apache Kafka (New York) - Monday, October 16

IRELAND

Let's Talk Kafka Streams (Dublin) - Thursday, October 19


Stream SQL and Real-Time Applications with Apache Flink (London) - Wednesday, October 18


Big Data Analytics for Small- and Medium-Size Enterprises (Oslo) - Tuesday, October 17


Helsinki Apache Kafka Meetup October (Helsinki) - Wednesday, October 18


Stream SQL and Real-Time Applications with Apache Flink (Neuilly-Sur-Seine) - Thursday, October 19


Stream Processing and War Stories (Berlin) - Thursday, October 19


Apache Kafka Streams: Building Distributed, Fault-Tolerant Processing Apps (Tel Aviv-Yafo) - Wednesday, October 18