Data Eng Weekly

Hadoop Weekly Issue #224

16 July 2017

With summer taking off, the volume of content has slowed down a bit (please reach out if you do find something I should include!). This week's issue is again a short one. The highlights include new releases of Apache Spark, IBM Big SQL, and Cloudera Enterprise. In addition, there's a new implementation of Kafka Streams in Python, the schedule for Flink Forward is posted, and there are several technical posts covering Spark, Kafka, Kinesis, StreamSets, and HBase.


Although you should keep in mind that these benchmarks are from a vendor and don't necessarily extrapolate to other workloads, Databricks has shown that their platform is faster than vanilla Spark, Presto, and Impala.

This post compares and contrasts Apache Kafka and Amazon Kinesis. Since Kinesis is a SaaS product, it compares favorably in terms of operational complexity. With that said, it has more limitations on throughput (but at least it's predictable), and the two share similar sets of high-level APIs for moving data and doing analysis (e.g. Kafka Connect/Kinesis Firehose and Kafka Streams/Kinesis Analytics). The post also discusses architectural and pricing considerations.
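To make the "limited but predictable" point concrete, here is a rough shard-sizing sketch. It is not taken from the post. The per-shard limits below (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads) are the figures AWS documented for Kinesis Streams around the time of this issue, so treat them as assumptions and check the current documentation. The function name and parameters are hypothetical.

```python
import math

# Assumed per-shard limits (verify against current AWS documentation).
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(records_per_sec, avg_record_kb, consumers=1):
    """Estimate the shard count for a steady workload.

    The stream needs enough shards to satisfy whichever limit is hit first:
    write bytes, write record count, or total read bytes across all
    consumers sharing the per-shard read limit.
    """
    write_mb = records_per_sec * avg_record_kb / 1024
    by_write_bytes = math.ceil(write_mb / WRITE_MB_PER_SEC)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(write_mb * consumers / READ_MB_PER_SEC)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # Large records: the byte limit dominates.
    print(shards_needed(5000, 2))
    # Tiny records: the record-count limit dominates.
    print(shards_needed(5000, 0.1))
    # Many consumers: the read limit dominates.
    print(shards_needed(1000, 1, consumers=4))
```

Because the limits are fixed per shard, capacity planning reduces to this kind of arithmetic, whereas Kafka throughput depends on the brokers, disks, and network you provision.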

In order to improve their product and provide operational support, Qubole collects metrics and data from a number of data sources across its product suite of Hive, YARN, and Presto. Collected data include events and metrics from the optimizer, the execution engine, the storage layer, and the cluster management layer. They load all of this data into QDS for analysis using a combination of Sqoop, Kafka, and a custom-built metrics service. The post has a few more details about the data they collect and what kinds of things they do with it.

The StreamSets blog has a post on deploying the StreamSets Data Collector (SDC) with Kubernetes. SDC can be deployed as a stateful application or it can rely on a cloud service (Dataflow Performance Manager) for storing state, so the example is a bit more interesting than some of the common Kubernetes tutorials out there.

This tutorial walks through loading the openFDA dataset into Spark, transforming it to Parquet files in S3, querying it with Amazon Athena, and using R and Python for additional analysis.

The Apache Software Foundation blog has a post describing several HBase application archetypes. It details four different use cases: storage of documents, graphs, queues, and metrics. For each, there is an example or two of how the schema/structure can be defined, as well as a reference to an application that is designed to use HBase in this way.
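As an illustration of what a schema for the metrics archetype can look like, here is a minimal row-key sketch. It is not taken from the post. It assumes a common time-series layout in which each row holds one hour of points for one metric, and the helper names, the numeric metric ID, and the one-hour bucket are all hypothetical. The one thing it relies on is that HBase sorts rows lexicographically by row-key bytes.

```python
import struct

BUCKET_SECS = 3600  # assumed: one row per metric per hour


def metric_row_key(metric_id, ts):
    """Row key = 4-byte metric ID + 8-byte hour-aligned timestamp.

    Big-endian packing keeps byte order equal to numeric order, so all rows
    for one metric are contiguous and ordered by time.
    """
    base = ts - (ts % BUCKET_SECS)
    return struct.pack(">IQ", metric_id, base)


def column_qualifier(ts):
    """Column qualifier = offset in seconds within the hour (2 bytes)."""
    return struct.pack(">H", ts % BUCKET_SECS)


if __name__ == "__main__":
    ts = 1500163200 + 125  # 125 seconds past an hour boundary
    print(metric_row_key(7, ts).hex(), column_qualifier(ts).hex())
```

With this layout, a time-range query for one metric becomes a single contiguous scan between two row keys.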


Two CVEs were announced for Apache Impala (incubating). The mitigation for both is to update to version 2.9.0.

The sessions for Flink Forward, which takes place in Berlin this September, have been posted.

Software Engineering Daily has an interview with Confluent co-founder and CTO Neha Narkhede. If you aren't interested in the podcast, there's also a PDF transcript.


Winton Kafka Streams is a Python implementation of the Apache Kafka Streams API. It's a new project that doesn't yet have a release, and it requires Python 3.6. There are a few examples in the repo that help demonstrate the API.

Apache Spark 2.2.0 was released. PySpark can now be installed in a single command from PyPI. The release notes have lots of further details, but other highlights include a new cost-based optimizer and the GA of Structured Streaming.

Cloudera Enterprise 5.12 has been released with updates to the core platform, the data science workbench, Kudu, and more.

IBM has announced Big SQL 5.0, which is the first version to be tightly integrated with Hortonworks HDP.

Hue 4.0 has been released with a new interface that revamps the editors and system browsers.


Curated by Datadog



Self-Service Data Integration, Kodiak Data, and Kubernetes! (Palo Alto) - Wednesday, July 19


Sensors, Spark, and Kafka: Applied Machine Learning (Minnetonka) - Wednesday, July 19


Data Streaming Panel Event (Chicago) - Thursday, July 20

North Carolina

SQL on Hadoop and Modern Analytic Databases with Ian Cook (Chapel Hill) - Tuesday, July 18


DATA Rave Party: Spark Edition (Sao Paulo) - Thursday, July 20


Applied Data Engineering #1 (London) - Wednesday, July 19


The State of Spark and Hive in the Cloud, by Nico Poggi (Barcelona) - Thursday, July 20


July Meetup: Scala, Spark, Docker, Elasticsearch (Bucharest) - Thursday, July 20


Big Data Processing on AWS (Tel Aviv-Yafo) - Tuesday, July 18

Kafka Operations (Herzeliyya) - Wednesday, July 19

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit