Data Eng Weekly

Hadoop Weekly Issue #203

05 February 2017

Streaming is a hot topic this week with posts on Kafka Streams, Oracle Data Integrator, StreamSets Collector, and Amazon Kinesis Analytics. Also, there are great posts on performance of various data formats/systems and integrating Spark with Kudu.


Sky Betting and Gaming has written about their streaming infrastructure. Not long ago, it was built on Spark Streaming and Drools. Recently, they moved to Kafka Connect and Kafka Streams. The streaming applications in their use case require data sources outside of Kafka (such as lookup data in HBase). As is noted, Kafka Streams doesn't support data sources other than Kafka, so they had to write a custom integration using Akka.

CERN has published a performance comparison of Apache Avro, Apache Parquet, Apache HBase, and Apache Kudu for querying and analyzing the ATLAS EventIndex, a catalog of collision events recorded at the Large Hadron Collider. The post describes space utilization, ingestion rate, random lookup latency, data scan rates, and provides a number of lessons learned. If you're considering a similar use case or any of these systems, this post provides a lot to chew on.

This post describes how to hook up HBase metrics to Prometheus, the open-source monitoring system with Grafana integration. Metrics are exported by way of the Prometheus JMX exporter, which runs as a Java Agent and is configured via a simple YAML file.
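For reference, running the exporter amounts to attaching the agent jar to the HBase JVM and pointing it at a rules file. A minimal sketch (the port, file paths, and rule pattern here are illustrative assumptions, not taken from the post):

```yaml
# Attach the agent to the region server JVM, e.g. in hbase-env.sh (paths assumed):
#   export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
#     -javaagent:/opt/jmx_prometheus_javaagent.jar=9100:/etc/hbase/jmx_exporter.yaml"
lowercaseOutputName: true
rules:
  # Map Hadoop-style JMX bean names onto Prometheus metric names.
  - pattern: 'Hadoop<service=HBase, name=RegionServer, sub=Server><>(\w+)'
    name: hbase_regionserver_server_$1
```

Prometheus then scrapes the chosen port like any other target.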

Qubole continues to innovate in the Hadoop-in-the-cloud space. This time, they've added the ability to dynamically grow the size of HDFS without adding more nodes by utilizing EBS volumes and the Linux Logical Volume Manager. If you're running HDFS in the cloud, replicating this setup is likely a good way to keep costs down on storage-limited workloads.

The Amazon Big Data blog has a tutorial describing how to configure an Amazon EMR cluster for encryption in transit (to/from S3 and during MapReduce shuffle) and at rest (in S3 and on local disk). Much of the work to do this is related to configuring encryption keys, which is done using the Amazon Key Management Service.
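In EMR, this kind of setup is expressed as a security configuration. A hedged sketch of the JSON shape (the bucket, certificate location, and KMS key ARN are placeholders):

```json
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": true,
    "EnableAtRestEncryption": true,
    "InTransitEncryptionConfiguration": {
      "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "s3://my-bucket/certs.zip"
      }
    },
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "SSE-KMS",
        "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/example"
      },
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/example"
      }
    }
  }
}
```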

The Cloudera blog has a walkthrough demonstrating the integration between Apache Kudu and Apache Spark. There are a number of code snippets (written in Scala) demonstrating the DataFrame integration (which includes support for inserts, upserts, and updates), the native Kudu RDD, and more. Of note, the integration includes support for Kudu's predicate pushdown via the DataFrame APIs.

Doing a data migration of a real-time system with zero downtime can be tricky. This post describes how Stripe has done data migrations (including great visualizations of data flows). To reduce load on production systems, the bulk queries to migrate data are done via a MapReduce job (written in Scalding) rather than executed directly against the production database.
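A sketch of the dual-write-then-backfill pattern that's in the same spirit as Stripe's approach (the in-memory dicts are stand-ins for the real databases, and the phase numbering is illustrative):

```python
# Zero-downtime migration sketch: dual-write to both stores, backfill
# old rows in bulk, then cut reads over to the new store.
old_store = {1: "alice", 2: "bob"}   # pre-existing rows
new_store = {}

def write(key, value):
    """Phase 1: every live write goes to both stores."""
    old_store[key] = value
    new_store[key] = value

def backfill():
    """Phase 2: bulk-copy rows the dual-writes haven't covered
    (this is the step Stripe runs as a Scalding MapReduce job to
    spare the production database)."""
    for key, value in old_store.items():
        new_store.setdefault(key, value)

write(3, "carol")   # dual-written to both stores
backfill()          # copies rows 1 and 2
assert new_store == old_store  # Phase 3: safe to read from new_store
```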

This article gives an example of extending Spark MLlib to add additional stages to a Spark ML Pipeline. The post gets into the specifics of the ML pipeline APIs, describes some practical considerations (like configuration params and caching for iterative algorithms), and has quite a bit of sample code.

Using PySpark with a dataset of NBA player statistics, this post gives a great introduction to several features of Spark as well as integration with Pandas and matplotlib.

This post walks through configuring Oracle Data Integrator with Apache Kafka/MapR Streams to capture changes made to a MySQL database as a stream. Configuration is done through a combination of config files and a configuration UI.

On the topic of database change capture, StreamSets supports polling-based capture of streaming data using JDBC. This post walks through capturing changes in a MySQL database and streaming them to HDFS/Hive/Impala, removing certain PII fields along the way.
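The core polling idea is simple enough to sketch without StreamSets itself: repeatedly query for rows past a stored offset column and drop the PII fields before handing records downstream. This uses the stdlib sqlite3 module as a stand-in for a JDBC connection, and the table/column names are made up for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)")
db.execute("INSERT INTO users (name, ssn) VALUES ('alice', '123-45-6789')")
db.commit()

last_offset = 0  # highest id seen so far (the "offset column")

def poll_changes():
    """Fetch rows newer than the stored offset, dropping the PII column."""
    global last_offset
    rows = db.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id", (last_offset,)
    ).fetchall()
    if rows:
        last_offset = rows[-1][0]   # advance the offset past what we've read
    return [{"id": r[0], "name": r[1]} for r in rows]  # no 'ssn' field emitted

batch1 = poll_changes()   # picks up alice
db.execute("INSERT INTO users (name, ssn) VALUES ('bob', '987-65-4321')")
db.commit()
batch2 = poll_changes()   # picks up only the newly inserted row
```

The trade-off versus log-based capture is that polling an offset column sees inserts (and updates, if polling a timestamp) but not deletes.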

Amazon Kinesis Analytics is a hosted stream processing framework that uses a SQL-like syntax. This post shows how to use it to analyze Apache web server log data that's ingested into a Kinesis stream.
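To give a flavor of the syntax, a hedged sketch of a Kinesis Analytics query (the output stream and column names are assumptions; `SOURCE_SQL_STREAM_001` is the default name for the mapped input stream):

```sql
-- Count requests per HTTP status code over one-minute tumbling windows.
CREATE OR REPLACE STREAM "STATUS_COUNTS" (status INTEGER, request_count INTEGER);

CREATE OR REPLACE PUMP "STATUS_PUMP" AS
  INSERT INTO "STATUS_COUNTS"
  SELECT STREAM status, COUNT(*)
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY status,
           FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);
```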

Writing and deploying user defined functions for systems like Hive is often an involved and error-prone process. With SparkSQL, on the other hand, SQL is executed from within Scala/Java/Python code. This makes it relatively straightforward to implement a user-defined function that's used by a SQL query. This post shows how and gives some examples.
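The registration pattern is compact enough to show inline. Since Spark isn't assumed to be installed here, this sketch demonstrates the same host-language-UDF idea with Python's stdlib sqlite3 module as a stand-in, with the rough PySpark equivalent in the comments:

```python
import sqlite3

# In Spark SQL the registration is a one-liner from the host language, e.g.:
#   spark.udf.register("squared", lambda x: x * x)
#   spark.sql("SELECT squared(value) FROM t")
# The sqlite3 module exposes the same pattern: register a plain Python
# function under a SQL name, then call it from a query.
db = sqlite3.connect(":memory:")
db.create_function("squared", 1, lambda x: x * x)  # name, arity, function
result = db.execute("SELECT squared(7)").fetchone()[0]
```

Either way, there's no separate packaging or deployment step for the function, which is the contrast the post draws with Hive UDFs.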


Syncsort's expert interview series has a three-part interview with Confluent cofounder and CTO Neha Narkhede. The interview covers the business of Kafka and the origins of Confluent, describes the Confluent Schema Registry, and has a discussion of the network of women in big data (including the emphasis on diversity and inclusion at Confluent).

If you were wondering if Apache Kafka is taking off or not, Confluent has put out a press release about their impressive growth in 2016. In addition to company results, they highlight that Kafka has a lot of adoption in banking, insurance, telecom, and travel.


Apache Atlas 0.7.1-incubating was announced. There are a large number of bug fixes and several minor improvements included in the release.

Cloudera Enterprise 5.10 was released with GA support for Apache Kudu, improved cloud performance, improved governance for data in Amazon S3, and more.

Hortonworks Data Cloud for AWS version 1.11 was released with support for compute nodes and spot instances as well as node recipes for customizing server setup.

A new version of the StreamSets Data Collector was released. Highlights of the release include multithreaded pipelines, multi-table copy support, MongoDB change data capture, and HTTP API support for Elasticsearch.

Apache Bahir, which provides extensions for Apache Spark, released version 2.0.2.


Curated by Datadog



Apache Kafka and Kafka Streams (Minneapolis) - Tuesday, February 7


ETL Using MapReduce in Java + More (Winter Park) - Monday, February 6

New York

Apache Spark Fine-Grained Security with Apache Ranger + SparkR Updates (New York) - Monday, February 6

Crunching Streams of Data: An Introduction to Akka Streams (New York) - Thursday, February 9


Building Real-Time Data Pipelines with Spark (Boston) - Monday, February 6

Pre-Spark Summit East: Presentations + Q&A (Boston) - Tuesday, February 7

Introduction to Spark Structured Streaming (Cambridge) - Thursday, February 9

The Future of Spark on Cloud Storage + Rapid Miner (Boston) - Thursday, February 9


Data Governance in Hadoop Environments (Kontich) - Wednesday, February 8


Pre-Spark Summit East: Behind the Scenes and Exception Handling (Amsterdam) - Wednesday, February 8

Apache Spark & Scala at Scale (Capelle aan den IJssel) - Friday, February 10


YARN Scheduling (Bangalore) - Saturday, February 11