Data Eng Weekly

Hadoop Weekly Issue #190

16 October 2016

Welcome to a double issue of Hadoop Weekly. After a week off, there's a ton of content (mostly tech posts) covering everything from Spark to Hive to Kafka. There's quite a bit of breadth in this week's issue, so there should be something for everyone!


This post looks at the performance improvements in Spark 2.0 by comparing query speed between 1.6 and 2.0. By looking at the execution plan and a flame graph, it's possible to see the effects of code generation, which make the computation about 7x faster (2 vs. 20 mins).

This post provides a tutorial for some of the graph computation APIs available in Spark. To compute PageRank, the example makes use of GraphFrames, GraphX, DataFrames, SparkSQL, and more.
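
For anyone who hasn't seen PageRank spelled out, here is a minimal sketch of the iteration in plain Python. It is not the GraphFrames or GraphX API that the post uses; the function name, the edge-list input, and the default parameters are assumptions made for illustration, and the scores are normalized to sum to 1, so they are not expected to match Spark's output numerically.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges (illustrative)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its rank along each out-edge.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


if __name__ == "__main__":
    demo_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    for node, score in sorted(pagerank(demo_edges).items()):
        print(node, round(score, 4))
```

The graph libraries in the tutorial distribute the same kind of update across a cluster; the sketch only shows what is being computed on each pass.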

The Hortonworks blog has a post describing the evolution of support for Amazon S3 in Apache Hadoop. Notably, a number of performance improvements have been made on the read path as part of the "s3a" implementation that was added in 2014. Optimization of the write path is the next focus area for this library. As someone who has worked with Hadoop and S3, I'm really excited about this progress.

The AWS Big Data blog has a tutorial (with code) that builds a data pipeline using AWS Lambda, Amazon EMR, and Amazon Redshift to do batch analytics. All steps in the pipeline are event-driven—e.g. as data files arrive, they're processed by AWS Lambda.

This post describes a project to index tweets related to the recent Strata+Hadoop World in real time using Hortonworks Data Flow (Apache NiFi). Tweets are sent to HBase via Apache Phoenix for analysis via Zeppelin (including sentiment analysis using TensorFlow). The data is also sent to Slack via a NiFi plugin.

Hortonworks has published an article about Hive's move to "Live Long and Process" (LLAP, i.e. long-lived daemons) and improved memory optimization strategies. With these, they demonstrate a 2-9x speedup on TPC-DS queries. Separately, Hortonworks has published a comparison of Apache Impala (incubating) and Apache Hive on the same data set. As always, consider recreating benchmarks on your own data set. But with that caveat in mind, the experiment showed that Hive could complete more of the test queries in under 10 minutes than Impala (and that Impala timed out on a few).

Apache Kafka 0.9.0 introduced a new consumer API, which can be enabled for Kafka MirrorMaker by setting a flag. This post describes the major gotcha involved in switching to the new consumer for MirrorMaker and some options for dealing with the migration.

This post provides a useful demonstration of the semantics of adding or removing processes that are part of the same Kafka Streams application.

If you thought the flame graphs from the post on Spark 2.0 speedups look useful, then check out this post that describes how to get the JVM to record (using commercial features) data to generate your own flame graphs for Spark applications.

Hortonworks recently hosted a webinar on Hortonworks Data Flow. Based on the response, they've posted FAQs that cover topics related to Apache NiFi like performance, compatibility with the existing Java messaging ecosystem, and support for data stored in a relational database.

Confluent's monthly Log Compaction series covers a number of the improvements underway in Apache Kafka. These include a proposal to add headers to Kafka messages and a mechanism to search for messages based on timestamp. The post also links to a number of recent talks from the Kafka ecosystem and has more information about the upcoming release.

The Databricks blog describes two use cases for AWS Lambda with a big data system in AWS (with an angle towards the Databricks service). The first is using Lambda's integration with S3 to trigger a batch job as data arrives. The second is using Lambda as a lightweight and scalable way to serve k/v data after it's been generated via a batch job and stored in a database like Riak.
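
To make the first use case concrete, here is a rough sketch of a Python Lambda handler that reacts to S3 event notifications. The Records/s3/bucket/object layout is the standard S3 notification format, but `submit_batch_job` is a hypothetical placeholder: the post's actual call into the Databricks service is not reproduced here.

```python
from urllib.parse import unquote_plus


def submit_batch_job(bucket, key):
    """Hypothetical stand-in for whatever call starts the batch job."""
    return {"input": "s3://{}/{}".format(bucket, key)}


def handler(event, context):
    """Entry point that Lambda invokes with an S3 event notification."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        jobs.append(submit_batch_job(bucket, key))
    return jobs
```

Keeping the handler this thin leaves the heavy lifting to the batch system, which is the division of labor the post argues for.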

The MapR blog features a whiteboard walkthrough (video+transcript) from MapR engineer and Apache Drill PMC member Parth Chandra. The post describes Drill's read pipeline for Apache Parquet files, with interesting details on read optimizations and data locality/row group size.

Oracle GoldenGate is a tool for replaying the OracleDB transaction log to replicate data to Apache Kafka. This post describes how to use the GoldenGate Connector for Kafka Connect along with the Elasticsearch Connector to replicate changes to an ES cluster. The article includes a complete walkthrough of the configuration and commands needed to get started.

This post describes how to use StreamSets Data Collector to stream NetFlow data to Apache Kafka and on to Apache Kudu. Once there, the data can be queried for analysis; the post includes code for an example D3-based visualization.

While Hadoop streaming (the ability to run MapReduce jobs in non-JVM languages) has been around for a long time, this is the first time I've seen a tutorial that focuses on Node.js. In addition to the basics, it describes how to package dependencies for a job using npm and HDFS.
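
If you haven't used Hadoop streaming before, the contract is simple: mappers and reducers read lines on stdin and write tab-separated key/value lines on stdout, and the framework sorts by key in between. The tutorial works in Node.js; the sketch below shows the same contract in Python with a word count. The function names and the "map"/"reduce" command-line argument are assumptions made for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield "{}\t1".format(word.lower())


def reduce_lines(lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "{}\t{}".format(word, sum(int(count) for _, count in group))


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    # Run as 'script.py map' or 'script.py reduce' with data on stdin.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for output_line in step(sys.stdin):
        print(output_line)
```

Locally, piping the map step through `sort` and into the reduce step mimics what the framework does between the two phases.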


Every week it seems like there's a new addition to the Hadoop ecosystem, so I agree with the premise of this article that it's time to retire some tools. The specific list the author puts forward—MapReduce, Storm, Pig, Java, Tez, Oozie, and Flume—is somewhat controversial (there are plenty of big companies invested in Pig, Storm, and Oozie), but it seems like a safe bet that we'll see consolidation in 2017.

BlueData and HPE have announced a partnership in which BlueData's Docker-based EPIC system can be used for big data deployments.

Big Data Tech Warsaw was announced last week. The conference is February 9th.


Apache Impala 2.7.0-incubating was released last week.

Apache Flink 1.1.3 was released to address a number of bugs.

Apache Hadoop 2.6.5 was released. It includes over 75 resolved issues since the 2.6.4 release.

StreamSets has released a new version of their Data Collector with a new package manager, support for MapR FS as an origin, support for the Confluent Schema Registry, and support for Elasticsearch 2.4.

Apache Kudu 1.0.1 was released with seven bug fixes.


Curated by Datadog



Making Data Science Easy, Part 1: Hadoop (Costa Mesa) - Tuesday, October 18

#SDBigData Meetup #18 (San Diego) - Wednesday, October 19

54th Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, October 19

NoSQL vs Hadoop Ecosystem, Part II (San Carlos) - Wednesday, October 19

Install and Admin of Apache HAWQ on Hortonworks with Apache Ambari (San Francisco) - Wednesday, October 19

Robust Stream Processing with Apache Flink (San Francisco) - Thursday, October 20

Sentiment Analysis on Twitter Data: Hadoop, Spark, NoSQL (San Francisco) - Thursday, October 20

Install and Admin of Apache HAWQ on Hortonworks with Apache Ambari (Santa Clara) - Thursday, October 20


Design Patterns for Working with Fast Data in Kafka (Portland) - Tuesday, October 18


Integrating Real-Time Video Data Streams with Spark and Kafka (Bellevue) - Wednesday, October 19


Return Path and Qubole Big Data Use Cases (Boulder) - Thursday, October 20


Introducing Sparklyr: Big Data with R and Spark (Austin) - Wednesday, October 19


Apache Beam on Apache Flink (Chicago) - Tuesday, October 18

North Carolina

Data Sig: Kafka for .NET Developers (Morrisville) - Wednesday, October 19

New Jersey

Apache Kudu: Structured Analytics => No HDFS and HBase? (Hamilton) - Thursday, October 20


Text Classification Using Spark Machine Learning (Cambridge) - Thursday, October 20


Spark on HBase: The Current State of the Art (Montreal) - Tuesday, October 18


Spark Streaming and Scaling (London) - Thursday, October 20


Reactive Integrations with Akka Streams That Just Work! (Stockholm) - Tuesday, October 18


A Deep Dive Into Spark 2.0 + Unit Testing (Amsterdam) - Thursday, October 20


Building a Fully-Automated Fast Data Platform (Cologne) - Tuesday, October 18


Hadoop User Group Meetup (Vienna) - Tuesday, October 18


Real-Time Training and Deploying Spark ML Recommendations (Athens) - Monday, October 17