Data Eng Weekly

Hadoop Weekly Issue #167

25 April 2016

Welcome to a special Monday edition of Hadoop Weekly. There's lots of great technical content this week from Spark to Kafka to Beam to Kudu. If you're looking for something even more bleeding edge than some of those technologies, Apache Metron (incubating) had its first release. Metron, which is a general-purpose security system built on Hadoop, is a project to keep an eye on going forward.


This presentation serves as a guide to building a stream processing system in AWS. It describes relatively simple solutions such as Amazon Kinesis with AWS Lambda and the Kinesis S3 connector as well as more complex solutions for real-time analytics that make use of many AWS services.

This post describes how to use Spark Testing Base, which is a testing framework for Spark written in Scala, from Java. The example code shows how to refactor Spark code to isolate the logic to test as well as how to deal with some of the gnarly Scala APIs from Java.
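As a rough illustration of the "isolate the logic to test" idea, here is a minimal sketch in plain Python rather than the Java and Spark Testing Base code from the post. The function name `normalize` is hypothetical; the point is only that per-record logic kept in a plain function can be unit tested without starting Spark.

```python
from typing import List


def normalize(line: str) -> List[str]:
    """Pure per-record logic: lowercase a line and split it into words.

    Nothing here depends on Spark, so it can be tested directly.
    """
    return line.lower().split()


# In a PySpark job the same function would be handed to a transformation,
# for example: words = lines_rdd.flatMap(normalize)
# (lines_rdd is an assumed RDD of strings, not something defined here).

if __name__ == "__main__":
    print(normalize("Hello Spark World"))
```

Keeping the Spark wiring thin means a testing framework is only needed for the parts that actually touch RDDs or DataFrames.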

The Altiscale blog has an overview of the pros and cons of building thin and uber jars when working with Spark. There are examples of building both types in Maven and SBT.

LinkedIn has posted about their Kafka ecosystem, which includes a special Kafka producer, a REST API for non-Java clients, monitoring, an Avro schema registry, Gobblin (a tool for loading data to Hadoop), and more.

This tutorial on Spark Streaming shows how to pull tweets using the twitter4j API, filter based on hashtag, and perform sentiment analysis on the tweets as they're processed.
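The filter-then-score flow can be sketched in a few lines of plain Python. This is not the tutorial's code and uses neither Spark Streaming nor twitter4j: the tweets are an ordinary iterable of strings, and the word lists and function names are hypothetical stand-ins for whatever sentiment model the tutorial uses.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical word lists for illustration only; a real pipeline would
# use a trained sentiment model or a proper lexicon.
POSITIVE = {"great", "love", "fast", "awesome"}
NEGATIVE = {"slow", "hate", "broken", "awful"}


def has_hashtag(text: str, hashtag: str) -> bool:
    """Return True if the text contains the hashtag (case-insensitive)."""
    tags = {t.lower() for t in re.findall(r"#(\w+)", text)}
    return hashtag.lower().lstrip("#") in tags


def sentiment(text: str) -> int:
    """Score text as (count of positive words) - (count of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def score_stream(tweets: Iterable[str], hashtag: str) -> Iterator[Tuple[str, int]]:
    """Lazily yield (tweet, score) for tweets that carry the hashtag."""
    for text in tweets:
        if has_hashtag(text, hashtag):
            yield text, sentiment(text)


if __name__ == "__main__":
    sample = [
        "I love #Spark, it is fast",
        "#spark streaming is slow and broken",
        "great weather #today",
    ]
    for tweet, score in score_stream(sample, "#spark"):
        print(score, tweet)
```

Because `score_stream` is a generator, tweets are scored one at a time as they arrive, which loosely mirrors the per-record processing the tutorial describes.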

Apache Kudu (incubating) is an exciting companion to Apache Impala (incubating) because it can efficiently answer both broad analytics and very targeted queries. This post describes the technical details of the integration, how Kudu's design provides efficient querying capabilities, how to perform write/update/delete operations with Impala and Kudu, and more.

MapR has a post about using spark-sklearn to scale out an existing scikit-learn model. It walks through building a model from the Inside Airbnb dataset and describes how to plug in spark-sklearn for cross validation.

The AWS big data blog has a tutorial describing how to use HBase and Hive with Amazon EMR. The post includes an introduction to HBase, describes how to restore an HBase table from S3, demonstrates Hive and HBase integration, and more.

This post describes some of the challenges in providing real-world experience to students taking a big data course. The author has gone through several iterations and options and seems to have finally landed on a good solution—Altiscale's Hadoop-as-a-Service.

The Cloudera blog has a guest post in which the author compares Parquet and Avro across two data sets: one that's narrow (3 columns) and one that's wide (103 columns). Using test queries and operations in Spark and Spark SQL, the author finds that queries against Parquet and Avro data sometimes perform similarly, although in many cases queries against Parquet data are much faster and the serialized data is much smaller.

This article describes how to use SparkR with a distribution, like CDH, that doesn't officially support it. By leveraging YARN and locally installed R packages on the workers, jobs can be executed with little additional work.

There have been a number of open-source frameworks to execute MapReduce and similar jobs with a higher-level programming model. Historically, these have been tied to individual execution frameworks (e.g. MapReduce, Storm), but there's recently been work to make them agnostic. Apache Beam (incubating) aims to take that even further, generalizing across execution models for both batch and streaming and offering built-in support for complex compute models.

The Apache blog has a 7-part series presenting experimental results for HBase write throughput across HDD, SSD, and RAMDISK. In performing the analysis, the authors uncovered a few issues in HBase and HDFS and proposed fixes for them.


Tom White, the author of "Hadoop: The Definitive Guide," has written about how he became involved in Apache Hadoop. His early contributions were around integrating Hadoop with Amazon Web Services, which has been an important part of the project's success.

Fluo, which is a distributed processing engine for Apache Accumulo, has been submitted to the Apache incubator.

A new conference for Apache Phoenix, the SQL-on-HBase system, has been announced for the day after HBaseCon. The half-day conference will feature tracks on Phoenix internals and use cases.


Apache Metron, a security framework built on Hadoop, has released version 0.1. Hortonworks is supporting it as a tech preview, and has written about the features, how to get started, how to contribute, how to use the Metron UI, and more.

Apache NiFi 0.6.1 was released this week. It's a bug fix release that addresses just over 10 bugs.

Apache Flink 1.0.2 was released this week. The new release includes bug fixes, a performance improvement when using RocksDB, and several improvements to documentation.

Amazon has announced a new version of Amazon EMR with support for HBase 1.2.


Curated by Datadog



Spark 101 (San Francisco) - Tuesday, April 26

Big Data Application Meetup (Palo Alto) - Wednesday, April 27

Tackling Data Challenges at Netflix and Twitter (Los Gatos) - Wednesday, April 27

Apache Flink Technical Deep Dive w/ Stephan Ewen! (Palo Alto) - Thursday, April 28

Spark with Couchbase to Electrify Your Data Processing (Santa Monica) - Thursday, April 28


"What Is All the Hype about Apache Spark" (Denver) - Tuesday, April 26


Data Science @ Blue Coat - Chris Larsen Speaking (Draper) - Thursday, April 28


Big Data Architecture for O&G (Houston) - Tuesday, April 26

Oil and Gas Use Case: Spin Up & Visualize (Addison) - Thursday, April 28


Overview and Demo of the Apache NiFi Project (Madison) - Tuesday, April 26


April Edition of MOHUG (Dublin) - Tuesday, April 26


How the Weather Company Leverages Billions of Data Points & Predictive Analytics (Atlanta) - Wednesday, April 27

HBase as a File System (Roswell) - Wednesday, April 27


Spark Saturday DC (McLean) - Saturday, April 30


Apache NiFi: Because It Ain’t Data Science without the Data (Laurel) - Wednesday, April 27

New York

Building Data Pipelines for Solr with Apache NiFi (New York) - Tuesday, April 26

Analysis of Streaming Sensor Data with Spark & Kafka on Bluemix (New York) - Wednesday, April 27


Integration of Apache Kafka with Apache Spark (Toronto) - Wednesday, April 27

April Meetup (Ottawa) - Wednesday, April 27


Apache Kudu Intro: Storage for Fast Analytics on Fast Data (London) - Thursday, April 28


Big Data, No Fluff: Let’s Get Started with Hadoop #7 (Oslo) - Thursday, April 28


Spark as the Catalyst for Advanced Analytics (Stockholm) - Wednesday, April 27


Spark Meetup at Criteo (Paris) - Thursday, April 28


Spark Streaming: Dealing with State, by Francois Garillot (Renens) - Thursday, April 28


Introducing Apache Ignite (Warsaw) - Tuesday, April 26

Tabular Data Analysis in Apache Spark Using DataFrames (Warsaw) - Wednesday, April 27


Introduction to Flink Streaming (Bangalore) - Saturday, April 30


Kafka and OrientDB (Sydney) - Tuesday, April 26