Data Eng Weekly

Hadoop Weekly Issue #152

10 January 2016

This week's issue has a lot of great technical content (including a bunch that I missed from December). Topics covered include performance testing of stream processing systems, new features in Apache Spark 1.6.0, and Apache Ranger. There's lots of great stuff demonstrating that 2016 is going to be an exciting year for the Hadoop ecosystem.


The Storm team at Yahoo has done a performance comparison of Flink, Storm, and Spark Streaming. The benchmark includes reading/deserializing JSON data, performing a filter and a join (with data from a Redis cluster), and windowing to count events and store them in Redis. For this use case, they measured throughput and latency on all three systems. The post describes some of the key configuration settings and evaluation details. It concludes that there isn't a clear winner but finds that Storm and Flink show sub-second latencies at high throughputs whereas Spark Streaming shows even higher throughput but at higher latencies.
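To make the shape of that benchmark pipeline concrete, here is a minimal single-process sketch in plain Python. It is not the benchmark's code: the field names (`ad_id`, `event_type`, `event_time`), the in-memory `AD_TO_CAMPAIGN` dictionary standing in for the Redis lookup, and the 10-second window are all assumptions made for illustration.

```python
import json
from collections import Counter

# Hypothetical stand-in for the lookup data the benchmark keeps in Redis.
AD_TO_CAMPAIGN = {"ad1": "campA", "ad2": "campB"}

# Assumed window size in seconds, chosen only for this illustration.
WINDOW_SECONDS = 10


def windowed_counts(raw_events, window_seconds=WINDOW_SECONDS):
    """Deserialize, filter, join, and count events per (campaign, window)."""
    counts = Counter()
    for raw in raw_events:
        event = json.loads(raw)                         # deserialize JSON
        if event["event_type"] != "view":               # filter step
            continue
        campaign = AD_TO_CAMPAIGN.get(event["ad_id"])   # join step
        if campaign is None:
            continue                                    # no match in lookup
        # Align the event time to the start of its tumbling window.
        window_start = event["event_time"] - event["event_time"] % window_seconds
        counts[(campaign, window_start)] += 1
    return dict(counts)
```

A real deployment would spread these steps across workers and write the counts back to a store. The sketch only shows the per-event logic whose throughput and latency such a benchmark measures.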

This post shows how to use Apache Drill to analyze astronomical data about the International Space Station and the Sun in order to identify the date of a picture of ISS solar transits. Drill supports some pretty complicated vector math in order to compute the relevant data points.

This post describes a few different mechanisms for setting up a Hadoop cluster on an Ubuntu server. It describes an install of Debian packages (from the Apache Bigtop project) via the hadoop-ppa, a build via Bigtop for running as a Docker container, and a dev setup using mrjob (a Python library) with Elastic MapReduce. For each, there are details of custom configs and instructions for running a simple MapReduce job to compute the value of pi.
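For readers who haven't seen a pi job before, the idea can be sketched as a map step that samples points and a reduce step that combines the tallies. The plain-Python version below is an illustration only, not the code from the post or the example bundled with Hadoop (which may use a different sampling scheme). The function names, seeds, and sample counts are hypothetical.

```python
import random


def mapper(seed, samples):
    """Sample points in the unit square; count those inside the quarter circle."""
    rng = random.Random(seed)  # a per-task seed keeps the sketch reproducible
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reducer(partials):
    """Combine (inside, total) pairs from all map tasks into a pi estimate."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps=4, samples_per_map=50_000):
    """Run the map tasks one after another, then reduce their outputs."""
    return reducer([mapper(seed, samples_per_map) for seed in range(num_maps)])
```

On a cluster, each `mapper` call would run as a separate task and the framework would route the partial tallies to the reducer. Here they run sequentially in one process.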

Spark 1.6.0 (released this week, more below) adds Spark Datasets, a new type-safe API built on top of the DataFrame API. An introductory post shows some examples and quantifies (with some example benchmarks) how it improves memory usage and execution time. The API is available from Java and Scala.

The Cloudera blog highlights some improvements to how Apache Impala (incubating) handles Parquet data. Specifically, some pitfalls related to how Parquet and HDFS independently tune block size are now handled more smoothly.

The IBM developer blog has posted some preliminary benchmarks of the recently released Apache Spark 1.6.0. The release contains a number of changes, including performance optimizations, which impact workflows in different ways. They compared performance of JSON processing, MLlib's K-Means, and SparkSQL queries across 3 (or 4) recent versions of Spark.

This post describes how Apache Ranger integrates with Apache Hadoop HDFS to secure access. Ranger provides centralized security policy management that works in conjunction with HDFS' built-in controls. The post includes some examples of configuring Ranger policies for a directory in HDFS.

The big data analytics team at Cigna has built a stream-processing application that consumes data from Kafka, processes the data via Spark Streaming, and makes the data query-able via a RESTful HTTP API. The RESTful API pulls data from Impala using the Impyla Python API. The post describes a number of performance enhancements—configuration changes and improvements to caching and partitioning. These tunings and learnings should be really useful for anyone working with Spark Streaming and Kafka.

Hortonworks has posted a list of their most popular blog posts from 2015. These are mostly technical, covering topics like Hive, Storm, Spark, and releases of HDP.

This tutorial shows how to set up Elastic MapReduce with a separate instance of Apache Zeppelin for submitting jobs. This has the advantage of supporting multiple (or zero) clusters without needing to make major changes to the Zeppelin instance.

The Cloudera blog has a post showing how to hook up the Ibis Python library to Kudu to interact with data stored there. The article describes Kudu, demonstrates the Kudu Python library, and shows how to use Ibis with Kudu tables.

The Confluent "Log Compaction" blog has a bunch of highlights of recent developments in the Kafka community. There are lots of links and quick details about exciting new features (e.g. Kafka Connect and removing the ZooKeeper dependency for clients) and use cases (Kafka at Microsoft, Kafka with Spring).


The Databricks blog has an article reviewing the progress of Spark over the past year. It covers the community evolution and adoption, new data science/platform/streaming APIs, and performance optimization work.

This article from the MapR blog has six predictions for big data for the next year. They include increased interest in streaming data, shorter time to value, centralization, and rapid adoption of Hadoop for healthcare and telecommunications.


Apache Spark 1.6.0 was released this week. Among the major changes (the release includes many across several components) are a new Dataset API, unified memory management, improved Parquet performance, improved state management for Spark Streaming, and several new algorithms for MLlib (online hypothesis testing, bisecting k-means clustering, and more).


Curated by Datadog



HadoopSF January 2016 Meetup (San Francisco) - Tuesday, January 12


Jump Start into Apache Spark (Seattle) - Tuesday, January 12


Analytics with Spark and Cassandra (Denver) - Tuesday, January 12


Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Chicago) - Wednesday, January 13

South Carolina

Data Analytics Infrastructure (Charleston) - Tuesday, January 12

New York

Querying Network Packet Captures with Spark and Drill (New York) - Wednesday, January 13

First Meetup - Reactive Monitoring and Distributed Streaming (New York) - Thursday, January 14


Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Boston) - Tuesday, January 12

Open Analytics Boston: Short Talks & Demos (Boston) - Thursday, January 14


Spark Basics (Montreal) - Wednesday, January 13

IRELAND

Elastic Big Data Processing with Myriad and Mesos. ETL Use Cases and Hadoop (Dublin) - Monday, January 11


Apache Spark, Scala, Reactive Technologies and Machine Learning Discussions (Berlin) - Tuesday, January 12


Dive into Hadoop (HDInsight): Common Big Data Analysis Scenarios on Microsoft Azure (Krakow) - Wednesday, January 13


Big Data Meetup - January 2016 (Budapest) - Monday, January 11


Spark Installation & MLLib - Wednesday, January 13