Data Eng Weekly

Hadoop Weekly Issue #186

11 September 2016

Apache NiFi hit 1.0.0 this week and Apache Hadoop announced version 3.0.0-alpha1. In addition to those big releases, there were a number of minor releases, and there are several interesting technical and news articles.


This post looks at the Apache Beam and Apache Spark programming models. The author argues that Beam's programming model and optimization capabilities make it a strong choice for stream processing, and that the sweet spot for Spark (and something Beam doesn't support well) is iterative algorithms. The post includes code implementing the same non-trivial use-case in both frameworks.

The Amazon Big Data blog has a post that describes how to use a number of their services (VPC Flow Logs, Lambda, and Elastic MapReduce) to analyze traffic inside a VPC. The actual analysis is implemented as two streaming MapReduce jobs written in Python and awk.

The Hortonworks blog has the first part in a two-part series on Apache Ranger's dynamic column masking and row-level filtering. Part one motivates the features with example use-cases.

JVM startup time can be a major limitation when building a system for interactive queries. The Qubole blog describes how they solved this problem by using a Ruby client for invoking Presto queries.

Apache Ambari Views Server is a standalone server that doesn't manage a Hadoop cluster but provides access to Views (e.g. to browse the file system). This tutorial describes how to configure an Ambari Views Server in version 2.4.

Skool is a new open-source tool from BT for data loading to Hadoop. It automates the process of setting up a new database -> HDFS job by validating connectivity, doing a test import, and generating an Oozie workflow to schedule incremental imports. The introductory post has more details on the tool and plans for future features.

Using an example data set of mobile network data, this tutorial describes how to build a streaming pipeline using Spark Streaming and MapR Streams. The examples use the Kafka API, so they should be mostly applicable to a Kafka cluster as well.


Hadoop Summit Melbourne was this week. This post contains an overview of the event and summaries of talks from Hortonworks, Telstra, Yahoo! Japan, and the Barcelona Supercomputing Centre.

insideBIGDATA has an interview with Justin Kestelyn of Cloudera in which they discuss Cloudera's focus and changes over the past year, Apache Kudu, the future of Cloudera, and more.

The speaker lineup for Crunch Conference, the data engineering/analytics conference, has been posted. The conference takes place in Budapest from October 5th through October 7th.


Apache NiFi released version 1.0.0 with a refreshed UI, automatic cluster coordination, improved authorization, and better support for version control of data flow templates.

Apache Flink 1.1.2 is a bugfix release that resolves 19 tickets.

IBM announced a technical preview of IBM Open Platform with Apache Hadoop 4.3. The release includes Apache Spark 2.0.

Apache Accumulo 1.8.0, which resolves over 200 bugs and contains 75 improvements/new features, was released. Major changes include speedups (WAL rollover improvements, rate-limiting of major compactions), improved API support for RFiles, table sampling, upgrading to Apache Thrift 0.9.3, and the ability to run multiple tablet servers on a single node.

Version 3.5.0 of the Malhar library for Apache Apex was released. The release includes a new windowing operator to fit with Apache Beam semantics, spillable data structures, a deduper task, and more.

The Apache NiFi project announced a new release artifact, MiNiFi C++, a C++ library for collecting sensor data where the JVM isn't practical. Version 0.0.1 is now available.

Apache Storm has announced a maintenance release, version 0.9.7, which resolves issues with multi-lang support.

Confluent announced version 3.0.1 of the Confluent Platform based on Apache Kafka.

Apache Hadoop has announced the first alpha release of Hadoop 3.0. As noted in the announcement, the release contains thousands of changes since the 2.7.x release. Major highlights (from the release overview) include a move to Java 8, support for erasure coding in HDFS, a new YARN Timeline Service, a shell script rewrite, support for more than two NameNodes, intra-datanode balancing, and more.

DataMountaineer, Confluent, and DataStax have announced a Certified DataStax connector for Kafka Connect.


Curated by Datadog



September 2016 Meetup (San Francisco) - Tuesday, September 13

Introduction to Succinct by UC Berkeley AmpLab (San Francisco) - Tuesday, September 13

Demo and Open House: Big Data Processing with Apache Spark (San Francisco) - Tuesday, September 13

Lambda-In-A-Box: Combining Spark and HBase (San Francisco) - Wednesday, September 14

Alluxio: Unifying APIs, Accelerating ML, & Enabling Cloud Architectures (San Francisco) - Wednesday, September 14

September Apache Kafka Meetup (Mountain View) - Thursday, September 15

How to Simplify Your Streaming Data Architecture with Kafka and Voltdb (Menlo Park) - Thursday, September 15

Explore Big Data at Speed of Thought with Spark 2.0 and Snappydata (Culver City) - Thursday, September 15


Top 5 Mistakes When Writing Spark Applications (Seattle) - Wednesday, September 14


Microsoft Azure HDInsight: Hadoop in the Cloud (Addison) - Monday, September 12


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, September 12


Machine Learning and Graph Processing on Accumulo w/ Spark (Arlington) - Wednesday, September 14


Integrating Real-time Data Streams with Spark & Kafka (Atlanta) - Thursday, September 15

New Jersey

Data Pipelines with Kafka Connect (Princeton) - Wednesday, September 14

New York

Past, Present, and Future of Apache Ambari (New York) - Tuesday, September 13


Cloudera Meetup (Sao Paulo) - Friday, September 16


Jenkins Meets Spark: Building a Continuous Integration Intelligence (Stockholm) - Thursday, September 15


Kafka, Parallel Streams, Consul (Gdansk) - Wednesday, September 14

Apache Airflow: Best Practices and Roadmap (Warsaw) - Wednesday, September 14


Building Scalable Machine Learning Applications at Groupon (Belgrade) - Wednesday, September 14

Enter Kafka Streams (Novi Sad) - Thursday, September 15


September Meetup: Spark in Adobe & Intro in Kafka (Bucharest) - Wednesday, September 14


Apache Spark Day @ IDUG Australia (Sydney) - Tuesday, September 13