Data Eng Weekly

Hadoop Weekly Issue #139

27 September 2015

Strata + Hadoop World is this week in NYC, and a large number of related meetups are taking place in the city. In anticipation of the conference, there are a few new releases, and we can expect to see many more announcements this week. I won't be attending, so please send along any interesting news and presentations that you see.


If you like reading about distributed systems or are interested in learning more about the CAP theorem, then Martin Kleppmann's "A Critique of the CAP Theorem" is for you. It discusses the theorem and many of the common confusions in terminology. It then proposes an alternative to the CAP theorem, which is aimed at helping practitioners reason about common trade-offs.

The Apache blog describes building an Apache NiFi flow that ingests tweets from the Twitter API, does some lightweight processing, and stores the resulting tweets in Solr. It demonstrates some of NiFi's built-in tools, such as JSON evaluation and batching.

The Databricks blog has a post that gives an overview of Spark's implementation of Latent Dirichlet Allocation (LDA). Spark implements an online variant of the algorithm, which improves performance and scalability. The post links to example code on GitHub and provides a number of tips for using LDA.

Spark Testing Base is a library for testing Spark code in Scala and Java. This post gives an overview of the functionality, which includes the ability to test non-trivial jobs (such as Spark Streaming).

This post articulates several reasons why it's a good idea to invest in operating a centralized schema registry for a data platform. Reasons include enforcing safe schema evolution, storage efficiency, data discovery, and data policy enforcement. The post also describes why it's critical for stream processing.
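To make "safe schema evolution" a little more concrete, here is a minimal sketch of the kind of check a registry might run before accepting a new schema version. It assumes Avro-style record schemas represented as plain dictionaries and checks only one rule: a field added in the new version needs a default so that older records can still be read. The function name and the sample schemas are hypothetical, and a real registry checks much more (type changes, aliases, removed fields).

```python
from typing import Dict, List


def added_fields_without_default(old_schema: Dict, new_schema: Dict) -> List[str]:
    """Return names of fields added in new_schema that have no default value."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


# Hypothetical schema versions, for illustration only.
v1 = {"type": "record", "name": "Tweet", "fields": [
    {"name": "id", "type": "long"},
    {"name": "text", "type": "string"},
]}
v2_safe = {"type": "record", "name": "Tweet", "fields": v1["fields"] + [
    {"name": "lang", "type": "string", "default": "en"},
]}
v2_unsafe = {"type": "record", "name": "Tweet", "fields": v1["fields"] + [
    {"name": "user", "type": "string"},
]}

print(added_fields_without_default(v1, v2_safe))    # no offending fields
print(added_fields_without_default(v1, v2_unsafe))  # the "user" field is flagged
```

A registry that rejects the second version at registration time catches the problem before any consumer fails on old data.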

Erasure coding is a well-known data-protection mechanism that can incur less storage overhead than Hadoop's three-way replication. Adding it to HDFS was proposed over five years ago, and engineers from Cloudera and Intel are working on it for the upcoming Hadoop 3.0 release. This blog post has an in-depth overview of the strategy and implementation, which takes advantage of hardware acceleration for encoding and decoding parity data.
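The storage savings are easy to see with a little arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks per stripe; that layout and the function names are illustrative assumptions, not details taken from the post.

```python
from fractions import Fraction


def replication_overhead(replicas: int) -> Fraction:
    """Extra storage per unit of user data when keeping `replicas` full copies."""
    return Fraction(replicas - 1, 1)


def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Extra storage per unit of user data for a (data, parity) erasure code."""
    return Fraction(parity_blocks, data_blocks)


# Three-way replication stores two extra copies of every block.
three_way = replication_overhead(3)

# Assumed Reed-Solomon (6, 3) layout: three parity blocks per six data blocks.
rs_6_3 = erasure_coding_overhead(6, 3)

print(f"3x replication overhead: {float(three_way):.0%}")
print(f"RS(6,3) overhead:        {float(rs_6_3):.0%}")
```

Under these assumptions, replication costs 200% extra storage while the erasure-coded layout costs 50%. The trade-off is that reading after a failure requires decoding rather than simply fetching another copy.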

Hue includes Livy, a REST interface for interacting with Spark. This post describes how to start Livy to run Spark jobs, and it gives examples of starting a Spark shell and entering commands via the REST API.
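As a rough sketch of what that interaction looks like, the snippet below builds the two HTTP requests involved: one to create a session and one to submit a statement. It assumes Livy is listening on localhost:8998 and uses the `/sessions` and `/sessions/{id}/statements` endpoints with `kind` and `code` payload fields; the API may differ between Livy versions, so check these against the version you run. The helper names are hypothetical, and the requests are only built here, not sent.

```python
import json
import urllib.request

# Assumption: a Livy server on the local machine, port 8998.
LIVY_URL = "http://localhost:8998"


def json_post(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(base_url: str = LIVY_URL) -> urllib.request.Request:
    """Request asking Livy to start an interactive Spark (Scala) session."""
    return json_post(f"{base_url}/sessions", {"kind": "spark"})


def run_statement_request(
    session_id: int, code: str, base_url: str = LIVY_URL
) -> urllib.request.Request:
    """Request submitting a snippet of code to an existing session."""
    return json_post(f"{base_url}/sessions/{session_id}/statements", {"code": code})


# Sending a request needs a running server, e.g.:
#   with urllib.request.urlopen(create_session_request()) as resp:
#       session = json.load(resp)
req = run_statement_request(0, "1 + 1")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

The session id in the second call would normally come from the JSON response to the first.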

Unlike Java or Scala libraries, Python libraries often aren't portable across machines. This can cause problems for a distributed computation with PySpark, but there are a few strategies for distributing the necessary libraries. This post describes them (e.g. shipping a .py file or an egg, or setting up a virtualenv on each node) and when each is most appropriate.
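For the simplest of those strategies, shipping files alongside the job, a minimal sketch follows. It builds a `spark-submit` command line using the `--py-files` option; the helper function and the file names are hypothetical, and nothing is executed.

```python
from typing import List


def spark_submit_command(app: str, py_files: List[str]) -> List[str]:
    """Build a spark-submit invocation that ships extra Python files with the job."""
    cmd = ["spark-submit"]
    if py_files:
        # --py-files takes a comma-separated list of .py, .zip, or .egg files.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)
    return cmd


# Hypothetical file names, for illustration only.
cmd = spark_submit_command("job.py", ["helpers.py", "deps.egg"])
print(" ".join(cmd))

# Inside a running job, SparkContext.addPyFile(path) serves the same purpose.
```

This approach generally suits pure-Python code; libraries with compiled extensions are the case where a per-node install such as a virtualenv tends to be needed.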

This post describes Coursera's data infrastructure, which ties together Cassandra, Scalding, Amazon Redshift, and more. To manage workflows, they use Dataduct, a Python framework built on AWS Data Pipeline.


The O'Reilly Radar blog has a post about how the Apache Drill project grew a community and how the community helped shape the project. For example, an early design meeting was streamed for remote participants outside of the Bay Area.

VentureBeat reports that Cloudera is working on a new storage engine called Kudu, which aims to offer features that fit between those of HBase and HDFS.

MapR has a post introducing Apache Flink. The article describes the origins of the project, the meaning of the name "Flink," and Flink's event-based stream processing. On the topic of stream processing, it discusses when streaming makes sense and when micro-batching is the better fit.


Version 0.3.0 of Apache NiFi, the data processing and distribution system, was released this week. This release includes performance improvements, integration with Ambari, support for processing images, support for Kerberized Hadoop clusters, and new Avro capabilities.

Spark-Timeseries is a new library for working with time series data in Spark. It provides an abstraction for time series datasets and includes support for various manipulation functions (e.g. aligning, missing value imputation) and stats/models (such as exponentially weighted moving average).
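For readers unfamiliar with the last of those, here is a minimal plain-Python sketch of an exponentially weighted moving average. It is not the library's API; the function name, the choice to seed the average with the first observation, and the sample values are all assumptions made for illustration.

```python
from typing import List, Sequence


def ewma(values: Sequence[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average, seeded with the first value.

    Each output is alpha * current value + (1 - alpha) * previous output,
    so a larger alpha gives more weight to recent observations.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(ewma([1.0, 2.0, 3.0], alpha=0.5))  # [1.0, 1.5, 2.25]
```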

Apache Sentry 1.6.0-incubating was released this week. Sentry is a system for fine-grained access control in Hadoop, and the new release adds a Sqoop2 integration, a new dump/load tool, and more. The release also contains a number of bug fixes and improvements.

Apache Accumulo, the distributed key-value store, released version 1.5.4. The bug-fix release includes a fix for a data-loss bug.

Version 2.0 of BlueData EPIC was announced. The release switches to a Docker-based deployment system, which provides the flexibility of managing a cluster of virtualized machines in addition to physical machines. Other highlights include support for Apache Zeppelin and an app store for installing partner applications.

Google Cloud Dataproc is a new offering from the Google Cloud Platform for deploying Hadoop and Spark clusters. The system is integrated with Google's other cloud services and is priced at 1 cent per virtual CPU per hour (on top of the normal instance cost).

Cask has released version 3.2 of the Cask Data Application Platform. The new release includes Cask Hydrator (a framework and UI for batch/real-time data ingestion and ETL), new auditing and lineage support, views, and more.

Cascading-Flink is a new project to use Apache Flink as the execution engine for Cascading flows. Key features include sophisticated memory management (reducing the risk of OutOfMemoryErrors) and performance improvements for flows with type information. The project doesn't yet support hash-based outer joins, and it relies on a development version of Apache Flink.

Apache Hadoop 2.6.1 was released with critical fixes that were backported from the 2.7 and 2.8 development trees.


Curated by Datadog



Scalable Machine Learning at Yahoo (San Jose) - Monday, September 28

Introduction to BigQuery (Clovis) - Thursday, October 1


Enterprise Dataflow with Apache NiFi (Tempe) - Thursday, October 1


Spark DataFrames (Chicago) - Tuesday, September 29


Learn about Improvements in Apache Spark (Madison) - Tuesday, September 29


Apache Ranger for Securing Hadoop (Atlanta) - Wednesday, September 30

North Carolina

September CHUG Event: SnapLogic (Charlotte) - Wednesday, September 30

New York

Rethinking SQL for Big Data with Apache Drill (New York) - Monday, September 28

One Hadoop, Multiple Clouds (New York) - Monday, September 28

Meetup at Strata + Hadoop World NYC 2015 (New York) - Monday, September 28

Best Practices for PySpark, with Juliet Hougland of Cloudera (New York) - Tuesday, September 29

Using Python at Scale for Data Science, with Wes McKinney (New York) - Tuesday, September 29

Hadoop World NYC 2015 (New York) - Tuesday, September 29

Resolving Transactional Access/Analytic Performance Trade-Offs in Hadoop (New York) - Tuesday, September 29

Committer Night: Spark 1.5 and Beyond (New York) - Tuesday, September 29

HBase Meetup (New York) - Tuesday, September 29

Oryx 2: Lambda Architecture on Spark, Kafka for Real-Time Large Scale ML (New York City) - Tuesday, September 29

Impala Lightning Talks in NYC (New York) - Tuesday, September 29

Hadoop World 2015 (New York) - Wednesday, September 30

MADlib + HAWQ for Advanced SQL Machine Learning on Hadoop (New York) - Thursday, October 1

Twitter Heron: Stream Processing at Scale (New York) - Thursday, October 1


1st Meetup of Hadoop User Group Rennes (Rennes) - Wednesday, September 30

Hadoop Meetup Sur La Seine (Paris) - Thursday, October 1


Cascading on Flink & Tracking the Trackers with Flink (Berlin) - Wednesday, September 30


Big Data Meetup: September 2015 (Budapest) - Monday, September 28


5th BigData/DataScience Cluj-Napoca Meetup (Cluj-Napoca) - Wednesday, September 30