Data Eng Weekly

Hadoop Weekly Issue #204

12 February 2017

The content in this week's issue is notable for including a number of projects at the periphery of the Hadoop ecosystem (Pachyderm, MongoDB, RethinkDB, and Hazelcast). If you're just here for the Hadoop ecosystem news, don't worry: there's great coverage of Spark on YARN, PySpark and Apache Arrow, and Kafka Streams.


Pachyderm, a data lake that supports version control for data, includes a data processing API. This article and the referenced sample code demonstrate how to use the Python API to join two datasets, and they describe how Pachyderm takes care of sharding and distributing the data as necessary.
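
The kind of key-based join the article performs can be sketched in plain Python. This is only an illustration of the join logic, not Pachyderm's API; the dataset names and fields are made up, and in Pachyderm the sharding and distribution of this work would be handled for you.

```python
# Plain-Python sketch of an inner join on a shared key field.
# Datasets and field names are hypothetical.

def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key field."""
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in right_index.get(row[key], []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined

trips = [{"city_id": 1, "trips": 120}, {"city_id": 2, "trips": 75}]
cities = [{"city_id": 1, "name": "Boston"}, {"city_id": 2, "name": "Chicago"}]
print(join_on_key(trips, cities, "city_id"))
```

Indexing the right-hand dataset first keeps the join to a single pass over each input, which is the same shape a distributed system would use per shard.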

While configuring and monitoring a highly available YARN cluster isn't the easiest thing, YARN does offer some clear advantages as a platform for running Spark, including support for multiple versions of Spark, resource management, and resilience. The Altiscale blog elaborates on these and other benefits.

This post shares an analysis of MongoDB's replication and durability guarantees under Jepsen testing (which introduces network partitions and other failure scenarios). MongoDB's previous replication system had inherent flaws, but a new replication system (based on Raft) fixes the fatal ones. In fact, no major bugs related to lost updates, dirty reads, or stale reads were found in MongoDB 3.4.0.

This presentation describes how Apache Arrow speeds up the bridge between Python and the JVM for PySpark. The talk starts by describing how PySpark interacts with the Spark execution engine, then covers some of the improvements that already exist and some speed-ups that are coming down the pike.

This post demonstrates using Pivotal Cloud Foundry to launch a PySpark application to train a linear model and to launch a Python Flask application to serve predictions based on the trained model coefficients.
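
The serving half of that setup can be sketched briefly. The coefficient values below are made up, and the Flask route is shown only as a comment so the scoring logic stands on its own; the post's actual application exposes this behind an HTTP endpoint.

```python
# Minimal sketch of serving predictions from exported linear-model
# coefficients. Weights are hypothetical, not from the post.

COEFFICIENTS = [0.5, -1.2, 3.0]  # hypothetical weights from the PySpark job
INTERCEPT = 0.75

def predict(features):
    """Score one feature vector with the exported linear model."""
    if len(features) != len(COEFFICIENTS):
        raise ValueError("expected %d features" % len(COEFFICIENTS))
    return INTERCEPT + sum(c * x for c, x in zip(COEFFICIENTS, features))

# In a Flask app this would be wrapped roughly as:
#   @app.route("/predict", methods=["POST"])
#   def predict_route():
#       return jsonify({"prediction": predict(request.json["features"])})

print(predict([1.0, 2.0, 0.0]))
```

Serving only the coefficients keeps the prediction service lightweight: no Spark dependency is needed at inference time.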

Cloudera has a post demonstrating analysis of flight data with sparklyr, the Spark-based backend for dplyr.

AWS has published scripts for importing Hive table definitions into Amazon Athena (the Presto-based, hosted big data query engine), and a blog post that describes how to use them.
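
At its core, that import is a translation from a Hive table definition to DDL that Athena can run. A hedged sketch of that transformation (table name, columns, and S3 location below are hypothetical, and the real AWS scripts handle many more cases, such as partitions and serdes):

```python
# Sketch: render a CREATE EXTERNAL TABLE statement for Athena from a
# Hive-style column list. Inputs are hypothetical.

def athena_ddl(table, columns, location, fmt="PARQUET"):
    """Build a minimal Athena CREATE EXTERNAL TABLE statement."""
    cols = ",\n  ".join("%s %s" % (name, typ) for name, typ in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s\n)\n"
        "STORED AS %s\nLOCATION '%s';" % (table, cols, fmt, location)
    )

print(athena_ddl(
    "flights",
    [("year", "int"), ("carrier", "string"), ("dep_delay", "double")],
    "s3://example-bucket/flights/",
))
```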

The Databricks blog has an example of using the Intel BigDL project for deep learning with Apache Spark. The post describes how to get started, including training the model and evaluating the predictions it makes.

This tutorial builds on a Kafka Streams application that consumes demographic data about countries to create a streaming calculation of the top-3 countries by population within each continent.
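
The aggregation at the heart of that tutorial, a continuously updated top-3 ranking per continent, can be sketched in plain Python. This is not the Kafka Streams API (the real version uses a KTable and a custom aggregator in Java); the records below are made up, and each call simulates one record arriving on the stream.

```python
# Plain-Python sketch of a streaming "top 3 by population per continent"
# aggregation. Each update() call simulates one record from the stream.
import heapq
from collections import defaultdict

populations = defaultdict(dict)  # continent -> {country: population}

def update(continent, country, population):
    """Apply one record and return the new top 3 for that continent."""
    populations[continent][country] = population  # upsert, like a KTable
    return heapq.nlargest(3, populations[continent].items(),
                          key=lambda kv: kv[1])

update("Europe", "Germany", 82_000_000)
update("Europe", "France", 67_000_000)
update("Europe", "Italy", 60_000_000)
top3 = update("Europe", "Spain", 46_000_000)
print([country for country, _ in top3])
```

Treating each record as an upsert into per-continent state mirrors the changelog semantics of a KTable: a new population figure for an existing country replaces the old one rather than adding a duplicate.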


The first two chapters of "Data Science on the Google Cloud Platform" are available as part of the O'Reilly early release program.

The Cloud Native Computing Foundation (CNCF) has purchased the rights to RethinkDB, and they have re-licensed it under the Apache License. This adds a strong distributed database system to the CNCF portfolio of hosted projects which includes Fluentd and Kubernetes.

Apache Ranger, the security system for the Hadoop ecosystem, has graduated to be a top-level project. The announcement has a good overview of the features Ranger provides as well as some companies that are using it.

Hadoop as a Service vendor Qubole has announced that they're now SOC 2 Type II compliant.

Spark Summit East was last week in Boston. This post has summaries of and links to videos/slides from a number of sessions at the Summit.


The Hortonworks blog has a recap of the features in the recently released Apache Zeppelin 0.7.0. Key features include improvements to multi-user support, a new pluggable visualization system, and Spark improvements (adding support for Spark 2.1).

Apache Flink 1.2.0 was released. It's a giant release that resolves over 650 issues while maintaining backwards compatibility with all public APIs. This post has an overview of the key features, including dynamic scaling of streaming jobs (by restoring from a savepoint), support for running Flink on Apache Mesos, experimental support for encryption in transit, and major improvements to the Table API.

In addition to their compliance news, Qubole has announced the general availability of the Qubole Data Service on the Oracle Bare Metal Cloud Service.

MapR has announced the MapR Converged Data Platform for Docker, which provides a mechanism for running Docker containers atop MapR. Using the MapR file system and MapR Streams, microservices can be relocated to another server in the cluster without losing any state.

Apache Beam 0.5.0 was released this week. This release adds new APIs for State and Timers. According to the JIRA release notes, over 25 bugs were resolved, and more than 25 improvements and new features are part of the new version.

Hazelcast, makers of the in-memory data grid of the same name, have announced a new open-source distributed processing system called Hazelcast Jet. Jet uses cooperative multithreading to take advantage of multi-core CPUs, implements distributed support for java.util.stream, and more. The code is available on GitHub.


Curated by Datadog



#SDBigData Meetup #20 (San Diego) - Wednesday, February 15

55th Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, February 15

Big-Data-As-A-Service: Big Data Analytics on AWS (Santa Clara) - Wednesday, February 15

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Thursday, February 16

Mining Member Feedback to Improve the Customer Experience: Nishant Hegde from Netflix (Culver City) - Thursday, February 16


Data in Motion with Open Source Apache NiFi (Bellevue) - Wednesday, February 15


Keeping Spark on Track: Best Practices Using Apache Spark in Production (Chicago) - Monday, February 13


Hybrid Transactional/Analytical Processing Using Spark & In-Memory Data Fabrics (Tysons) - Thursday, February 16

New York

Introduction to Sendence Wallaroo: An Industrial-Grade Streaming Data Platform (New York) - Thursday, February 16


Distributed Redundant Queueing with Apache Kafka (Kitchener) - Wednesday, February 15


Apache Flink Meetup (Berlin) - Thursday, February 16


Hadoop User Group Meetup (Vienna) - Tuesday, February 14


How It Works at Hortonworks (Prague) - Thursday, February 16


Resilient Events Handling & Kappa Architecture (Herzeliyya) - Wednesday, February 15


Apache Kafka: Use Cases and Production Considerations (Cape Town) - Wednesday, February 15