Data Eng Weekly

Hadoop Weekly Issue #169

08 May 2016

This week's issue is short and sweet. Topics covered include Apache Beam, MapR's quarterly results, the recent Kafka Summit, and a new open-source distributed unit test framework from Cloudera.


Elastic has written a root cause analysis of recent outages. A misconfigured ZooKeeper memory setting caused excess garbage collection, which ultimately led to loss of the ZooKeeper quorum. The post describes a number of mitigation strategies they've implemented to prevent a similar problem in the future.

The Cask blog has a recap of the recent Big Data Applications Meetup. The first of the talks was about Pachyderm, which is based on Docker containers and provides "Git for your data" semantics. The second was about the big data platform at TubeMogul, which is built on Hadoop, Hive, Spark, and Presto.

Google and dataArtisans have both written about Apache Beam (formerly the Google Dataflow SDK). The Google post explains their motivation for open-sourcing and developing Beam, and the dataArtisans post talks about their support for the Beam model and how one should think about the relationship between the Flink and Beam APIs.

The IBM Hadoop dev blog has a run book for installing the Python, Scala, and R kernels for Jupyter notebooks. The post also describes how to connect to Spark and expose the notebook over SSL.

This post describes how the Mongo Hadoop connector functions as a go-between for Spark and MongoDB.

The Qubole blog has a post comparing the newest of the programming languages used for big data analysis—Python, R, and Scala.


MapR announced that they had a record quarter with 99% growth in subscription licenses and a 146% dollar-based net expansion rate.

This article describes a recent benchmark comparing Google Cloud Dataflow and Apache Spark on Google Compute Engine. Dataflow outperformed Spark by 2x to 5.7x (as always, it's best to evaluate your own workload rather than trusting benchmarks). The post also describes a "cold war" that is benefiting everyone using big data tools.

The Confluent blog has a recap from the recent Kafka Summit covering the pre-conference hackathon, keynotes, breakout sessions, and more.

Forbes has an overview of American Express' journey over the past five years to adopt big data technologies. In the article, AMEX shares some tips and lessons learned, such as the difficulty of adopting new technologies (and how important buy-in from the top of the organization is), the challenge of hiring and retaining engineers, and more.


Cask has announced version 3.4 of the Cask Data Application Platform (CDAP). The new release adds Cask Tracker, a new data lineage/audit/search system, updates the UI for Cask Hydrator, enhances Spark support, and more.

Cloudera has open-sourced dist_test, a new tool for running unit tests in parallel. With this tool, the unit tests for projects like Hadoop and Kudu run in minutes instead of hours. The tool has bindings for both C++ and Java, and there's a website demoing its features.

Google has announced a new integration between Google BigQuery and Drive to support saving output to Google Sheets.


Curated by Datadog



GE IoT Predix Time Series & Data Ingestion Service Using Apache Apex (San Jose) - Tuesday, May 10


Apache Spark Workshop and Combining ML Frameworks with Apache Spark (Bellevue) - Thursday, May 12


Introduction to Big Data Analytics Using Apache Spark and Apache Zeppelin (Chicago) - Thursday, May 12


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, May 9


Apache Ranger: Securing Big Data in Hadoop (Reston) - Wednesday, May 11

New Jersey

Apache NiFi: Deep Dive - Ingestion Technology (Hamilton) - Tuesday, May 10

New York

Apache Storm 1.0 with Taylor Goetz (New York) - Wednesday, May 11

Spark for Reactive Machine Learning: Building Intelligent Agents at Scale (New York) - Wednesday, May 11


Spark with C* + Testing/Modelling in Ruby (Toronto) - Tuesday, May 10

Vancouver Spark Meetup: ApacheCon Extravaganza (Vancouver) - Tuesday, May 10

IRELAND

Scaling Up Genomics with Spark + Understanding Your Customers Using Public Data (Dublin) - Monday, May 9


Spark Streaming Double Bill (London) - Thursday, May 12


Use of Hadoop for Large Scale Machine Learning at Yahoo (Trondheim) - Wednesday, May 11


Cassandra Introduction & Dashboarding with Spark/Cassandra (Kontich) - Monday, May 9


Big Data, Frankfurt v 2.0 (Frankfurt) - Thursday, May 12


Second Spark Meetup (Pune) - Thursday, May 12

High-Speed Connectors for Spark (Bangalore) - Saturday, May 14


Fault Tolerant Streaming + Spark & Cassandra + Operationalise Machine Learning (Sydney) - Tuesday, May 10