Data Eng Weekly

Hadoop Weekly Issue #141

11 October 2015

Spark is the topic of over half of the technical articles this week. As evidenced by new features and companies sharing practical knowledge, it is maturing (and gaining plenty of adoption) as a product. Aside from the great articles on Spark, I highly recommend the visualization covering the fundamentals of Raft's distributed consensus algorithm.


In Spark 1.5, SparkR gained support for distributed computation of generalized linear models. This tutorial shows how to use the SparkR APIs to build a linear model for predicting airline delays.

This tutorial describes how to build an Apache Spark cluster on Amazon Web Services using spot instances (which provide significant cost savings). The instructions describe using the AWS web console, installing Spark using a recent release, and configuring Spark's important settings.

The MapR blog has a guide to Spark Streaming, which discusses Spark Streaming's API and streaming model (microbatch). It also describes processing semantics (at least once, exactly once, at most once), which vary depending on the input source for Spark Streaming.
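To make the difference between those semantics concrete, here is a minimal sketch in plain Python (not Spark Streaming code). It assumes each record carries a unique id, which is a hypothetical convention for this example, and simulates a source that replays a batch after a failure. A sink that appends blindly sees duplicates (at least once), while a sink keyed by record id ends up with each record exactly once.

```python
from typing import Dict, Iterable, List, Tuple

# (unique record id, payload); the id field is an assumption for this sketch
Record = Tuple[int, str]


def deliver_at_least_once(batch: List[Record], fail_after: int) -> List[Record]:
    """Simulate a source that replays the whole batch after a mid-batch failure."""
    first_attempt = batch[:fail_after]  # records handed over before the simulated crash
    replay = batch                      # after recovery, the full batch is sent again
    return first_attempt + replay


class NaiveSink:
    """Appends every record it receives, so replayed records become duplicates."""

    def __init__(self) -> None:
        self.rows: List[str] = []

    def write(self, records: Iterable[Record]) -> None:
        for _, payload in records:
            self.rows.append(payload)


class IdempotentSink:
    """Keys writes by record id, so a replayed record simply overwrites itself."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}

    def write(self, records: Iterable[Record]) -> None:
        for record_id, payload in records:
            self.rows[record_id] = payload


batch = [(1, "a"), (2, "b"), (3, "c")]
delivered = deliver_at_least_once(batch, fail_after=2)

naive = NaiveSink()
naive.write(delivered)

idempotent = IdempotentSink()
idempotent.write(delivered)

print(len(naive.rows))       # 5: two records were delivered twice
print(len(idempotent.rows))  # 3: duplicates collapsed by id
```

Idempotent writes are only one way to get exactly-once results on top of an at-least-once source; transactional output is another common option.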

Datanami has an article describing how Uber has migrated from a data system built on Amazon EMR and Celery/Python ETL to a new system built on Spark and Kafka. Uber makes heavy use of Spark Streaming and Spark SQL, and they've built two Spark-based tools to keep the system running smoothly. The first, called Paricon, is used to validate data contracts when schemas change, and the second, called Komondor, takes care of common ingestion pieces (like dedup).

Compared to other distributed systems, Kafka is relatively easy to configure and operate. But that's not to say it never has problems—this presentation describes several situations where folks have experienced trouble.

The Stitch Fix blog has a post describing their experience with Spark. It covers how they think about when to use Spark, the Spark Data Source API, caching, the DataFrame API, and Spark SQL. There are some good tips and anecdotes; for example, Stitch Fix converted some Python jobs to use the DataFrame API and saw 6x performance improvements.

This post describes some statistical tests added to Spark's MLlib for Goodness-of-Fit. It contains some background on the tests, and how they're implemented in Spark.
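For readers new to goodness-of-fit testing, here is a generic sketch in plain Python of what such a test computes. It uses Pearson's chi-squared statistic as an illustration; that choice is an assumption for this example and is not necessarily one of the tests covered in the post, and the code does not use MLlib.

```python
from typing import List


def chi_squared_statistic(observed: List[float], expected: List[float]) -> float:
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a three-sided spinner, tested against a uniform fit.
observed_counts = [18.0, 22.0, 20.0]
expected_counts = [20.0, 20.0, 20.0]

statistic = chi_squared_statistic(observed_counts, expected_counts)
print(statistic)  # approximately 0.4; small values indicate a close fit
```

The statistic is then compared against a chi-squared distribution with (number of categories - 1) degrees of freedom to obtain a p-value.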

The MapR blog has a recap of the three talks given at the recent Bay Area Apache Flink Meetup. The talks covered stateful distributed stream processing, Gelly (the Flink graph processing API), and the future of Apache Flink.

Kudu, the new distributed storage engine from Cloudera, includes APIs in Java, C++, and Python (in alpha). These articles give an overview and introduction to the Kudu APIs in Python and Java.

Sparkling Water is a library for combining H2O's machine learning APIs and UI with Apache Spark. This post describes how Spark and H2O work together (both the API and architecture) and walks through an example of building a deep learning model using Sparkling Water.

This visualization provides an excellent introduction to the Raft distributed consensus algorithm. During the visualization (which lasts about 5 minutes), several animations describe leader election and log replication. If you're a visual learner (or even if not), this is one of the best ways to learn the fundamentals of Raft.
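As a companion to the animations, here is a minimal sketch in plain Python of the arithmetic behind leader election and log replication in Raft. The function names are made up for this example, and the sketch leaves out networking, timeouts, and persistence entirely.

```python
from typing import List


def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader once a majority has voted for it in its term."""
    return votes_received >= majority(cluster_size)


def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Voters only grant a vote if the candidate's log is at least as up to date:
    a later last term wins; with equal terms, the longer log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


def majority_replicated_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of servers.

    match_index holds the last replicated index per server, including the leader.
    """
    ordered = sorted(match_index, reverse=True)
    return ordered[majority(len(match_index)) - 1]


print(majority(5))                                   # 3
print(wins_election(2, 5))                           # False
print(majority_replicated_index([5, 5, 3, 2, 1]))    # 3
```

One simplification to be aware of: a real Raft leader also checks that the entry at that index was written in its own current term before treating it as committed.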


The Call for Abstracts for Hadoop Summit Europe, which takes place in Dublin on April 13-14, is open until October 30th.


Apache Ignite, the in-memory data-fabric, released version 1.4.0 this week. It's the first release since Ignite graduated from the Apache incubator, and it adds SSL support, a faster JDBC driver, and more.

Apache Accumulo 1.6.4 was released. The new version of the distributed key-value store includes bug-fixes and performance improvements. Notably, this release contains a fix for silent data-loss during bulk import.

Cook is a new open-source Mesos framework scheduler from Two Sigma. Cook is a batch-scheduler designed to balance latency and throughput when there are more jobs than a Mesos cluster has capacity for. It has built-in support for Spark (including a Spark scheduler backend).


Curated by Datadog



The Data Scientists' Guide to Apache Spark (San Francisco) - Monday, October 12

Spark at Thomson Reuters and Project Tungsten (San Francisco) - Tuesday, October 13

Samza October Meetup (Sunnyvale) - Tuesday, October 13

Impala: Tuning and Best Practices (San Mateo) - Wednesday, October 14


ML on Spark Roundtable (Bellevue) - Wednesday, October 14


Learn about SpliceMachine (Houston) - Tuesday, October 13

A Deeper Dive Into Apache Drill and Big Data with MongoDB (Addison) - Tuesday, October 13


Ted Dunning: Data Science & Business Intelligence (Kennesaw) - Tuesday, October 13

New York

Transactions on Hadoop/HBase (New York) - Thursday, October 15


10th Spark London Meetup (London) - Monday, October 12

Deep Dive: Spark SQL+ DataFrames + Cassandra Connector (Edinburgh) - Tuesday, October 13

October HUG UK MeetUp (London) - Thursday, October 15

IRELAND Spark After Dark/Spark for Fraud Detection (Dublin) - Thursday, October 15


Flink Forward 2015 (Berlin) - Monday, October 12


Lambda & Kappa Architektura (Prague) - Thursday, October 15


Cloudera Meets StarSchema and VirtDB to Rock Your Data (Budapest) - Monday, October 12


Mumbai Spark Meetup 4Q2015 (Mumbai) - Saturday, October 17

Spark on YARN (Bangalore) - Saturday, October 17


Not Your Dad’s Old HBase (Melbourne) - Thursday, October 15