Data Eng Weekly

Hadoop Weekly Issue #145

15 November 2015

This issue is short and sweet, with coverage of Amazon EMR, Apache Apex, Apache Ambari, and more. Apache Spark, Apache Cassandra, and Apache Mahout all released new versions this week, and Succinct Spark is an interesting new project to keep an eye on.


Given the elasticity and capabilities (such as the S3 blob store) of the AWS cloud environment, Amazon Elastic MapReduce (EMR) offers some unique features. This article covers several of them: EMR's distinction between core and task nodes to support elasticity, the EMR FileSystem, S3DistCp, and more.

This post from the Cloudera blog describes how to continuously ingest data into HDFS and Hive (using one-minute batches) for querying by Impala. In addition to the basic concepts, the post describes how to productionize this setup by using staging tables (which are compacted daily) and a view over the compacted and active staging tables.
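
As a rough illustration of the staging-plus-view pattern, here is a minimal sketch that uses an in-memory SQLite database in place of Hive and Impala. The table and view names (events_staging, events_compacted, events) and the helper functions are hypothetical and are not taken from the Cloudera post; a real setup would use Hive tables on HDFS with a scheduled daily compaction job.

```python
import sqlite3


def build_demo():
    """Create a compacted table, an active staging table, and a view over both."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events_compacted (id INTEGER, payload TEXT);
        CREATE TABLE events_staging (id INTEGER, payload TEXT);
        -- Readers query the view, so they always see old and new rows together.
        CREATE VIEW events AS
            SELECT id, payload FROM events_compacted
            UNION ALL
            SELECT id, payload FROM events_staging;
        """
    )
    return conn


def ingest_batch(conn, rows):
    """Append one small batch (e.g. one minute of data) to the staging table."""
    conn.executemany("INSERT INTO events_staging VALUES (?, ?)", rows)
    conn.commit()


def compact(conn):
    """Move staged rows into the compacted table in a single transaction."""
    with conn:
        conn.execute(
            "INSERT INTO events_compacted SELECT id, payload FROM events_staging"
        )
        conn.execute("DELETE FROM events_staging")


def count_rows(conn, name):
    """Count rows in a table or view (name is a trusted, hard-coded identifier)."""
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]


def demo():
    conn = build_demo()
    ingest_batch(conn, [(1, "a"), (2, "b")])
    before = count_rows(conn, "events")
    compact(conn)
    staged_after_compaction = count_rows(conn, "events_staging")
    ingest_batch(conn, [(3, "c")])
    after = count_rows(conn, "events")
    return before, staged_after_compaction, after
```

The point of the view is that queries keep working unchanged while rows move from the staging table to the compacted one.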

This post describes how Apache Apex (incubating), the batch and stream processing platform, utilizes checkpointing for fault-tolerance. Checkpoint data is written to HDFS, either asynchronously or synchronously (depending on the delivery guarantees), which allows any node in the cluster to recover the state.
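
The general idea of checkpoint-and-recover can be sketched in a few lines. This is not Apex's API: the CountingOperator class, its methods, and the file name are hypothetical, a local temporary directory stands in for HDFS, and only the synchronous variant is shown.

```python
import json
import os
import tempfile


class CountingOperator:
    """Toy stateful operator: counts words seen across processing windows."""

    def __init__(self, window_id=0, counts=None):
        self.window_id = window_id
        self.counts = dict(counts or {})

    def process(self, window_id, words):
        for word in words:
            self.counts[word] = self.counts.get(word, 0) + 1
        self.window_id = window_id

    def checkpoint(self, path):
        # Synchronous checkpoint: write to a temp file, then rename, so a
        # reader never sees a half-written snapshot.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"window_id": self.window_id, "counts": self.counts}, f)
        os.replace(tmp_path, path)

    @classmethod
    def recover(cls, path):
        # Any process that can read the shared store can rebuild the state.
        with open(path) as f:
            state = json.load(f)
        return cls(state["window_id"], state["counts"])


def demo():
    # The temporary directory plays the role of the shared store (HDFS).
    with tempfile.TemporaryDirectory() as store:
        path = os.path.join(store, "operator.ckpt")
        op = CountingOperator()
        op.process(1, ["a", "b", "a"])
        op.checkpoint(path)
        op.process(2, ["a"])  # work done after the checkpoint is lost on failure
        recovered = CountingOperator.recover(path)
        return recovered.window_id, recovered.counts
```

In this sketch the recovered operator resumes from window 1, so anything processed after the last checkpoint would have to be replayed from upstream.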

The IBM Hadoop Dev blog has a post describing how to create email alerts in Apache Ambari to track when the status of a service becomes UNKNOWN (which is the case during a YARN HA failover).

This tutorial covers setting up an Amazon EMR cluster with Apache Spark and Apache Zeppelin (incubating). Next, the post gives the steps (i.e., setting up an SSH tunnel) needed to access the Zeppelin web UI. From there, it provides instructions for building a recommendation engine using the MovieLens dataset and a MatrixFactorization MLlib function.

This presentation describes best practices for several non-trivial features of Spark: RDD re-use, working with key-value data, Spark accumulators, and SparkSQL. The presentation also has a preview of some work underway in Spark MLlib.


Typesafe recently announced that they'll provide commercial support for Apache Spark.

This post lists 10 free Hadoop tutorials ranging from short to multi-step and in both text and video form.

The call for speakers for Strata+Hadoop World London, which takes place in May/June of 2016, is open until December 11th.


Apache Spark 1.5.2 was released this week. It's a maintenance release with over 60 resolved issues.

Apache Mahout 0.11.1 was also released this week. The new version contains 10 bug fixes and several performance improvements—to Spark support, to dot product calculations, and to %*% calculations.

The 3.0 release of Apache Cassandra contains a number of new performance optimizations, data storage savings, and developer enhancements. The new version was released this week.

Succinct Spark is a new library for interacting with data in the Succinct distributed store. Succinct compresses and indexes input data, which can lead to massive speedups to Spark programs in certain situations. The introduction describes several ways that Succinct is integrated with Spark, provides example code, and describes performance benefits.


Curated by Datadog



Hive Contributors Meetup (Santa Clara) - Monday, November 16

Intro to Apache Spark for Java and Scala Developers (Mountain View) - Wednesday, November 18

One Hadoop, Multiple Clouds (Palo Alto) - Wednesday, November 18

Kafka November Meetup (Mountain View) - Wednesday, November 18


Deep Dive Into Spark Streaming (Bellevue) - Wednesday, November 18


Hadoop in the Cloud (Clayton) - Tuesday, November 17


Building a Real-Time Transformation Engine on Spark Streaming (Saint Paul) - Thursday, November 19


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, November 16


First Miami HUG Meetup! (Coral Gables) - Thursday, November 19

North Carolina

Agile, Nimble, Tenacious: Modern Data Analytics with Apache Spark (Charlotte) - Wednesday, November 18


Practical Introduction to Apache Flink (Vienna) - Thursday, November 19


Connecticut Big Data #3 (Windsor) - Wednesday, November 18


Hands-On Presto Workshop (Boston) - Tuesday, November 17

Spark, Big Data and Analytics Meetup (Boston) - Thursday, November 19


Hadoop: Intro and Experiences with AWS Elastic Map Reduce (Montevideo) - Wednesday, November 18


Lets Get Started with Hadoop #4 (Oslo) - Thursday, November 19


Schedoscope: Pain-free Scheduling for Agile Hadoop Data Warehouses (Munich) - Tuesday, November 17


Lightning Talks: Flink Streaming, Spark Streaming, Control-M, Qlik (Warsaw) - Thursday, November 19


6th BigData/DataScience Cluj-Napoca Meetup (Cluj-Napoca) - Tuesday, November 17


Mumbai Spark Meetup 4Q2015 (Mumbai) - Saturday, November 21

Apache Spark: Introduction to Spark DataFrames/SQL and Deep Dive (Bangalore) - Saturday, November 21


Spark in Cloud, SparkR and Machine Learning, Demos from Spark Hackathon (Sydney) - Tuesday, November 17


Apache Spark Intro, RDD Basics and SparkSQL (Auckland) - Thursday, November 19