15 November 2015
This issue is short and sweet, with coverage of Amazon EMR, Apache Apex, Apache Ambari, and more. Apache Spark, Apache Cassandra, and Apache Mahout all released new versions this week, and Succinct Spark is an interesting new project to keep an eye on.
Given the elasticity and capabilities (such as the S3 blob store) of the AWS cloud environment, AWS Elastic MapReduce has some unique features available. This article covers several of them—EMR's distinction between core and task nodes to support elasticity, the EMR FileSystem, S3DistCp, and more.
This post from the Cloudera blog describes how to continuously ingest data into HDFS and Hive (using one minute batches) for querying by Impala. In addition to the basic concepts, the post describes how to productionize this setup by using staging tables (which are compacted daily) and a view over the compacted and active staging tables.
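As a rough illustration of the staging-table pattern described above, the sketch below uses Python's built-in sqlite3 module as a stand-in for Hive and Impala. The table and view names (events_active, events_compacted, events) and the helper functions are hypothetical and are not taken from the Cloudera post. The point is only that readers query one view while small batches land in an active table and are periodically moved into a compacted one.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with two staging tables and one view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events_active (ts INTEGER, payload TEXT);
        CREATE TABLE events_compacted (ts INTEGER, payload TEXT);
        -- Readers only ever query this view, never the tables directly.
        CREATE VIEW events AS
            SELECT ts, payload FROM events_compacted
            UNION ALL
            SELECT ts, payload FROM events_active;
    """)
    return conn


def ingest_batch(conn, rows):
    """Append one small batch (e.g. one minute of data) to the active table."""
    conn.executemany("INSERT INTO events_active VALUES (?, ?)", rows)
    conn.commit()


def compact(conn):
    """Move everything from the active table into the compacted table."""
    with conn:  # single transaction, so the view never double-counts rows
        conn.execute(
            "INSERT INTO events_compacted SELECT ts, payload FROM events_active"
        )
        conn.execute("DELETE FROM events_active")


def count_events(conn):
    """Count rows visible through the view."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Calling ingest_batch a few times, then compact, leaves count_events unchanged, which is the property the view is meant to provide.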
This post describes how Apache Apex (incubating), the batch and stream processing platform, utilizes checkpointing for fault tolerance. Checkpoint data is written to HDFS, either asynchronously or synchronously (depending on the delivery guarantees), which allows any node in the cluster to recover the state.
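The general idea of checkpoint-based recovery can be sketched in a few lines of Python. This is not Apex's API: the CountingOperator class, its methods, and the JSON file format are all hypothetical, a local file stands in for HDFS, and only the synchronous variant is shown.

```python
import json
import os
import tempfile


class CountingOperator:
    """Toy stateful operator that counts how often each key is seen."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.counts = {}
        self.batch_id = 0

    def process_batch(self, keys):
        """Update in-memory state with one batch of input keys."""
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        self.batch_id += 1

    def checkpoint(self):
        """Synchronously persist state; write to a temp file, then rename."""
        state = {"batch_id": self.batch_id, "counts": self.counts}
        directory = os.path.dirname(self.checkpoint_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        # The rename is atomic, so a reader never sees a half-written file.
        os.replace(tmp_path, self.checkpoint_path)

    @classmethod
    def recover(cls, checkpoint_path):
        """Rebuild an operator (possibly on another node) from the last checkpoint."""
        op = cls(checkpoint_path)
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                state = json.load(f)
            op.batch_id = state["batch_id"]
            op.counts = state["counts"]
        return op
```

Anything processed after the last checkpoint is lost on recovery and has to be replayed from upstream, which is why the choice between synchronous and asynchronous checkpointing interacts with delivery guarantees.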
The IBM Hadoop Dev blog has a post describing how to create email alerts to track when the status of a service becomes UNKNOWN (which is the case during a YARN HA failover).
This tutorial covers setting up an Amazon EMR cluster with Apache Spark and Apache Zeppelin (incubating). Next, the post gives the steps (i.e. setting up an SSH tunnel) needed to access the Zeppelin web UI. From there, it walks through building a recommendation engine using the MovieLens dataset and a MatrixFactorization MLlib function.
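For readers unfamiliar with matrix factorization, here is a minimal pure-Python sketch of the idea behind such a recommender. It is not the tutorial's code and does not use MLlib: it swaps in plain stochastic gradient descent, and the function names, hyperparameters, and (user, item, rating) tuple format are assumptions made for illustration.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and item i's factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root-mean-square error over the known (user, item, rating) triples."""
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_items, rank=2, epochs=200,
              lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q with regularized SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Step each factor along the error gradient, with L2 shrinkage.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q
```

Once P and Q are learned, predict(P, Q, u, i) for an unseen (user, item) pair gives an estimated rating, and the highest estimates become the recommendations.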
This presentation describes best practices for several non-trivial features of Spark: RDD re-use, working with key-value data, Spark accumulators, and SparkSQL. The presentation also has a preview of some work underway in Spark MLlib.
Typesafe recently announced that they'll provide commercial support for Apache Spark.
This post lists 10 free Hadoop tutorials ranging from short to multi-step and in both text and video form.
The call for speakers for Strata+Hadoop World London, which takes place in May/June of 2016, is open until December 11th.
Apache Spark 1.5.2 was released this week. It's a maintenance release with over 60 resolved issues.
Apache Mahout 0.11.1 was also released this week. The new version contains 10 bug fixes and several performance improvements—to Spark support, to dot product calculations, and to
The 3.0 release of Apache Cassandra contains a number of new performance optimizations, data storage savings, and developer enhancements. The new version was released this week.
Succinct Spark is a new library for interacting with data in the Succinct distributed store. Succinct compresses and indexes input data, which can lead to massive speedups to Spark programs in certain situations. The introduction describes several ways that Succinct is integrated with Spark, provides example code, and describes performance benefits.
Curated by Datadog ( http://www.datadog.com )
Hive Contributors Meetup (Santa Clara) - Monday, November 16
Intro to Apache Spark for Java and Scala Developers (Mountain View) - Wednesday, November 18
One Hadoop, Multiple Clouds (Palo Alto) - Wednesday, November 18
Kafka November Meetup (Mountain View) - Wednesday, November 18
Deep Dive Into Spark Streaming (Bellevue) - Wednesday, November 18
Hadoop in the Cloud (Clayton) - Tuesday, November 17
Building a Real-Time Transformation Engine on Spark Streaming (Saint Paul) - Thursday, November 19
Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, November 16
First Miami HUG Meetup! (Coral Gables) - Thursday, November 19
Agile, Nimble, Tenacious: Modern Data Analytics with Apache Spark (Charlotte) - Wednesday, November 18
Practical Introduction to Apache Flink (Vienna) - Thursday, November 19
Connecticut Big Data #3 (Windsor) - Wednesday, November 18
Hands-On Presto Workshop (Boston) - Tuesday, November 17
Spark, Big Data and Analytics Meetup (Boston) - Thursday, November 19
Hadoop: Intro and Experiences with AWS Elastic Map Reduce (Montevideo) - Wednesday, November 18
Lets Get Started with Hadoop #4 (Oslo) - Thursday, November 19
Schedoscope: Pain-free Scheduling for Agile Hadoop Data Warehouses (Munich) - Tuesday, November 17
Lighting Talks: Flink Streaming, Spark Streaming, Control-M, Qlik (Warsaw) - Thursday, November 19
6th BigData/DataScience Cluj-Napoca Meetup (Cluj-Napoca) - Tuesday, November 17
Mumbai Spark Meetup 4Q2015 (Mumbai) - Saturday, November 21
Apache Spark: Introduction to Spark DataFrames/SQL and Deep Dive (Bangalore) - Saturday, November 21
Spark in Cloud, SparkR and Machine Learning, Demos from Spark Hackathon (Sydney) - Tuesday, November 17
Apache Spark Intro, RDD Basics and SparkSQL (Auckland) - Thursday, November 19