15 November 2015
This issue is short and sweet, with coverage of Amazon EMR, Apache Apex, Apache Ambari, and more. Apache Spark, Apache Cassandra, and Apache Mahout all released new versions this week, and Succinct Spark is an interesting new project to keep an eye on.
Given the elasticity and capabilities (such as the S3 blob store) of the AWS cloud environment, AWS Elastic MapReduce has some unique features available. This article covers several of them—EMR's distinction between core and task nodes to support elasticity, the EMR FileSystem, S3DistCp, and more.
http://cloudacademy.com/blog/amazon-emr-five-ways-to-improve-the-way-you-use-hadoop/
This post from the Cloudera blog describes how to continuously ingest data into HDFS and Hive (using one-minute batches) for querying by Impala. In addition to the basic concepts, the post describes how to productionize this setup by using staging tables (which are compacted daily) and a view over the compacted and active staging tables.
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
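The staging-table pattern from the post above can be modeled in a few lines. This is only an in-memory sketch, not the post's Hive/Impala setup: the class name IngestTables, the row shape, and the method names are all hypothetical, chosen to show why readers query a view over both tables rather than either table directly.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Row = Tuple[str, int]  # (event_id, value): a hypothetical row shape


@dataclass
class IngestTables:
    """Toy stand-in for the active and compacted staging tables plus the view."""

    active: List[List[Row]] = field(default_factory=list)  # one small batch per ingest
    compacted: List[Row] = field(default_factory=list)     # output of the periodic compaction

    def ingest(self, batch: List[Row]) -> None:
        # Each micro-batch lands as its own small unit in the active table.
        self.active.append(list(batch))

    def compact(self) -> None:
        # Periodic job: fold every small batch into the compacted table,
        # then empty the active table.
        for batch in self.active:
            self.compacted.extend(batch)
        self.active = []

    def view(self) -> List[Row]:
        # Readers always query the union of both tables, so compaction
        # never changes the set of rows they see.
        return self.compacted + [row for batch in self.active for row in batch]
```

Ingesting two batches, compacting, and then ingesting a third leaves view() returning all three batches' rows, while only the newest batch remains in the active table.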
This post describes how Apache Apex (incubating), the batch and stream processing platform, utilizes checkpointing for fault tolerance. Checkpoint data is written to HDFS, either asynchronously or synchronously (depending on the delivery guarantees), which allows any node in the cluster to recover the state.
https://www.datatorrent.com/blog-introduction-to-checkpoint/
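The general idea of checkpoint-and-recover can be sketched as follows. This is not Apex's API: CountingOperator, its methods, and run_demo are hypothetical, a local directory stands in for HDFS, and the checkpoint is written synchronously. It only shows how persisting operator state to shared storage lets a fresh process resume from the last checkpoint.

```python
import json
import os
import tempfile


class CountingOperator:
    """Hypothetical stateful operator that counts the words it has seen."""

    def __init__(self):
        self.counts = {}
        self.window_id = 0  # id of the last fully processed window

    def process_window(self, window_id, words):
        for word in words:
            self.counts[word] = self.counts.get(word, 0) + 1
        self.window_id = window_id

    def checkpoint(self, directory):
        # Write to a temporary file and rename it, so a crash mid-write
        # never leaves a partial checkpoint behind.
        state = {"window_id": self.window_id, "counts": self.counts}
        tmp_path = os.path.join(directory, "checkpoint.tmp")
        final_path = os.path.join(directory, "checkpoint.json")
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, final_path)

    @classmethod
    def recover(cls, directory):
        # Any process that can read the shared directory can rebuild the state.
        op = cls()
        with open(os.path.join(directory, "checkpoint.json")) as f:
            state = json.load(f)
        op.window_id = state["window_id"]
        op.counts = state["counts"]
        return op


def run_demo():
    """Checkpoint after window 1, lose window 2, recover, and replay window 2."""
    with tempfile.TemporaryDirectory() as directory:
        op = CountingOperator()
        op.process_window(1, ["a", "b", "a"])
        op.checkpoint(directory)
        op.process_window(2, ["b"])  # this work is lost in the simulated crash
        restored = CountingOperator.recover(directory)
        recovered_window = restored.window_id
        restored.process_window(2, ["b"])  # replay from the checkpointed window
        return recovered_window, restored.counts
```

In this sketch the recovered operator knows which window it last completed, so the caller can replay everything after that point; how much is replayed, and whether duplicates are possible, is what the delivery guarantee determines.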
The IBM Hadoop Dev blog has a post describing how to create email alerts to track when the status of a service becomes UNKNOWN (which is the case during a YARN HA failover).
This tutorial covers setting up an Amazon EMR cluster with Apache Spark and Apache Zeppelin (incubating). It then gives the steps (i.e., setting up an SSH tunnel) needed to access the Zeppelin web UI, followed by instructions for building a recommendation engine using the MovieLens dataset and an MLlib matrix factorization function.
This presentation describes best practices for several non-trivial features of Spark: RDD re-use, working with key-value data, Spark accumulators, and SparkSQL. The presentation also has a preview of some work underway in Spark MLlib.
http://www.slideshare.net/hkarau/beyond-shuffling-global-big-data-tech-conference-2015-sj
Typesafe recently announced that they'll provide commercial support for Apache Spark.
http://www.eweek.com/database/typesafe-launches-support-for-apache-spark.html
This post lists 10 free Hadoop tutorials ranging from short to multi-step and in both text and video form.
http://www.datasciencecentral.com/profiles/blogs/hadoop-tutorials
The call for speakers for Strata+Hadoop World London, which takes place in May/June of 2016, is open until December 11th.
http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/cfp/425
Apache Spark 1.5.2 was released this week. It's a maintenance release with over 60 resolved issues.
http://spark.apache.org/releases/spark-release-1-5-2.html
Apache Mahout 0.11.1 was also released this week. The new version contains 10 bug fixes and several performance improvements: to Spark support, to dot product calculations, and to %*% calculations.
The 3.0 release of Apache Cassandra contains a number of new performance optimizations, data storage savings, and developer enhancements. The new version was released this week.
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces82
Succinct Spark is a new library for interacting with data in the Succinct distributed store. Succinct compresses and indexes input data, which can lead to massive speedups for Spark programs in certain situations. The introduction describes several ways that Succinct is integrated with Spark, provides example code, and describes performance benefits.
https://databricks.com/blog/2015/11/10/succinct-spark-from-amplab-queries-on-compressed-rdds.html
Curated by Datadog ( http://www.datadog.com )
Hive Contributors Meetup (Santa Clara) - Monday, November 16
http://www.meetup.com/Hive-Contributors-Group/events/226495286/
Intro to Apache Spark for Java and Scala Developers (Mountain View) - Wednesday, November 18
http://www.meetup.com/sv-jug/events/226109708/
One Hadoop, Multiple Clouds (Palo Alto) - Wednesday, November 18
http://www.meetup.com/cloudcomputing/events/226450900/
Kafka November Meetup (Mountain View) - Wednesday, November 18
http://www.meetup.com/http-kafka-apache-org/events/225592591/
Deep Dive Into Spark Streaming (Bellevue) - Wednesday, November 18
http://www.meetup.com/Big-Data-Bellevue-BDB/events/219852695/
Hadoop in the Cloud (Clayton) - Tuesday, November 17
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/225046512/
Building a Real-Time Transformation Engine on Spark Streaming (Saint Paul) - Thursday, November 19
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/226472910/
Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, November 16
http://www.meetup.com/Cleveland-Hadoop/events/225670548/
First Miami HUG Meetup! (Coral Gables) - Thursday, November 19
http://www.meetup.com/Miami-Hadoop-User-Group/events/226352443/
Agile, Nimble, Tenacious: Modern Data Analytics with Apache Spark (Charlotte) - Wednesday, November 18
http://www.meetup.com/CharlotteHUG/events/219153256/
Practical Introduction to Apache Flink (Vienna) - Thursday, November 19
http://www.meetup.com/Washington-DC-Area-Apache-Flink-Meetup/events/225769282/
Connecticut Big Data #3 (Windsor) - Wednesday, November 18
http://www.meetup.com/Connecticut-Big-Data/events/224851797/
Hands-On Presto Workshop (Boston) - Tuesday, November 17
http://www.meetup.com/bostonhadoop/events/226523527/
Spark, Big Data and Analytics Meetup (Boston) - Thursday, November 19
http://www.meetup.com/Big-Data-Developers-in-Boston/events/226659145/
Hadoop: Intro and Experiences with AWS Elastic Map Reduce (Montevideo) - Wednesday, November 18
http://www.meetup.com/Montevideo-BigData-DataScience-Meetup/events/226378357/
Lets Get Started with Hadoop #4 (Oslo) - Thursday, November 19
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/222558927/
Schedoscope: Pain-free Scheduling for Agile Hadoop Data Warehouses (Munich) - Tuesday, November 17
http://www.meetup.com/Hadoop-User-Group-Munich/events/225557422/
Lighting Talks: Flink Streaming, Spark Streaming, Control-M, Qlik (Warsaw) - Thursday, November 19
http://www.meetup.com/warsaw-hug/events/226348148/
6th BigData/DataScience Cluj-Napoca Meetup (Cluj-Napoca) - Tuesday, November 17
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/226443828/
Mumbai Spark Meetup 4Q2015 (Mumbai) - Saturday, November 21
http://www.meetup.com/Big-Data-Developers-in-Mumbai/events/223301848/
Apache Spark: Introduction to Spark DataFrames/SQL and Deep Dive (Bangalore) - Saturday, November 21
http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/226419828/
Spark in Cloud, SparkR and Machine Learning, Demos from Spark Hackathon (Sydney) - Tuesday, November 17
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/225866988/
Apache Spark Intro, RDD Basics and SparkSQL (Auckland) - Thursday, November 19
http://www.meetup.com/Auckland-Apache-Spark-User-Group/events/225893445/