Hadoop Weekly Issue #69

11 May 2014

Once again, this week’s newsletter has a number of good articles about Apache Spark, which continues to take the Hadoop ecosystem by storm. This week also has an interesting technical post on a new feature of Apache Ambari, as well as an in-depth interview with Cloudera CTO Amr Awadallah.

Technical

These slides from a presentation at Data Science Days give an intro to Apache Spark. The talk covers all of the key components of the Spark stack, including Shark (SQL), Spark Streaming, MLlib, and GraphX.

https://speakerdeck.com/mhausenblas/apache-spark-the-light-at-the-end-of-the-tunnel

The Cloudera blog has a post covering the design and deployment of YARN Resource Manager (RM) High Availability (HA). This topic was also recently covered by the Hortonworks blog, but this post covers Cloudera-specific parts (like Cloudera Manager integration) and is another valuable resource for understanding YARN RM HA.

http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/

A post on the codecentric blog covers a new feature of Apache Ambari, the Hadoop cluster management software, called blueprints. Blueprints provide a mechanism for describing host configurations in a cluster independent of actual nodes. The post describes the feature in depth, including Ambari’s blueprint HTTP endpoints, an example configuration, and an example deployment using the Ambari API.

https://blog.codecentric.de/en/2014/05/lambda-cluster-provisioning/
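
To make the idea of describing host groups independently of actual nodes concrete, here is a minimal sketch in Python. It is not taken from the codecentric post: the blueprint name, host names, and both helper functions are hypothetical, and the JSON field names reflect our understanding of Ambari's blueprint format, so check them against the post before relying on them. No HTTP calls are made; the sketch only shows how a blueprint (roles) and a cluster template (machines) fit together.

```python
import json

# Hypothetical blueprint: host groups describe roles, not specific machines.
blueprint = {
    "Blueprints": {
        "blueprint_name": "small-cluster",  # assumed name, for illustration only
        "stack_name": "HDP",
        "stack_version": "2.1",
    },
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "worker", "cardinality": "2",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}

# Hypothetical cluster template: maps each host group to real hosts.
cluster_template = {
    "blueprint": "small-cluster",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "master1.example.com"}]},
        {"name": "worker", "hosts": [{"fqdn": "worker1.example.com"},
                                     {"fqdn": "worker2.example.com"}]},
    ],
}


def unmapped_groups(blueprint, template):
    """Return blueprint host groups that the template assigns no hosts to."""
    defined = {group["name"] for group in blueprint["host_groups"]}
    mapped = {group["name"] for group in template["host_groups"] if group.get("hosts")}
    return sorted(defined - mapped)


def hosts_by_component(blueprint, template):
    """Expand the blueprint against the template: component name -> list of FQDNs."""
    hosts = {group["name"]: [host["fqdn"] for host in group["hosts"]]
             for group in template["host_groups"]}
    placement = {}
    for group in blueprint["host_groups"]:
        for component in group["components"]:
            placement.setdefault(component["name"], []).extend(hosts.get(group["name"], []))
    return placement


# Show where each component would land for this particular host mapping.
print(json.dumps(hosts_by_component(blueprint, cluster_template), indent=2))
```

The same blueprint could be paired with a different template (say, ten workers instead of two) without any change to the role definitions, which is the separation the feature is meant to provide.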

This talk describes how Accumulo might someday get support for SQL queries via the Hive API and other SQL-on-Hadoop systems. It also talks about some of the problems that are unique to Accumulo, such as visibility labels and authorizations.

http://www.slideshare.net/DonaldMiner/sq-lon-accumulo

A talk entitled “Collaborative Filtering with Spark” covers using Spark’s MLlib to compute alternating least squares. After motivating a switch from iterative MapReduce, the talk describes two failed attempts that suffered from bottlenecks, and also the final solution that uses a custom partitioner.

http://www.slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark
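
For readers new to alternating least squares, the core loop can be sketched on a single machine in plain Python. This is a toy, not the talk's implementation and not MLlib's API: it uses one latent factor per user and item, explicit ratings, and hypothetical names (`als_rank1`, `squared_error`, `reg`). The distributed partitioning that the talk focuses on is not shown.

```python
def squared_error(ratings, x, y):
    """Sum of squared differences between observed ratings and x[u] * y[i]."""
    return sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())


def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Rank-1 ALS. ratings maps (user, item) -> value; returns (user_factors, item_factors)."""
    x = [1.0] * n_users  # one latent factor per user
    y = [1.0] * n_items  # one latent factor per item
    for _ in range(iterations):
        # Hold item factors fixed; each user is a 1-D ridge regression with a closed form.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            x[u] = sum(r * y[i] for i, r in seen) / (sum(y[i] ** 2 for i, r in seen) + reg)
        # Hold user factors fixed; solve each item the same way.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            y[i] = sum(r * x[u] for u, r in seen) / (sum(x[u] ** 2 for u, r in seen) + reg)
    return x, y


# Toy data: an exactly rank-1 ratings matrix, [1, 2, 3] outer [2, 1].
ratings = {(u, i): a * b for u, a in enumerate([1, 2, 3]) for i, b in enumerate([2, 1])}
x, y = als_rank1(ratings, n_users=3, n_items=2)
print(squared_error(ratings, x, y))
```

Each half-step only needs the ratings for one user (or one item) plus the factors on the other side, which is why the way ratings and factors are partitioned across machines matters so much in a distributed setting.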

The MapR blog has a post with a number of questions and answers about the features and status of Apache Spark. Of particular interest, it covers the status of Spark’s integrations with several ecosystem projects and tools like R, Apache Giraph, Apache Pig, Apache HBase, Apache Mahout, and Apache Mesos.

http://www.mapr.com/blog/let-spark-fly-advantages-and-use-cases-spark-hadoop-webinar-follow

News

Databricks and DataStax announced a new partnership to integrate Apache Spark with Apache Cassandra.

http://databricks.com/blog/2014/05/08/Databricks-and-Datastax.html

MapR released information about their customer growth in Q1 2014. Specifically, they had record bookings, with ~90% of sales coming from subscription licensing. The press release also sheds some light on MapR’s customer base, which they promote as diverse both geographically and across industries.

http://www.marketwatch.com/story/mapr-reports-record-growth-in-the-first-quarter-2014-2014-05-05

Forbes.com has an interview with Cloudera CTO Amr Awadallah. The interview covers a lot of ground, including Cloudera’s vision for an Enterprise Data Hub, the Cloudera-Intel deal, vendor lock-in (and how Cloudera thinks about open-source), and competitors' business models.

http://www.forbes.com/sites/danwoods/2014/05/09/clouderas-strategy-for-conquering-big-data-the-enterprise/

Hortonworks and two partners announced that the partners’ products are certified for HDP 2.1. Specifically, RainStor 5.5 and Syncsort’s DMX-h are now certified.

http://hortonworks.com/blog/rainstor-5-5-now-certified-hdp-2-1/
http://hortonworks.com/blog/dmx-h-syncsort-now-certified-hortonworks-data-platform-2-1/

The Qubole blog has a post of 10 “Hadoop Happenings,” which includes links to and one-sentence summaries of several stories from the Hadoop ecosystem. There are both technical and industry-related articles in the post.

http://www.qubole.com/hadoop-happenings-guides-provider-overviews/

Releases

Apache HBase 0.94.19 was released, containing a number of bug fixes.

http://mail-archives.apache.org/mod_mbox/hbase-user/201404.mbox/%3C1398724015.43306.YahooMailNeo@web140603.mail.bf1.yahoo.com%3E

MapR and HP announced general availability of HP Vertica on MapR. The integration, which runs Vertica backed by the MapR FS, was originally announced in February.

http://www.mapr.com/blog/hp-vertica-analytics-platform-mapr-distribution-generally-available

Rackspace announced support for a new version of the Hortonworks Data Platform as part of their Cloud Big Data Platform (which is in limited availability). In addition to support for HDP 2.1, Rackspace has created a service to self-provision Data Nodes.

http://www.rackspace.com/blog/hortonworks-data-platform-in-the-cloud-3-clicks-to-cluster/

Qubole has announced that their Presto-as-a-Service offering is generally available after 4 months of alpha testing. Unlike many SQL-on-Hadoop implementations, Presto supports queries over data stored in S3 in the AWS cloud. Their blog post notes that Presto shows 2-7.5x speedups over Hive when running on data stored in S3.

http://www.qubole.com/general-availability-of-presto-as-a-service/

Apache Accumulo 1.6 was released this week. The new release includes enhancements across scalability & performance, administration, development flexibility, encryption, and more. The Sqrrl blog has a summary of the release notes, which are pretty epic given that Accumulo 1.6 includes over 650 closed tickets.

http://accumulo.apache.org/release_notes/1.6.0.html
http://sqrrl.com/announcing-apache-accumulo-1-6/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Invited Speaker Series: Malware Detection using Spark on MapR (Westlake Village) - Tuesday, May 13
http://www.meetup.com/Westlake-Village-Data-Science-Meetup/events/181408662/

May SF Hadoop Users Meetup (San Francisco) - Wednesday, May 14
http://www.meetup.com/hadoopsf/events/179881622/

Apache Drill: Building Highly Flexible, High Performance Query Engines (Palo Alto) - Wednesday, May 14
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/180365642/

Washington State

Spring 2014 Seminar Series: Big Data Infrastructure (Tacoma) - Wednesday, May 14
http://www.meetup.com/Tacoma-Big-Data/events/179062442/

xPatterns on Spark, Shark, Mesos, & Tachyon (Bellevue) - Wednesday, May 14
http://www.meetup.com/Seattle-Spark-Meetup/events/163550042/

Texas

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, May 14
http://www.meetup.com/Austin-ACM-SIGKDD/events/171159732/

Data Curiosity Meetup – Hadoop & High-Performance Computing (Round Rock) - Thursday, May 15
http://www.meetup.com/Data-Curiosity-All-Things-Data/events/177496182/

Illinois

Choosing the Right Data Architecture for Your Big Data Project (Chicago) - Wednesday, May 14
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/

Kentucky

Monthly Meetup - Big Data: Lions, and Tigers and Pig Ohh My (Louisville) - Thursday, May 15
http://www.meetup.com/Louisville-DotNet/events/181457352/

Ohio

Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, May 12
http://www.meetup.com/Cleveland-Hadoop/events/174665922/

Pennsylvania

May HUG Meetup (Pittsburgh) - Wednesday, May 14
http://www.meetup.com/HUG-Pittsburgh/events/178920432/

Georgia

Lars George - HBase (Atlanta) - Tuesday, May 13
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/181755392/

New York

Workshop for Beginners IV: Getting started with Spark and SparkSQL (Brooklyn) - Friday, May 16
http://www.meetup.com/New-York-Big-Data-Workshop/events/176635752/

UNITED KINGDOM

Hadoop presentation with Kim Fowler (Taunton) - Monday, May 12
http://www.meetup.com/Taunton-and-area-Developers-Meetup/events/170647382/

May Hadoop MeetUp (London) - Tuesday, May 13
http://www.meetup.com/hadoop-users-group-uk/events/180734022/
