07 September 2014
While last week’s issue had posts covering a few common themes, this week’s issue has content on a wide range of topics. Those topics include Spork (Pig on Spark), Hive (specifically the new Stinger.next initiative), and Presto. There is also some interesting news from established enterprise companies: Teradata has acquired Think Big Analytics, and Cisco has released management and monitoring software for Hadoop.
A project called “Spork” has been working to build a Spark-based execution engine for Pig. This post gives an overview of how that implementation was built, describes how to get started (it’s as simple as pig -x spark), and reports the current status (passing 100% of tests, but not yet merged to Pig-trunk).
Hortonworks has announced “Stinger.next,” a continuation of the Stinger Initiative (which had a goal of speeding up Hive by 100x). The project has three goals—speed improvements to support sub-second queries, improved scalability, and improved SQL support for transactions and SQL-2011’s analytic functions. There are a few other enhancements planned, too, like materialized views and streaming ingestion of data. The project is split into three phases, which will deliver in 2H 2014, 1H 2015, and 2H 2015.
This post is a good introduction to Spark’s RDD APIs. It looks at how to translate some mapper and reducer functions from a traditional MapReduce implementation. Some concepts translate directly, but you’ll quickly see several new methods like flatMap. The post also shows that a simple flow becomes much more complex if you need to do some setup or teardown (a la the MapReduce APIs).
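To make the map versus flatMap distinction concrete, here is a minimal sketch in plain Python. It does not use Spark; the helper names (mapper, flat_map, reduce_by_key) are illustrative stand-ins for the semantics of a MapReduce mapper, a flatMap, and a per-key reduce, and the input lines are made up.

```python
from itertools import chain

def mapper(line):
    """MapReduce-style mapper: one input line yields zero or more (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def flat_map(func, records):
    """Apply func to every record and flatten the per-record lists into one list."""
    return list(chain.from_iterable(func(record) for record in records))

def reduce_by_key(pairs, combine):
    """Fold together all values that share a key, like a reducer would."""
    totals = {}
    for key, value in pairs:
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals

lines = ["spark and pig", "pig on spark", ""]

nested = [mapper(line) for line in lines]  # plain map: one list per input line
pairs = flat_map(mapper, lines)            # flatMap: a single flat list of pairs
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

A plain map keeps one output per input line (including an empty list for the empty line), while the flattened version matches what a mapper emitting zero or more records per input would produce.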
This blog series is an excellent overview and introduction to Spark, Pig, Hive, and Cloudera Impala. For each, it gives a brief introduction to the computation model and features of the framework. The coverage of Hive includes a walkthrough of new features from the Stinger Initiative—ORCFile, query planner improvements, Tez as a backend, and vectorization (there’s quite a good technical overview of each). The coverage of Impala is also quite interesting—it enumerates several reasons that Impala tends to be faster than Hive (e.g. JVM GC overhead, Impala is better at pipelining, pull vs. push of intermediate data, and more).
This post describes integrating recommendations built by Mahout with a search engine to solve the cold-start problem (i.e. how to recommend to a new user). Using some preferences collected during registration, the system queries item-similarity data stored in the search engine to create recommendations.
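The lookup step can be sketched roughly as follows. This is only an illustration of the idea, not the post's implementation: an in-memory dictionary stands in for the search engine index, and the item names, similarity scores, and the recommend function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical item-similarity data, of the kind an offline job might produce
# and load into a search index: item -> [(similar item, similarity score)].
SIMILAR_ITEMS = {
    "hadoop-book": [("hive-book", 0.9), ("pig-book", 0.7)],
    "spark-book": [("hive-book", 0.4), ("scala-book", 0.8)],
}

def recommend(preferences, similar_items, top_n=3):
    """Score items similar to the stated preferences and return the top few."""
    scores = defaultdict(float)
    for item in preferences:
        for other, similarity in similar_items.get(item, []):
            if other not in preferences:  # don't recommend what the user already picked
                scores[other] += similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_n]

# Preferences a new user might select during registration.
new_user_recs = recommend(["hadoop-book", "spark-book"], SIMILAR_ITEMS)
```

Because the scoring needs only the handful of items chosen at signup, it works for a user with no history at all.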
Qubole’s Presto-as-a-Service exposes Presto, which is a SQL-on-Hadoop system from Facebook, as a hosted service. It operates on data stored in S3, which means that data isn’t local to compute nodes. In order to get better speed, Qubole has architected a caching layer for Presto which supports both in-memory and SSD-based caches. This post explains the implementation, which uses consistent hashing and takes advantage of the kernel’s file caching (rather than building their own in-memory store). The post also has some experimental results, which show a 10-15x speedup by enabling caching and switching to ORCFile (from text).
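For readers unfamiliar with consistent hashing, here is a minimal sketch of how files might be assigned to cache nodes. It is not Qubole's code: the HashRing class, the use of MD5, the virtual-node count, and the node and file names are all assumptions made for illustration.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a stable position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes (replicas) per physical node."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def node_for(self, key):
        """Return the node owning the first ring position after the key's hash."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("s3://example-bucket/table/part-00001")
```

The appeal for a cache is that when a node leaves, only the files it owned are reassigned; everything cached on the remaining nodes stays where it is.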
With MapReduce on YARN, there’s no longer a long-lived JobTracker. For inspecting job histories, the MR Job History Server was introduced. Recently, a more general-purpose implementation called the Timeline Server was conceived. The Timeline Server supports frameworks other than MapReduce, such as Tez (Spark and MR support are on the way). This post introduces the Timeline Server and gives an overview of the data it exposes (which is much richer than what the MR Job History Server provides).
This post serves as a good example of how easy it is to inadvertently write a poor MapReduce job, even when using a higher-level framework like Hive. Specifically, the post describes the steps taken to discover that a Hive query was inadvertently doing a full cross-product. It also mentions how you might identify the cross-product from the EXPLAIN query plan output.
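The effect is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module rather than Hive, so the engine and plan output differ, and the tables and row counts are made up; it only shows how a forgotten join condition multiplies the row count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i % 10) for i in range(100)]
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(i, f"customer-{i}") for i in range(10)]
)

# Join condition forgotten: every order is paired with every customer.
cross_rows = conn.execute(
    "SELECT COUNT(*) FROM orders o, customers c"
).fetchone()[0]

# Intended query: each order matches exactly one customer.
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
).fetchone()[0]
```

With 100 orders and 10 customers, the first query counts 1,000 rows and the second counts 100. At production table sizes the same mistake turns into a very expensive job.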
SequenceIQ recently announced Periscope, a system for auto-scaling YARN clusters to meet SLAs. This post introduces some of the terminology for Periscope: alarms and metrics. It describes some of the metrics (e.g. LOST_NODES), how to build an alarm using a REST API call, and how to set a scaling policy using the REST API.
This post gives a high-level overview of the steps needed to deploy Hadoop for ETL. It discusses using Apache Flume and Apache Sqoop to bring data into Hadoop, introduces the concept of “schema-on-read,” provides suggestions for which frameworks to use to do ETL, and gives an intro to workflow management (suggesting that Apache Oozie is often insufficient).
Databricks, the company founded by the team originally behind Apache Spark, has produced an infographic on the progress Spark has made in the past year. It highlights how far the project has come, particularly in terms of community growth (e.g. Spark 0.7 had 31 contributors, Spark 1.0 had 117).
This post describes five influential Google projects that have spawned open-source equivalents. The five projects are Google MapReduce (Apache Hadoop MapReduce), Bigtable (HBase), Borg (Mesos), Chubby (ZooKeeper), and Dremel (Drill).
Infobright and MapR announced a partnership for joint deployment of MapR’s distribution and Infobright’s analytics platform.
DataStax, which sells commercial software for Apache Cassandra, made news this week with a giant Series E round of financing. The deal brings in $106 million (bringing the total investment to $190 million) and values the company at $830 million.
Teradata has acquired Think Big Analytics, a big data enterprise services company. Notably, Think Big Analytics helps companies integrate Hadoop and NoSQL with existing technologies. Thus, industry pundits are seeing this acquisition as Teradata embracing the non-enterprise DW big data ecosystem.
Amazon Web Services announced support for Hive 0.13.1 in its Elastic MapReduce (EMR) Hadoop-as-a-Service offering. In a secondary announcement, EMR has removed the 256-step limit from EMR clusters as part of AMI 3.1.1.
Hivemall version 0.2 was released. It’s a stable release of the machine learning library for Hive, which is built using Hive User-Defined Functions. The 0.2 release follows five pre-releases, and a new 0.3 beta is also available. Hivemall supports functions for classification, regression, item similarity, k-nearest neighbor, and feature engineering. It requires Hive 0.11 or later.
Version 1.1.1 of hbase-client, the async HBase client for Node.js, was released. The release supports HBase 0.94.x.
Cisco has announced UCS Director Express for Big Data, which is an automation and configuration tool for Hadoop services. It also provides monitoring of physical hardware alongside Hadoop services.
The folks at SequenceIQ have published Docker images for running Apache Phoenix (if you’re not familiar, Phoenix is a SQL engine that runs atop Apache HBase). The post describes how to launch a container running Phoenix 4.1 on HBase 0.98.5, create some tables, and connect to the database using both the SQL shell and JDBC.
WANdisco released version 1.9.6 of Non-Stop Hadoop. The new version adds zoning (virtual clusters or sub-clusters) and support for rolling upgrades.
Qubole, which offers Presto (the SQL-on-Hadoop system) as a service, has added a new auto-scaling feature. By inspecting statistics kept by Presto, the system determines when the cluster is under- or over-provisioned and automatically adds or removes nodes.
Curated by Mortar Data ( http://www.mortardata.com )
SD Big Data Monthly Meetup #3 (San Diego) - Wednesday, September 10
September SF Hadoop Users Meetup (San Francisco) - Wednesday, September 10
Real-Time Analytics with Storm by Ron Bodkin of Think Big Analytics (Los Angeles) - Thursday, September 11
Hunk for Hadoop (Houston) - Wednesday, September 10
Monthly SEMOP meeting (Southfield) - Tuesday, September 9
Cleveland Big Data at the Great Lakes Science Center (Cleveland) - Monday, September 8
Sep 11: Igniting Data Analysis with Apache Spark by Ryan Gimmy (Reston) - Thursday, September 11
Nikhil Kumar (SyncSort) on Converting SQL to MapReduce (Durham) - Tuesday, September 9
RDBMS on Hadoop? Yes, talk and hands-on session from Splice Machine (Flemington) - Tuesday, September 9
Get Hands-on with Big SQL on Hadoop (New York) - Wednesday, September 10
FREE EVENT! Hadoop and Mainframes: Crazy, or Crazy Like a Fox? (New York) - Wednesday, September 10
Machine Learning on the Azure Cloud Platform (New York) - Friday, September 12
A Detailed Look at R on Hadoop (Moscow) - Thursday, September 11
Apache Spark - In Memory Map-Reduce (Hyderabad) - Saturday, September 13