16 February 2014
There were a lot of product releases and announcements in the ecosystem this week as folks met in Santa Clara for StrataConf. Among the highlights were announcements from MapR related to their distribution and a partnership with HP, a new beta of Cloudera’s CDH5, and the public preview of Hadoop 2 on Windows Azure. In addition, there are a number of interesting technical articles about HBase, MapReduce v2, Pig, Hadoop security, and more. Congrats to folks on all the releases, partnerships, and great articles. Also, a big congrats to Splice Machine for raising a new round of financing.
The Cloudera blog has an interesting technical post about performance in MRv2. The post describes some of the major revamps that took place in MRv2, and it describes some performance regressions found by running the same jobs on both MRv1 and MRv2. The post walks through the low-level debugging that was done to identify the root cause of two of the issues, and it explains the fixes that were made. It’s a pretty technical overview including discussion of the perf tool, CPU cache latency, fadvise, and more.
Understanding the HBase memory model, in particular how it caches data, is an important part of tuning an HBase deployment. This post walks through the two main parts of memory that HBase manages: the MemStore and the BlockCache. It focuses on the implementation of the BlockCache, explains how the BlockCache speeds up queries, and gives a tour of the three BlockCache implementations shipped with HBase. The post is also annotated with in-depth technical details.
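To make the idea of a read cache concrete, here is a minimal sketch of a read-through, least-recently-used block cache in Python. It is not HBase's implementation: the class name `TinyBlockCache`, the `load_block` callback, and the capacity counted in blocks are all assumptions made for illustration. It only shows why repeated reads of a hot block avoid a trip to disk.

```python
from collections import OrderedDict


class TinyBlockCache:
    """Hypothetical read-through LRU cache, for illustration only."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity      # maximum number of cached blocks (assumed unit)
        self.load_block = load_block  # callback that performs the slow read
        self.blocks = OrderedDict()   # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            self.hits += 1
            return self.blocks[block_id]
        # Cache miss: do the slow read, then remember the result.
        self.misses += 1
        data = self.load_block(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            # Evict the least recently used block.
            self.blocks.popitem(last=False)
        return data


disk_reads = []


def read_from_disk(block_id):
    # Stand-in for an expensive read from the underlying file system.
    disk_reads.append(block_id)
    return f"data-{block_id}"


cache = TinyBlockCache(capacity=2, load_block=read_from_disk)
for block_id in ["a", "b", "a", "c", "b"]:
    cache.get(block_id)
```

In this run the second read of block "a" is served from memory, while "b" has to be read again because it was evicted to make room for "c".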
HiveServer2 is the latest and greatest way to interact with Hive. The new service provides JDBC and ODBC interfaces, and the new Hive CLI client, beeline, connects to HiveServer2 via JDBC. Beeline introduces a number of changes (vs. the hive CLI) across several CLI operations, including specifying a connection, running in embedded mode, variable handling, and more. A post on the Cloudera blog has a detailed overview of the changes in beeline, which is essential knowledge for anyone looking to migrate.
Rounding out a trifecta of interesting technical posts this week, Cloudera elaborates on their reference architecture for running CDH in AWS. The post is a FAQ covering areas of the AWS deployment model such as VPC, security groups, subnets, instance types, and EBS. From personal experience, I can attest that a lot of the recommendations in this post ring true and are very valuable advice.
Apache Pig gained new functionality to compute CUBEs and ROLLUPs in version 0.11. Many data scientists and engineers working with Hadoop might not be familiar with these primitives, but they are pretty common in the data warehousing world. This walkthrough is a great introduction to CUBE/ROLLUP, which is illustrated by real examples in PigLatin.
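For readers new to these primitives, the following Python sketch shows what the two operations compute. It is not Pig code and does not use any Pig API: the function names and the sample sales data are assumptions made for illustration. ROLLUP aggregates over each prefix of the dimension list, CUBE aggregates over every subset, and a rolled-up dimension is shown as `None`.

```python
from collections import defaultdict
from itertools import combinations


def rollup_groupings(dims):
    """ROLLUP keeps prefixes of the dimension list: (a, b), (a,), ()."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def cube_groupings(dims):
    """CUBE keeps every subset of the dimensions (2**n groupings)."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def aggregate(rows, dims, groupings, measure):
    """Sum `measure` once per grouping; dimensions left out become None."""
    totals = defaultdict(int)
    for row in rows:
        for grouping in groupings:
            key = tuple(row[d] if d in grouping else None for d in dims)
            totals[key] += row[measure]
    return dict(totals)


# Hypothetical sample data.
sales = [
    {"region": "east", "product": "apple", "amount": 10},
    {"region": "east", "product": "pear", "amount": 5},
    {"region": "west", "product": "apple", "amount": 7},
]
DIMS = ["region", "product"]

rollup_totals = aggregate(sales, DIMS, rollup_groupings(DIMS), "amount")
cube_totals = aggregate(sales, DIMS, cube_groupings(DIMS), "amount")
```

Here `rollup_totals` holds per-region subtotals and a grand total, while `cube_totals` additionally holds per-product subtotals such as `(None, "apple")`, which ROLLUP over (region, product) does not produce.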
The latest release of Parkour, the Clojure library for Hadoop, includes support for running Hadoop MapReduce jobs via the Clojure REPL. This tutorial walks through configuring an AWS Elastic MapReduce cluster to run queries from the Parkour REPL. The tutorial implements some non-trivial MR jobs on the Google Book n-gram corpus, and it also includes an example of writing tests in Parkour.
AMPLab’s Big Data Benchmark has been updated to include new versions of Impala, Hive (including Hive on Tez), and Shark. The results continue to show impressive numbers from Redshift, Impala, and Shark, with Hive on Tez gaining ground. I’d suggest taking the results with a grain of salt, though, since they’re only targeting a single dataset and set of queries. But the benchmark is open-source, and you could use the scripts to recreate the evaluation with your own dataset and queries.
There are a number of companion tutorials to the Hortonworks Sandbox, a single-node Hadoop cluster VM. The RHadoop framework for running R on Hadoop is covered in a recently-contributed community tutorial. The tutorial shows how to use RStudio to run a MapReduce job that builds a model to predict visitors to a website based upon historic web logs.
The Gartner blog has a post recapping a recent presentation by Square on encrypting data at rest in Hadoop. Square stores both redacted data and encrypted protobuf-serialized data, which fulfill 80% and 20% of their Hadoop workload, respectively. This is one of the first home-grown encryption systems that I’ve heard of (although I suspect more folks are doing it). Work is in progress to bring similar functionality to Hadoop and HBase (Intel’s distribution has it already), but some folks obviously can’t wait for that to land.
MapR and HP have announced a partnership to bring HP Vertica to MapR’s distribution. Vertica, which is an MPP SQL engine, runs directly on the MapR file system and alongside MapR compute resources. Unlike other variants of SQL-on-Hadoop, it doesn’t seem that Vertica will tightly integrate with the ecosystem (e.g. it won’t read Hadoop file formats or use the Hive metastore), but Vertica is a much more mature system than anything else in the SQL-on-Hadoop realm.
Red Hat and Hortonworks announced that they’ve expanded their partnership. The joint initiative includes support for the Red Hat Storage file system, the RHEL OpenStack Platform, and Red Hat JBoss Data Virtualization, and it further integrates the two companies’ support teams.
Slides and some videos from this week’s StrataConf have been posted online. There are a number of talks about Hadoop and related technologies from folks at Cloudera, MapR, Silicon Valley Data Science, and more. Forbes has a quick rundown of some of the highlights of the conference.
There were a lot of partnerships and announcements in conjunction with StrataConf this week. GigaOm has a good wrap-up of the news including announcements from DataStax, a new tool from Alpine Data Labs, additional vendor support for Storm, and a patent award to Zettaset.
Splice Machine, which has built a transactional SQL engine atop HBase, has raised $15 million in Series B funding. According to the announcement, Splice Machine will be offering a public beta in Q1 2014, which the company says has much better price/performance vs. Oracle databases.
WibiData, the company behind the open-source Kiji Framework, has announced a partnership with DataStax to bring Kiji to Cassandra. Kiji, which provides a so-called entity-centric API, currently supports HBase for data storage. Adding Cassandra will bring support for two of the most-deployed column-family databases in the Hadoop ecosystem. The announcement suggests support for KijiSchema and KijiMR will be released within a few weeks.
GigaOm recently hosted Cloudera CSO and co-founder Mike Olson on the Structure Show podcast. They’ve extracted five key updates on the Hadoop landscape from the ideas discussed in that show. The ideas include “At least part of the database market is safe” and “MapReduce will fade away as innovation flourishes.”
High-Performance Computing (HPC) clusters and Hadoop clusters are typically built with vastly different goals in mind. As a result, the underlying hardware and network topology tend to be very different (HPC often uses expensive, proprietary components whereas Hadoop uses commodity hardware). But there’s an interesting trend of running Hadoop on HPC. For example, the San Diego Supercomputer Center now supports launching a “personal Hadoop cluster” on Gordon, the world’s #88 HPC system. There are many more details on the trend and the implementation in two articles on HPCWire.com.
GCN covers another place that Hadoop is gaining traction—inside of the US Government. In addition to migrations from NAS or SAN to HDFS, Hadoop enables cheaper and simpler network topologies.
Cloudera announced the second beta of CDH5. The updated beta includes lots of new features, including HDFS Caching, NFS Gateway, SSL encryption for Hive on non-Kerberos clusters, and native Parquet support in Hive.
MapR announced support for YARN and Hadoop 2.x. In an introductory blog post, MapR describes their philosophy for supporting the new technology—allowing customers to use either MRv1 or MRv2/YARN. They also support both simultaneously (as well as other technologies like Storm) on the same cluster.
Windows Azure announced a public preview of Hadoop 2.2 in their HDInsight Hadoop-as-a-Service offering. In the announcement, Microsoft promotes the benefits of YARN, describes some of their work on the Stinger initiative, and highlights some example usages of HDInsight.
WANdisco has announced a new product called “Non-Stop HBase.” The product extends HBase to replicate regions in memory to improve latency in case of a RegionServer failure. The implementation, like their Non-Stop Hadoop implementation, uses a patented technology for which the implementation details aren’t public. But the claims of both consistency and continuous availability have raised some eyebrows.
Curated by Mortar Data ( http://www.mortardata.com )
Building Hadoop Data Applications with Kite (Palo Alto) - Tuesday, February 18
Samza: Reliable Stream Processing atop Apache Kafka & YARN by Sriram S./LinkedIn (Los Angeles) - Tuesday, February 18
43rd Bay Area Hadoop User Group (HUG) Monthly Meetup - An Evening on Apache Tez (Sunnyvale) - Wednesday, February 19
February SF Hadoop Users Meetup (San Francisco) - Wednesday, February 19
Advanced Hadoop Based Machine Learning (Austin) - Wednesday, February 19
St. Louis Hadoop Users Group Meetup (Saint Louis) - Tuesday, February 18
Save the Date For Dean Wampler's Talk - Wednesday, February 19
Parkour: Hadoop MapReduce in idiomatic Clojure (Atlanta) - Tuesday, February 18
February Meetup (Pittsburgh) - Wednesday, February 19
Unlock your Hadoop Data with Apache Spark (New York) - Monday, February 17
Hadoop 2 with YARN, and Tez (New York) - Tuesday, February 18
Setting up a Hadoop Cluster on CentOS (Durham) - Saturday, February 22
Monthly Solution Architect Scrum (Toronto) - Thursday, February 20
HBase London - Feb meetup @Cloudera (London) - Monday, February 17
February Hadoop Meetup: Hadoop-as-a-Service & Zookeeper (London) - Tuesday, February 18
Spark! (Krakow) - Thursday, February 20
Test your Hadoop Knowledge (Hyderabad) - Monday, February 17