Data Eng Weekly

Hadoop Weekly Issue #32

25 August 2013

This week's biggest news is today's release of Hadoop 2.1.0, the first beta release of Hadoop 2 (including YARN). There were also some more minor releases (Avro, HBase), and a number of newly announced projects (BlinkeDB, Blueflood). Add in lots of great technical articles covering several components in the Hadoop ecosystem, and you have a great issue for this week!


Unlike a traditional RDBMS, Hive doesn't support foreign keys or other logical links between tables. This means supporting a traditional datawarehousing snowflake model (where a fact table is joined with many dimensions table) puts the impetus on the user writing a query. The folks at InMobi have developed Cube, a new abstraction on Hive, for automatically resolving joins between tables, automatically apply aggregation functions with group by clauses, and more. They have a lot more details about Cube, which has been proposed as a new feature in Hive, in their blog post.

Sqoop is a system for interfacing between a RDBMS and HDFS/HBase via MapReduce. A Sqoop import or export will launch a MapReduce job to initiate a parallel and fault-tolerant transfer between HDFS/HBase and the specified database. The developerWorks blog has thorough walkthrough of the Sqoop command-line and API, including examples of importing data into HBase.

The Kiji blog has a post about KijiSchema's EntityId API, which translates Java objects to/from HBase keys. It has advanced features like prefix-hashes to avoid hotspots, lexicographic integer encoding, and more. If you're developing an application on HBase, this framework solves a lot of design challenges that you'll face, and it's a good read about best practices even if you're not planning on using the framework.

An example of using Spark to convert data from Avro to Parquet and running an optimized job over the parquet data. The post gives some background on Spark, Avro, and Parquet before diving in, and it also includes sample code. Given Parquet's relative youth, this is one of the first posts that I've seen about someone using it, and it's great to hear it works without a new tool like Spark out of the box.

Mortar Data offers a Hadoop-as-a-Service product that runs on Amazon Elastic MapReduce and makes heavy use of Amazon S3. As they've developed their product, they've learned a lot about how to make the best use of S3 with Hadoop. They've shared their knowledge in the form of six tips in a recent blog post. If you're planning a new deploy in AWS or want to optimize your existing one, this list is full of great ideas.

Steve Loughran and Devaraj Das presented on Hoya, which is an implementation of HBase on YARN, at the August 20th HBase HUG. The presentation covers the features and goals of Hoya, whose major feature is on-demand HBase clusters (even multiple versions) on a YARN deploy.

The Cloudera blog features a written and video tutorial on configuring HUE for high availability. The post covers setting up two instances of HUE with Cloudera Manager with HAProxy as the proxy front-end for the HUE servers (but setting up a highly available HAProxy tier is out of the scope of the article).

BlinkDB is a new entrant into the SQL-on-Hadoop from folks at Berkeley's amplab and MIT's CSAIL. It has a fresh take on delivering results faster than MapReduce -- data is sampled to provide results with bounded error (either query time or sample error are specified as part of the query). It's striving to be compatible with the Hive metastore and HiveQL but provide answers 200-300x faster than Hive. It runs atop of Spark and is implemented in Scala.

The NameNode performs a lot of operations during startup, which can be a nightmare to debug when something goes wrong (or slow). Hadoop 2 features a new feature to provide more information about startup -- by starting the NameNode HTTP server early in the startup process and displaying detailed information about startup on the website. The Hortonworks blog has more info about what happens during NameNode startup and a tour of the new feature.

Monash Research spoke with Cloudera about Sentry as well as processing frameworks on Hadoop. In addition to MapReduce, SQL, and Search, it sounds like stream processing (Storm) and a graph database (Giraph) are potentially next in line for CDH. There's also a discussion about a lack of interest in MPI (which is used in scientific computing). The article covering Sentry also notes a lack of interest in cell-based authorization (aside from those in the government) a la Accumulo. Both articles are good reads.


InformationWeek has an interview with Hortonworks VP of Corporate Strategy, Shaun Connolly. The interview covers the recent CTO changes at Hortonworks (and notes that the core team is still there), Hadoop 2/YARN and Tez, and the difference in strategy between Hortonworks and Cloudera.

MapR announced that they're opening a new office in Stockholm to cover the Nordic portion of Europe. They've also announced two key hires for that region.

Monash Research recaps an discussion with Hortonworks about their business -- customer base, partners, and Hadoop use-cases. There are several interesting details, such as the major industry sectors of their customers and an estimated of cost of $900/TB/year for operating a Hadoop cluster coming from a Hortonworks customer.


Intel updated their distribution to include support for the Lustre File System. Lustre is one of the big two file systems in high-performance/scientific computing, and it appears that Intel is making a strategic move to get their distribution into the hands of that community (even if Hadoop isn't their main use case).

Avro 1.7.5 was released with C# support for Avro data files, improvements to ruby and afro-c compression handling, and a number of bug fixes and improvements.

A new version of Kiji was released with an updated KijiExpress. This version updates the Scalding dependency to resolve some issues with the Scalding matrix api.

Apache HBase 0.94.11 was released. It's the new stable build, and it fixes 39 issues. Users of 0.92.x and 0.94.x can do a rolling upgrade to 0.94.11.

Rackspace announced and open-sourced Blueflood, which is a metrics system built on Cassandra. It has similar features to Graphite and OpenTSDB, is written in Java and can run multi-tenant across datacenters.

The first beta release of Apache Hadoop 2 was released today -- version 2.1.0-beta. The release highlights include: api stabilization, binary compatibility with Hadoop 1.x, support for running Hadoop in Windows, HDFS Snapshots, and more. The Hortonworks blog has more details about the major highlights as well as details on moving to Hadoop 2 GA.

DataStax announced a beta release of their Python Driver (cassandra-driver on PyPI), which is built for CQL. The driver includes support for all the features in drivers for other languages, like prepared statements, support for multiple load balancing strategies, and tracing. CQL has been catching steam, and DataStax eventually hopes their driver becomes the de facto choice.

CVE-2013-2192: Apache Hadoop Man in the Middle Vulnerability, affecting the 0.23.x, 1.x and 2.x Hadoop lines was acknowledged. Users are urged to upgrade to the latest release of the given branch (0.23.9+, 1.2.1+, or 2.0.6+).


Curated by Mortar Data ( )

Monday, August 26
YARN - Next gen Hadoop/MR from Hortonworks and Big Data Platform from Talend

Thursday, August 29
Developing Applications with Hadoop 2.0 and YARN

Saturday, August 31
August Big Data/Hadoop Meetup