Data Eng Weekly

Hadoop Weekly Issue #28

28 July 2013

The folks at Cloudera were really busy this week with (by my count) five releases (and one semi-private beta), including a new security-focused project called Sentry. This week's issue also contains technical articles about Cassandra, Kiji, Hive, and more. The summer lull seems to have disappeared, and the content is high-quality and plentiful. Enjoy!


Cloudera released a new open-source project called Sentry that is focused on enhancing Hadoop's security features. Sentry's feature set includes: fine-grained access control, i.e. the ability to use views in Impala and Hive to restrict access to rows/cols in a table; role-based administration to grant various groups different access levels to a dataset; multi-tenant administration to delegate particular datasets to different administrators; and integration with Hadoop Kerberos. This article also discusses the Sentry architecture, which is extensible to projects other than Hive and Impala.
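To make the view-based idea concrete, here is a minimal Python sketch of roles that are granted access to views, where each view exposes only some rows and columns of a table. It is an illustration only: the names `View`, `query`, `views`, and `grants` are hypothetical and do not reflect Sentry's policy format or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class View:
    """A named restriction over a table: visible columns plus a row predicate."""
    columns: Tuple[str, ...]
    row_filter: Callable[[Row], bool]

    def apply(self, table: List[Row]) -> List[Row]:
        # Keep only matching rows, then project down to the allowed columns.
        return [{c: r[c] for c in self.columns} for r in table if self.row_filter(r)]


def query(role: str, view_name: str, table: List[Row],
          views: Dict[str, View], grants: Dict[str, Set[str]]) -> List[Row]:
    """Return the view's rows if the role has been granted that view."""
    if view_name not in grants.get(role, set()):
        raise PermissionError(f"role {role!r} may not read view {view_name!r}")
    return views[view_name].apply(table)


# Hypothetical data: analysts may see US rows without the salary column.
employees = [
    {"name": "ann", "region": "US", "salary": 100},
    {"name": "bob", "region": "EU", "salary": 90},
]
views = {
    "us_public": View(("name", "region"), lambda r: r["region"] == "US"),
    "all_rows": View(("name", "region", "salary"), lambda r: True),
}
grants = {"analyst": {"us_public"}, "admin": {"us_public", "all_rows"}}
```

The point of the design is that roles never touch the base table directly: access is granted on views, so restricting a role means choosing which views it can read.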

Cassandra 2.0 has support for lightweight transactions -- i.e. linearizable consistency. This post provides background on consistency models, the Paxos distributed consensus algorithm, and the new transaction feature in Cassandra 2.0. Even if you don't use Cassandra, it's a good read about consensus in distributed systems.
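From the client's point of view, a lightweight transaction behaves like a compare-and-set: the write is applied only if a condition on the current state holds, and the caller learns whether it was applied. The Python sketch below shows just that semantics on a single in-memory map guarded by a lock. It is not how Cassandra implements the feature (Cassandra coordinates replicas with Paxos), and the class and method names are hypothetical.

```python
import threading


class TinyRegister:
    """Single-process stand-in for a table with conditional writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}

    def insert_if_not_exists(self, key, value):
        """Insert only when the key is absent. Returns True if applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if(self, key, expected, new_value):
        """Update only when the stored value equals `expected`. Returns True if applied."""
        with self._lock:
            if key not in self._rows or self._rows[key] != expected:
                return False
            self._rows[key] = new_value
            return True

    def get(self, key):
        """Return the stored value, or None when the key is absent."""
        with self._lock:
            return self._rows.get(key)
```

Two clients racing to claim the same key will see exactly one `True`; the loser gets `False` and can read the winning value instead of silently overwriting it.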

Chris Harris from Hortonworks presented on YARN and Tez at the Data Science London meetup. The slides give a great overview of the motivation in moving from MapReduce to YARN/Tez, provide an architecture overview of both YARN and Tez, and present some experimental results using Hive 0.11.

InMobi, the folks originally behind Apache Falcon, have blogged about their system Pintail, which complements Falcon by offering sub-minute access to datasets in HDFS. It has features like tailing a stream of data from multiple clusters, checkpointing a stream, partitioning a stream for throughput, and custom input formats.

Cloudera Morphlines, which is part of the open-source Cloudera Development Kit, provides a configuration-based system for light-weight ETL. At the SF Data Engineering Meetup, Wolfgang Hoschek presented on using Morphlines for "on-the-fly ETL." His presentation covers details of the current implementation, data type support, standard library, and plugin system as well as real-time ETL capabilities.

The Cassandra Query Language, CQL, is a SQL-like language for interacting with Apache Cassandra. The Cassandra 2.0 release is adding a lot of functionality to CQL -- ALTER … DROP, conditional updates/schema modifications, triggers, partial secondary index support, and more. This post details the language feature enhancements and other improvements in Cassandra 2.0.

The Kiji Project provides a collection of software built atop Apache HBase for representing, storing, and querying data in an entity-centric fashion. This post explores the power of this type of representation versus the traditional data warehousing star schema. While the major benefit is the ability to store complex data (e.g. lists, nested records) about a particular entity, HBase also has built-in support for multiple versions of data, providing the ability to view the evolution of a particular entity.
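As a rough picture of what "entity-centric with multiple versions" means, the Python sketch below keeps every timestamped value written for a row and column, so a reader can ask for the latest value or for the value as of an earlier time. It is a toy in-memory model, not the Kiji or HBase API; `VersionedTable` and its methods are hypothetical names.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self):
        self._cells = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        """Record a new version of a cell without discarding older ones."""
        versions = self._cells[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, as_of=None):
        """Return the newest value, or the newest one at or before `as_of`."""
        for ts, value in self._cells.get(row, {}).get(column, []):
            if as_of is None or ts <= as_of:
                return value
        return None

    def history(self, row, column):
        """Return all (timestamp, value) versions, newest first."""
        return list(self._cells.get(row, {}).get(column, []))


# Hypothetical usage: one row per entity, complex values allowed in a cell.
table = VersionedTable()
table.put("user:42", "info:email", 100, "old@example.com")
table.put("user:42", "info:email", 200, "new@example.com")
table.put("user:42", "info:tags", 150, ["hadoop", "hbase"])
```

Everything known about `user:42` lives under one row key, and `history` exposes how a field changed over time, which a star schema would typically need extra snapshot tables to capture.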

Cloudera Manager, the cluster management software, received two new and important features in version 4.5 -- role groups and host templates. Together, these features make it easy to build a heterogeneous cluster. This blog post walks through the steps required to set up a new host template and apply it to a new, second class of DataNodes being added to a Hadoop cluster.

Using a dataset from Uber available on Infochimps, Carter Shanklin presented on Spatial Analytics with Hive at the July Hive Meetup. The talk covers the new windowing features in Hive 0.11, the spatial framework for Hadoop from ESRI, and some results obtained with this analytics stack.


At the Hadoop-DC meetup, Joey Echeverria from Cloudera presented on Apache Accumulo and Cloudera. During the presentation, Cloudera announced a semi-private beta of Accumulo 1.4.3 on CDH4.3 and Accumulo-Pig integration. The talk also covers the history of Accumulo and its data model.

LinkedIn released an analysis in which it highlighted the 10 hottest startups in Silicon Valley. Cloudera topped the list, and Hortonworks is number 5.

Hortonworks highlights the growth in the Hadoop job trends along with the slight decline in SQL job trends… and suggests that Hadoop (and in particular Hive) is a good addition to the skill set of anyone familiar with SQL. They also have some pointers for getting started.

Hortonworks was named the Global 250 B2B company of the year by AlwaysOn. In the press release, AlwaysOn mentioned that Hortonworks is rising quite fast in the big data world and that they have 'set the blueprint for how organizations can benefit from big data with 100-percent open source Hadoop.'


Kiji, the toolkit for building applications on Apache HBase, released a new version of the Buri BentoBox (i.e. collection of Kiji software). This update provides compatibility with CDH 4.3 and support for profiling of operations performed above the HBase layer.

Cloudera Search 0.9.2 beta was released. The new version contains faceting improvements in HUE, JSON support for morphlines (the ETL engine), and various fixes and performance improvements.

Cloudera released version 1.1 of Sentry, which provides access and authorization features for Hadoop. This version adds new authorization features to Hive and Impala from CDH 4.3. More details about the features of Sentry are discussed above in the Cloudera announcement.

Impala 1.1 and Cloudera Manager 4.6.2 were released. In addition to adding support for Sentry, Impala 1.1 adds support for views, support for SQL-89 joins, performance improvements, HBase improvements, and more. Cloudera Manager 4.6.2 adds support for Impala 1.1 and fixes several bugs.

Cloudera ODBC 2.0 Connector for QlikView was released. It supports QlikView 11, CDH4.2's Hive, and Impala 1.0+.

DataStax announced DataStax Enterprise 3.1 and OpsCenter 3.2. DataStax Enterprise 3.1 includes Cassandra 1.2 and Solr 4.3, promises drastic performance improvements, and includes many new search features. The OpsCenter management software is also supposed to be nearly 10x faster.

Apache Mahout 0.8 was released this week. Mahout, which is a project for building scalable machine learning, includes a number of implementations that utilize Hadoop. Of note, this release includes StreamingKMeans on MapReduce and a utility to convert Lucene indices to sequence files.