Data Eng Weekly

Hadoop Weekly Issue #49

22 December 2013

Next week is a short week for most folks in the US, and it seems like the community has been working hard to push out lots of technical posts and new releases. If you have some time off next week, there's plenty to catch up on. You can really see the momentum building for Hadoop in 2014.


TechCrunch has an article on the current SQL-on-Hadoop craze, including details on several systems focused on low-latency analytics. The article also talks about how Google's Dremel system was the inspiration for many of them.

Forbes has a good guide to choosing a Hadoop distribution. It includes a wide range of companies (notably missing is the Hadoop-as-a-Service contingent), and it has an interesting take on bucketing the providers into "builders," "completers," "embedders," and "customizers." Overall, it's a useful overview of the enterprise ecosystem.

Jay Kreps, who has done a lot of distributed system work at LinkedIn (including Azkaban, Kafka, Voldemort, and Samza), has written a post about logging as the central part of a modern distributed system architecture. While not directly about Hadoop, the post talks about loading from and to a batch system like Hadoop. This post is full of so many good ideas that I predict it'll be a recurring resource for years to come.

The Hortonworks blog has a post on wire encryption in Hadoop, covering the various types and how to configure them. In particular, Hadoop's wire encryption covers RPC (from client to server and between daemons), HTTPS (including 2-way SSL), the shuffle phase of MapReduce, and JDBC connections to HiveServer2.

The Cloudera "vision" blog has a post about how they think about SQL-on-Hadoop. It walks through Cloudera's decision to build a new system (Impala) rather than making incremental improvements to an existing system (Hive). The post also has details of the Impala dev process and how Impala fits into the Cloudera offering (e.g. how it complements Pig and Hive).


InformationWeek has an article about WibiData, which builds and supports the entity-centric WibiEnterprise system. WibiEnterprise provides a foundation for companies to build real-time recommendation engines from large data sets. This is one of the best overviews of WibiData's product offering that I've seen.

Datameer, makers of analytics and visualization software for Hadoop, have raised $19 million in Series D financing. Datameer plans to use the money to help meet demand and to expand internationally.

Cloudera and Amazon Web Services have reached a deal to support Cloudera Enterprise on AWS. Details of the partnership are a little vague, but it sounds like Cloudera and AWS will have a two-way support hotline in case they need to kick questions to each other. Cloudera will still sell enterprise subscriptions directly to customers.

GigaOm has a post from Shaun Connolly of Hortonworks on common Hadoop adoption patterns. The first is data refinery (filtering/distilling data), the second is data exploration (finding interesting patterns in the data), and the third is application enrichment (building features from data).


Qubole has partnered with Google Compute Engine to offer its Qubole Data Service. Qubole is a Hadoop-as-a-service platform that ships with an optimized Hive. Support for Google Compute Engine is in addition to Amazon Web Services, and the post shares some benchmarks on the two services.

Cloudera released version 1.2.2 of Impala. Even though this is just a patch-level release, it has a number of large improvements. In particular, the new version includes a cost-based join optimizer, statistics computation from within Impala, an implementation of cross join, and preliminary support for secure authentication. At the same time, Cloudera has released a new version of the Cloudera ODBC Driver for Impala. Impala 1.2.2 is also the first in the 1.2.x line compatible with CDH 4.

Hortonworks has announced a technical preview of phase 3 of the Stinger initiative. As a reminder, the Stinger initiative is Hortonworks' project to make Apache Hive 100x faster. The preview includes improved integration between Hive and Tez, the vectorized query engine, some additional SQL support (IN/NOT IN/HAVING), and many other improvements. Along with the release, Hortonworks has open-sourced their benchmarking suite based on TPC-DS.

Version 0.55 of PrestoDB, the low-latency SQL query engine open-sourced by Facebook, was released this week. This release contains a number of new features and improvements, including speedups for CPU-bound workflows when using RCFiles, hash distributed aggregations, partial support for distinct operations, range-based predicate pushdown, and much more.

Cloudera has announced support for Accumulo via its Real-time Delivery (RTD) Accumulo add-on to Cloudera Enterprise. The release is based upon Accumulo 1.4.3 (with a number of backported patches), and it runs on CDH 4.3. There is a beta integration with Cloudera Manager 5.0 beta.

Intel announced version 3.0 of their Hadoop distribution as well as version 2.0 of the Intel Graph Builder for Apache Hadoop. The 3.0 release includes a number of new features, including an enhancement to improve the speed of data encryption (one of the Intel distribution's defining features) by up to 20x. The Intel Graph Builder has been completely rewritten as a number of Pig UDFs and macros.

The Kiji project has announced a new release of their BentoBox SDK. The new version targets CDH 4.x, and it includes a new Kiji Scoring Server for real-time execution of trained models. The release also includes updates to a number of the projects in the ecosystem.


I added the wrong link for the article describing Cloudera Manager 5. Here’s the proper link.


Curated by Mortar Data

Monday, December 23

Workshop - Data Science In the Cloud Using Amazon (Tel Aviv-Yafo, Israel)

Saturday, December 28

Data Science Trends: What to expect in 2014 (Hyderabad, India)