Data Eng Weekly

Hadoop Weekly Issue #50

29 December 2013

It was a somewhat quiet week given all of the holidays, but with the new year approaching, there are some year-in-review and 2014 prediction posts. There are also a few noteworthy technical posts and new releases. I hope that everyone has had a good holiday and has a happy new year!


The Cloudera blog has a meta-post with their top-10 posts of the year. There were a number of popular posts covering everything from Apache ZooKeeper to Cloudera Impala to the Parquet format and more.

HDInsight, the Hadoop on Windows Azure service, has a single-node "HDInsight Emulator" for development. This post walks through using the emulator for development and then launching a cluster with the service to run MapReduce jobs.

InfoQ has an article on ParallelX, which is a system for utilizing GPUs during Hadoop jobs. The software, which is in beta, translates JVM bytecode to OpenCL code that is compiled to run on the GPU. Thus, existing code can run without modification. Many types of MapReduce jobs are I/O bound, but if you're doing CPU-intensive machine learning or other computation, ParallelX could be an interesting way to speed up your jobs.

It's fairly common to have a copy of your online/transactional database in Hadoop for offline analysis that's too resource intensive to run on a live system. This post has some hints for extracting data from SQL Server into HDFS/Hive.

Despite some distracting profanity, this article provides a good overview of connecting Tableau to ElasticSearch via Hadoop/Hive. Since Tableau requires JDBC/ODBC access (and ElasticSearch doesn't have an SQL interface), it's possible to use Hive ODBC and the ElasticSearch Hadoop plugin to connect the two. The article includes all the steps necessary to deploy the Hadoop part of things.

Silicon Angle has an article (which reads a bit sales-pitchy) on the RainStor database. Compared to other SQL-on-Hadoop solutions, RainStor has some differentiating features like security/privacy (including encryption on disk) and compliance (including data disposal).


SDTimes has a recap of "Big Data in 2013." The story covers some of the changes in the Hadoop ecosystem -- from Hadoop 2.0 to the growth of Hortonworks to new Apache incubator projects in the ecosystem.

In another post recapping 2013, SearchDataManagement highlights five Hadoop stories from the year. The themes that are highlighted include Hadoop as a data warehouse, Hadoop 2.0, and the proliferation of Hadoop distributions.

InformationWeek has consolidated some big data predictions for 2014. The list includes predictions from the CEO of Concurrent (makers of Cascading), the CEO of Pentaho, and several others. Most of the predictions are fairly obvious, although only time will tell if they come to fruition.


Cloudera Impala 1.2.3 was released. It's a minor release that fixes a compatibility issue with Parquet files generated via MapReduce.

Hadoop Scalding NoJarTool is a development tool for running Scalding jobs from the sbt shell without first assembling a fat jar.

Hadoop.Client is a .NET Hadoop client. Version 0.1.0 was released -- it's being called an "early alpha" release.

The recently released Revolution R Enterprise 7 from Revolution Analytics features an integration with Hadoop MapReduce and Teradata. Supported algorithms include random forests, and the software is integrated with Cloudera and Hortonworks distributions.


Curated by Mortar Data

Monday, December 30

Inaugural Tel Aviv Meetup (Tel Aviv-Yafo, Israel)

Saturday, January 4

Hadoop Meetup (Bangalore, India)