Data Eng Weekly

Hadoop Weekly Issue #65

13 April 2014

Apache Hadoop 2.4.0 was released this week. MapR announced a new beta of their distribution which includes support for Apache Spark, and Apache Spark 0.9.1 was released. On the technical side of things, we got an interesting view into one of the biggest Hadoop deployments: Facebook wrote about various parts of their Hadoop deployment and their adoption of ORCFile.


Facebook has posted about their Hadoop data warehouse, including how they’ve adopted ORCFile and the improvements that they’ve made to the open-source version of it. They’ve also posted the source for their fork, which is 3x faster at writing, on GitHub. The article is full of details about Facebook’s data such as ingest volumes and common data types (heavy use of JSON and strings).

The Cloudera blog has a guest post from SequenceIQ CTO Janos Matyas about how they use Morphlines to filter data flowing through Flume and perform ETLs once in HDFS. It uses a dataset serialized as JSON as an example.

Using a dataset of command-line one-liners, this post shows how to use Apache Spark to compute a common machine learning metric: collocation. The post describes how to use Spark with ANTLR to parse the commands, compute the raw frequency ranking, and do a significance test. The post also has a number of in-depth pros and cons of Spark for data science.
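To make the idea of collocation concrete, here is a minimal sketch in plain Python (not Spark). It counts adjacent token pairs and ranks them by pointwise mutual information (PMI). Several things here are assumptions rather than details from the post: whitespace splitting stands in for the ANTLR parser, PMI stands in for whichever significance test the post uses, and the function name and `min_count` parameter are hypothetical.

```python
import math
from collections import Counter


def collocations(commands, min_count=2):
    """Rank adjacent token pairs by pointwise mutual information (PMI).

    Assumption: tokens are split on whitespace; the post parses commands
    with ANTLR, which this sketch does not attempt to reproduce.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in commands:
        tokens = line.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unstable scores
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        pmi = math.log2(p_pair / (p_first * p_second))
        scored.append(((first, second), count, pmi))

    # Highest PMI first: pairs that co-occur more often than chance predicts.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Hypothetical sample input, not taken from the dataset in the post.
sample = [
    "ls -la | grep foo",
    "ps aux | grep java",
    "ls -la /tmp",
    "cat file | grep foo",
]
ranked = collocations(sample)
```

The raw frequency ranking described in the post corresponds to sorting by the `count` field instead of the PMI score.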

Apache Oozie has always had the ability to schedule coordinator jobs to run on a fixed interval of every N minutes/hours/days/months, but finer-grained scheduling was complicated. Oozie recently integrated the Quartz scheduler to allow for cron-like syntax. This post on the Cloudera blog has more details and examples.

The Hortonworks blog has more details on the recently released Apache Tez 0.4. It covers four of the notable changes: application recovery, Hive on Tez stability, data shuffle improvements, and Windows support.

Slides from folks at Continuuity discuss how to provide transactions over HBase. The talk discusses the motivation, and how they’ve implemented client-side transactions using a fault-tolerant transaction manager.

One of the new features in the 2.4 release of Apache Hadoop is Access Control Lists, which provide finer-grained permissions than the traditional Unix rwx bits. HDFS’s ACLs are modeled on the POSIX API and are enabled by a configuration change on the NameNode. The post contains some examples of using the hdfs command-line client to get and set ACLs.
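As a rough illustration of the command-line usage mentioned above, the sketch below builds `hdfs dfs -setfacl` and `hdfs dfs -getfacl` invocations from Python. It assumes ACLs have been enabled on the NameNode (the `dfs.namenode.acls.enabled` property in `hdfs-site.xml`) and that the `hdfs` client is on the PATH. The helper names, the path, and the user in the example are hypothetical.

```python
import subprocess


def setfacl_cmd(path, entry):
    """Build the command that adds or modifies one ACL entry on a path.

    `entry` uses the POSIX-style form, e.g. "user:alice:r-x".
    """
    return ["hdfs", "dfs", "-setfacl", "-m", entry, path]


def getfacl_cmd(path):
    """Build the command that prints the ACL of a path."""
    return ["hdfs", "dfs", "-getfacl", path]


def run(cmd):
    """Run a command and return its stdout; requires a working HDFS client."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical usage against a live cluster (not executed here):
#   run(setfacl_cmd("/data/reports", "user:alice:r-x"))
#   print(run(getfacl_cmd("/data/reports")))
```

Keeping command construction separate from execution makes the arguments easy to inspect without a cluster.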

Big Data & Brews has an interview with Pivotal Chief Scientist Milind Bhandarkar about PivotalHD. The interview is transcribed, and in it they talk about HDFS, HAWQ, MADlib, GemFire, and the other parts of the Pivotal distribution. It’s one of the best overviews of the Pivotal stack that I’ve seen.

Mike Stonebraker, the creator of Postgres, Vertica, and several other database systems, has historically been a Hadoop skeptic. In an interview with Datanami, he explains the weaknesses he sees in MapReduce and goes into his vision of the future of big data.

Apache DataFu (incubating) is a collection of Pig UDFs for data analysis and an incremental processing framework called Hourglass. Slides from a talk at ApacheCon give an overview of the framework including some examples like computing session statistics.
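For readers unfamiliar with session statistics, here is a minimal sketch of the idea in plain Python. It does not use DataFu's Pig UDFs or Hourglass. The 30-minute inactivity gap, the function name, and the input shape are all assumptions made for illustration.

```python
def session_stats(events, gap_seconds=1800):
    """Group each user's events into sessions and summarize them.

    `events` is an iterable of (user_id, unix_timestamp) pairs. A new
    session starts when a user is idle for more than `gap_seconds`
    (an assumed threshold, not one taken from the talk).
    """
    by_user = {}
    for user, timestamp in events:
        by_user.setdefault(user, []).append(timestamp)

    lengths = []  # duration in seconds of every session found
    for stamps in by_user.values():
        stamps.sort()
        start = previous = stamps[0]
        for timestamp in stamps[1:]:
            if timestamp - previous > gap_seconds:
                lengths.append(previous - start)  # close the current session
                start = timestamp
            previous = timestamp
        lengths.append(previous - start)  # close the final session

    if not lengths:
        return {"sessions": 0, "mean_length": 0.0}
    return {"sessions": len(lengths), "mean_length": sum(lengths) / len(lengths)}


# Hypothetical sample data.
stats = session_stats([("a", 0), ("a", 600), ("a", 5000), ("b", 100)])
```

An incremental framework would avoid recomputing these statistics over the full history on every run; this sketch recomputes everything from scratch.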


MapR has announced that they’re adding support for Spark. Their integration is based on a partnership with Databricks. The support incorporates the entire Spark stack into the MapR distribution, including Spark Streaming, Shark, MLlib, and GraphX. In a guest post on the Databricks blog, MapR has more information on their motivation for integrating Spark and partnering with Databricks.

The agenda for the Hadoop Summit taking place this June in San Jose was announced this week. The conference is three days this year, and talks are in six different tracks.

In a recent Gartner survey, just 3% of executives said they were expecting to replace their data warehouse with Hadoop. Rather, most companies are planning on deploying Hadoop alongside their data warehouse, using the right tool for the right job. This post explores the hybrid solution, and it details some of the shortcomings holding back Hadoop (mostly around SQL).

The Ovum blog has a post about Hadoop and SAS. It covers some of the upcoming integrations between SAS and Hadoop, its in-memory compute framework (and how it compares to Spark, Tez, and Storm), and how R is gaining ground on some of SAS’s features (notably scalability).

Qubole announced that their Hadoop as a Service product, Qubole Data Service (QDS), is now generally available on Google Compute Engine. QDS had been in private beta since December.

The Apache Software Foundation celebrated the 5th anniversary of Apache Cassandra this week. A post on the ASF blog talks about some of the companies and projects that have adopted Cassandra, as well as the evolution of Cassandra over the past 5 years.


Apache Hadoop version 2.4.0 was released. This version has a number of improvements to HDFS and YARN, such as HDFS ACLs, rolling upgrades for HDFS, and high availability for the YARN ResourceManager. The Hortonworks blog has more details on the new release, and a preview of the features scheduled for version 2.5.0.

Version 0.9.0 of Adam was released this week. Adam is a processing framework for genomics using Apache Avro, Apache Spark, and Parquet.

Apache Spark 0.9.1 was released. It’s a maintenance release containing bug fixes, performance improvements and more. Notable features include improved stability of Spark-on-YARN, optimizations of the machine learning framework MLlib, and API parity work for PySpark.

Mortar announced that they’re open-sourcing their recommendation engine, which is used to build personalization at companies like MTV and Comedy Central. The framework is a mix of Pig and Java/Python UDFs.

MapR has announced the 4.0.0 Beta release of the MapR distribution. The release includes HBase 0.94.17, Pig 0.12, Oozie 4.0, and Hive 0.12. It also includes support for Spark, Tez, and Storm.

IBM has announced a new product called zDoop, a Hadoop offering for the System z mainframe.

Teradata announced Teradata QueryGrid, which offers a more complete integration between Teradata and Hadoop. Specifically, QueryGrid provides bi-directional data transfer between Teradata and Hadoop with the ability to push down processing. Hortonworks, a Teradata partner, has more information about the new release on their blog.


Curated by Mortar Data



Teradata User Group on Big Data, Analytics and Discovery (Calabasas) - Tuesday, April 15

Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) - Wednesday, April 16

Csaba Toth Presents, Hadoop Pig and Hive (Fresno) - Thursday, April 17

Mahout 1.0: Looking at the Future (Mountain View) - Thursday, April 17


Advanced Hadoop Based Machine Learning (Austin) - Wednesday, April 16


St. Louis Hadoop Users Group Meetup (St. Louis) - Tuesday, April 15

Washington, D.C.

Apache Sentry & Happy Hour at Local 16's Roof Deck! (Washington) - Monday, April 14

New Jersey

Big Data - Hadoop and Spark (Flemington) - Tuesday, April 15

New York

April 2014 - Clojure Meetup (New York) - Monday, April 14


Big Data Meetup, 2014/04 (Budapest) - Wednesday, April 16


BDNSHH April - Cloudera Impala (Hamburg) - Thursday, April 17


Spark Code Retreat (Nice) - Saturday, April 19