Data Eng Weekly

Hadoop Weekly Issue #66

20 April 2014

There were a number of announcements this week, including new Hadoop integrations from Microsoft, Google, and Amazon. Red Hat and Hortonworks announced the next step of their partnership, and Cloudera announced a new zero-download trial for CDH5. There are also some excellent technical resources, including details on the SlideShare analytics stack and a peek under the hood of Hadoop operations at Spotify.


perf top is a tool for profiling Linux systems. This post explains how to use it with Java, and how to convert the output to a flame graph. It focuses on how to do all of this with Tez, but it is broadly applicable to any Java application.
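As a rough illustration of the conversion step: flame graph tools commonly take "folded" input, one line per unique call stack with frames joined by semicolons and followed by a sample count. The sketch below assumes the profiler output has already been parsed into root-first lists of frame names. The function name and the sample frames are hypothetical, and this is not necessarily the procedure the post itself follows.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into folded lines: 'frame1;frame2;... count'.

    `samples` is an iterable of stacks, each a root-first list of frame names.
    """
    # Identical stacks collapse into a single key with a count.
    counts = Counter(";".join(frames) for frames in samples)
    # Sort so the output is deterministic.
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, standing in for parsed profiler output.
samples = [
    ["main", "runTask", "sortRecords"],
    ["main", "runTask", "sortRecords"],
    ["main", "runTask", "writeOutput"],
]

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

The width of each box in the resulting flame graph is proportional to these counts, so the most frequently sampled code paths stand out.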

Slides from the Hadoop Summit talk by Adam Kawa, Data Engineer at Spotify, about Hadoop operations were posted. Spotify is running YARN on a several-hundred-node cluster, and the talk covers analyzing and understanding the usage of their cluster. Examples include analyzing NameNode GC, analyzing HDFS usage, HDFS capacity planning, auto-tuning of MapReduce jobs, and much more.

The SlideShare engineering blog has an in-depth post about migrating their analytics stack from MySQL and Ruby to HBase and Pig scripts. The post covers the technologies involved, including detailed design diagrams of their processing pipeline, Hadoop infrastructure, and HA setup. It also describes their configuration tweaks and lessons learned.

The Databricks blog has some details about upcoming support for Java 8 lambda expressions in the Spark API. There are a few examples showing how concise the API becomes with the new syntax, which will be supported in Spark 1.0 (targeted for release in May).

The Cloudera blog has an article about writing, building, and running a simple Spark application on CDH5. The source accompanying the post is available on GitHub, and it includes implementations in both Java and Scala.


MapR has launched a new Developer Central with code samples and articles on best practices for Hadoop. Articles cover Hive, Pig, MapReduce, HBase, the Lambda Architecture, and more.

Using the advent of commercial cameras as an example, the Cloudera blog has a post exploring the legal and ethical ramifications of big data. It discusses privacy and transparency as well as the current regulations and regulatory efforts.

Accumulo Summit is taking place June 12 in College Park, MD. The call for papers is open until April 30.

Allied Market Research has released a report about the Hadoop market. The report predicts that the market will be worth $50.2 billion by 2020. The release breaks this down into software and hardware, and the full report has geographic and further breakdowns.

Ovum analyst Tony Baer has an article about the recently announced 2.1 release of HDP. It talks about how the release is expanding outside the traditional Hadoop core with Storm for streaming, search, interactive SQL (via Hive + Tez), and Apache Falcon and Knox.

FICO, makers of predictive analytics software, has acquired Hadoop startup Karmasphere. GigaOm has more details on the deal.


Cassandra 2.0.7 was released. It’s a bug fix release containing over 50 resolved tickets.

Apache Pig 0.12.1 was released. The new version includes a number of bug fixes as well as documentation improvements and a new version of HBase.

Apache Phoenix (incubating) released version 4.0 (targeting HBase 0.98.1+) and version 3.0 (targeting HBase 0.94.4+). The new releases include support for equi-joins through broadcast hash join, SQL views, and more.
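For readers unfamiliar with the technique, a hash join builds an in-memory index on the smaller table and then streams the larger table past it. In a distributed setting, "broadcast" means the small table is copied to every node so each one can join its share of the large table locally. The following is a minimal single-process sketch of the idea, not Phoenix's implementation. The function name, tables, and column names are hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_rows, small_key, large_key):
    """Equi-join two lists of dicts by hashing the smaller side."""
    # Build phase: index the small table by its join key.
    index = defaultdict(list)
    for row in small_rows:
        index[row[small_key]].append(row)

    # Probe phase: look up each row of the large table in the index.
    joined = []
    for row in large_rows:
        for match in index.get(row[large_key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 1},
]

result = broadcast_hash_join(customers, orders, "customer_id", "customer_id")
for row in result:
    print(row)
```

Only rows whose keys compare equal are matched, which is why this approach applies to equi-joins. It also depends on the smaller table fitting in memory.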

This week’s update to Amazon Redshift adds support for copying data directly from an Amazon Elastic MapReduce cluster to a Redshift cluster.

Google announced preview releases of the Google BigQuery connector and Google Cloud Datastore connector for Hadoop. These implement the Hadoop InputFormat and OutputFormat APIs.

Ferry, the project for running development environments of various distributed systems in Docker containers, released a new version this week. The new release includes improvements for YARN and the ability to forward ports from a container to the host.

Microsoft announced support for Apache Avro via the Microsoft Avro Library. The library is optimized by building an in-memory expression tree which is compiled into IL code.

Sprunch is a new Scala API atop Apache Crunch. It provides “Pimp My Library”-style extensions via implicits and class extensions. Compared to Scrunch (the Scala API that is part of Apache Crunch), the API aims to be minimalistic and is less than 90 lines of code.

Cloudera announced a new zero-download demo of CDH 5 called “Cloudera Live.” It provides access to the entire Cloudera stack for up to 3 hours at a time via the Hue web interface.

Kafkacat is a stand-alone application for reading from and writing to Kafka from stdin/stdout (a la cat). It’s a small, statically linked C program.

Microsoft announced a new product called the Analytics Platform System that allows queries across traditional SQL data warehouses and Hadoop.

Hortonworks and Red Hat announced the next step of their partnership by integrating OpenShift PaaS with the Hortonworks Data Platform. The integration allows OpenShift applications to run in Hadoop via YARN. There is an example project that runs a Python Flask server to serve data stored in HBase.


Curated by Mortar Data



Data Science Stack Showcase (San Francisco) - Tuesday, April 22

Spark 1.0 and Beyond (San Francisco) - Wednesday, April 23

Large-Scale Machine Learning with Apache Spark (San Francisco) - Thursday, April 24

Big Data Developer Day (Los Angeles) - Saturday, April 26


Learnings from Running Spark at Twitter (Bellevue) - Wednesday, April 23

Seattle Scalability Meetup - Agile Data & Apache Tez (Seattle) - Wednesday, April 23


Introduction to Spark (Boulder) - Wednesday, April 23


Advanced Hadoop Based Machine Learning (Austin) - Wednesday, April 23


Teradata User Group Conference: Central-St. Louis Region Event Agenda (Saint Louis) - Tuesday, April 22


Impala - Straight from the Antelope's Mouth (Philadelphia) - Tuesday, April 22

Hadoop Users Group Pittsburgh April Meetup (Pittsburgh) - Wednesday, April 23


The Future of Data (Cambridge) - Tuesday, April 22


Presentation on HBase viewer by Abraham Elmahrek - Hue developer (Tel Aviv-Yafo) - Wednesday, April 23


Large Scale Image Classification and Apache Spark for applied machine learning (Amsterdam) - Wednesday, April 23


PerfUG : Hadoop et HDFS : Stockage, Requêtage et Performances (Paris) - Thursday, April 24