Data Eng Weekly

Hadoop Weekly Issue #80

27 July 2014

Two large pieces of news this week: HP and Hortonworks announced a $50 million investment in Hortonworks as part of an expanded partnership, and Apache Tez graduated from the Apache Incubator. Additionally, there were a number of interesting technical posts this week on Pig, MapR FS, SQL on Hadoop, HDFS, and more.


The Hortonworks blog has a post highlighting some of the new features of the recently released Apache Pig 0.13. The 0.13 release adds preliminary support for multiple backends (i.e., execution engines other than MapReduce, such as Tez or Spark). The post talks about several new features, including new optimizations for small jobs, the ability to whitelist/blacklist certain operators, a user-level jar cache, and support for Apache Accumulo.

A post on the Pythian blog discusses how the small files problem, which is well understood with HDFS and MapReduce, can also affect MapR FS in certain situations. It gives a brief overview of the MapR FS architecture, describes the problem, and suggests some best practices.

As the number of projects in the Hadoop ecosystem grows, understanding how all the pieces fit together becomes more challenging. This post from the Rackspace blog tries to bucket the various components into six areas, and it gives a good beginner-level introduction to each.

This post on the Sonra blog is one of the most comprehensive and up-to-date overviews of the SQL-on-Hadoop space that I’ve seen. It covers all the latest announcements such as Hive on Spark and Spark SQL. The post also goes into detail on Hive on Tez, Cloudera Impala, Presto, Apache Drill, and InfiniDB.

Testing distributed systems can be very hard, but there are good tools for doing so, such as the Jepsen test framework. This post looks at applying a Jepsen test to HDFS High Availability via the Quorum Journal Manager. Results show that HDFS remains consistent under a network partition, although availability can suffer (as is expected).

This post serves as an updated guide for running MapReduce jobs that read from and write to Cassandra. It includes sample code for configuring the input and output formats, building the MapReduce job, and generating Cassandra Mutation objects to update the output database.

This presentation gives an overview of structor, which is a tool for building virtual Hadoop clusters with Vagrant. It describes the system architecture, which uses Puppet for provisioning Hadoop components. It also details the various configuration options and instructions for using the tool.

Flambo is a recently open-sourced Clojure DSL for Apache Spark. This post serves as a detailed introduction to the API by walking through how to generate TF-IDF for an example dataset.
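For readers unfamiliar with TF-IDF, here is a minimal single-machine sketch in plain Python. It is not Flambo or Spark code and is not taken from the linked post; the function name `tf_idf` and the choice of an unsmoothed `log(N / df)` weighting are assumptions made for illustration (other variants of the formula exist).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF scores for a small in-memory corpus.

    docs: dict mapping a document id to a list of tokens.
    Returns a dict mapping each document id to {term: score}.
    This is an illustrative sketch, not the Flambo API.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency (count / document length) times
        # inverse document frequency (log of N / df), unsmoothed.
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


corpus = {
    "a": ["spark", "is", "fast"],
    "b": ["pig", "is", "slow"],
}
result = tf_idf(corpus)
```

A term that appears in every document (here, "is") gets a score of zero, while terms unique to one document score highest. A distributed version would compute the same two aggregates (per-document term counts and corpus-wide document frequencies) with map and reduce steps.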

The Apache blog has a post detailing the Apache Sentry project, which aims to offer fine-grained access control to data stored in Hadoop. This post looks at the Hive integration in particular, but there are also integrations with Cloudera Impala and Apache Solr. It discusses the authorization primitives such as privileges, roles, and groups as well as the policy engine and policy provider components.

Datanami has an article discussing enforcing SLAs on Hadoop clusters. It focuses on Pepperdata’s product offering, which does real-time monitoring of a cluster to do fine-grained enforcement of SLAs. Hadoop systems (like the fair/capacity schedulers) can be a bit coarse in enforcing SLAs, which causes some folks to go to extremes to guarantee SLAs (like building dedicated clusters). If you’re in this situation, you might want to hear more about Pepperdata.

The Pinterest blog has a post about their big data infrastructure that ingests 20 terabytes of new data per day for a total of around 10 petabytes. Pinterest is entirely in AWS and using S3 for storage. They use the Hive metastore as a source of truth, and they migrated from Amazon EMR to Qubole’s service (from which they’ve seen major benefits). The post also details how they provision the instances in a Hadoop cluster.

The SequenceIQ blog has a post on the YARN Capacity scheduler. It explores the internals of the scheduler, including the configuration and scheduler event loop. It takes a detailed look into each of the types of SchedulerEvents (e.g. node added/removed, app added/removed) that change the state of the scheduler.

This post describes document-level security for Cloudera Search, which is a new feature of CDH 5.1. The feature is implemented with Apache Sentry: a Solr SearchComponent adds filterQueries based on the roles associated with a particular query.
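The general idea of role-based filtering can be sketched as follows. This is a hypothetical illustration in Python, not Sentry's or Solr's actual code: the function `build_role_filter`, the field name `sentry_auth`, and the exact shape of the generated filter are all assumptions.

```python
def build_role_filter(roles, auth_field="sentry_auth"):
    """Build a Solr-style filter query string that matches only documents
    tagged with at least one of the given roles.

    Hypothetical helper for illustration; `auth_field` is an assumed
    name for the per-document field that stores allowed roles.
    Returns None when there are no roles, in which case the caller
    should deny access rather than run an unfiltered query.
    """
    if not roles:
        return None

    quoted = []
    for role in sorted(set(roles)):
        # Escape backslashes and double quotes so the role name
        # stays inside its quoted phrase.
        escaped = role.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"{}"'.format(escaped))

    return "{}:({})".format(auth_field, " OR ".join(quoted))


fq = build_role_filter(["analyst", "admin"])
```

The resulting string would be appended to the user's query as an extra filter, so documents without a matching role never appear in the results.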

In the second part of a series summarizing broad concepts from Hadoop Summit, the Hortonworks blog has a post about YARN. It discusses several themes that came out of the Summit regarding YARN, and it highlights seven related presentations.


Hortonworks and HP announced that they’re deepening their partnership, and HP is investing $50 million in Hortonworks. This investment joins the $100 million round that Hortonworks announced in March.

Apache Tez was promoted to a top-level project this week by the Apache Software Foundation. Tez entered the incubator in February 2013, and has seen contributions from employees of several companies, including Cloudera, Facebook, Hortonworks, LinkedIn, Microsoft, Twitter, and Yahoo.

MapR and Tata Consultancy Services announced a partnership this week. The two companies are offering joint products based on TCS’s data analytics/management solutions and MapR’s distribution.

GigaOm has a post about the rise of Spark and Tez as evolutionary replacements for MapReduce. It talks about how these frameworks fit in with YARN, Hive, and Pig, and the history of both frameworks.

The Gartner blog has a recap of some of the Hadoop-related investments that took place this week. It puts them into the context of the wider DBMS/IT industry and adds some color to the HP investment in Hortonworks. It also discusses the push for global sales/support in many of these moves.


The Cloudera Oryx project is a system for real-time machine learning. This week, a reboot of the project, Oryx 2, was announced. The new version implements the lambda architecture for large-scale machine learning, using Apache Spark for both the batch layer and the speed layer (the latter via Spark Streaming).

Oink is a gateway server to Apache Pig/Hadoop providing a REST API. Built at eBay, it was open-sourced this week. The main design goals include governance, scalability, and change management.

Avro 1.7.7 was released. The new version includes a Perl implementation of Avro, support for a DECIMAL type, schema validation utilities for Java, and more. It also contains several bug fixes.


Curated by Mortar Data



Hadoop Talk: Details of Anomaly Detection in Big Data (San Jose) - Monday, July 28

Big Data, Docker, and Apache Mesos (San Francisco) - Wednesday, July 30

Spark Machine Learning Bonanza (Sunnyvale) - Wednesday, July 30


Seattle Scalability Meetup: Eastside Edition (Seattle) - Wednesday, July 30


Inaugural Elasticsearch Meetup (Minneapolis) - Thursday, July 31


An Introduction to Apache Spark and Mesos (Madison) - Tuesday, July 29


A Leap Forward for SQL on Hadoop (Chicago) - Wednesday, July 30

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS (Chicago) - Wednesday, July 30


Social Text-Analytics and Visualization Using Hadoop & Streams Computing (Bethesda) - Tuesday, July 29

North Carolina

Rethinking SQL for Big data – Don’t Compromise on Flexibility or Performance (Durham) - Tuesday, July 29

July CHUG: Matt Jones (CTS) on Protecting PII in the Hadoop/Analytics World (Charlotte) - Wednesday, July 30


Hadoop Demystified (Alpharetta) - Monday, July 28


Centralized Logging - Industry First Approach to HBase Fans (Jacksonville) - Tuesday, July 29


Presentation Corner - Couchbase & Query Engines in Spark (Toronto) - Monday, July 28

Introduction to Apache Hive (Ottawa) - Thursday, July 31


Hadoop 101 - Beginners Only! (Melbourne) - Tuesday, July 29


Spatial and Hadoop Integration with Netezza (Auckland) - Thursday, July 31