Data Eng Weekly

Hadoop Weekly Issue #126

21 June 2015

Spark Summit was this past week in San Francisco, so there's a lot of great content and announcements related to Spark in this issue. Some highlights include an article about the Spark and Cassandra integration, support for Spark in Amazon EMR, and a large commitment from IBM to help develop and advocate for Spark.


This presentation about Parquet has all the details you need to get started with the format and incorporate it into your data pipeline. This includes an overview of the problems that Parquet solves, the basics of the format, a description of a typical data flow, and example usage in MapReduce, Scalding, Pig, and Hive / Impala.

This tutorial describes how to install Hue 3.8.1 with the Pivotal PHD 3.0 distribution (which comes with Hue 2.6). The post uses pre-built RPMs from Apache Bigtop and describes how to configure access to HAWQ and HBase from Hue.

The Databricks blog has a great guest post by folks from DataStax about optimizing Spark with Cassandra. In addition to Cassandra-specific optimizations (e.g. RDD/Cassandra inner joins, Cassandra-specific/efficient groupBy and column selection), the post has general Spark information like an introduction to Spark's architecture, a practical overview of RDDs, and generic Spark tips.

While RDDs are the workhorse of core Spark, DataFrames are a powerful alternative that's seeing a lot of momentum and improvements. These two presentations from Spark Summit give an overview of DataFrames and a look at some of the performance improvements that are in the works as part of Project Tungsten.

In the second part of their series on Hadoop/YARN logs, a post on the Altiscale blog covers DataNode, NodeManager, Oozie, the Resource Manager audit, and Hive logs. It describes how to interpret the various log files and includes several awk scripts for automated processing.
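As a rough illustration of the kind of automated processing the post does with awk, here is a minimal Python sketch that tallies log lines by severity. It assumes the common log4j-style layout (timestamp, level, logger, message); the sample lines and their messages are made up for this example and are not taken from the Altiscale post, and `count_levels` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time>,<millis> <LEVEL> <logger>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+): "
    r"(?P<message>.*)$"
)


def count_levels(lines):
    """Count log lines per severity level, ignoring lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # continuation lines (e.g. stack traces) are skipped
            counts[match.group("level")] += 1
    return counts


# Made-up sample input; real DataNode/NodeManager logs will differ.
SAMPLE = [
    "2015-06-21 10:15:02,114 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: example message one",
    "2015-06-21 10:15:03,220 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: example message two",
    "    at com.example.Hypothetical.method(Hypothetical.java:42)",
    "2015-06-21 10:15:04,001 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: example message three",
]

if __name__ == "__main__":
    for level, total in sorted(count_levels(SAMPLE).items()):
        print(level, total)
```

If a cluster uses a different log4j pattern, the regular expression is the only part that should need adjusting.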

This post describes Intent Media's progression from raw MapReduce in Java to Pig to Cascalog to Spark. Spark has allowed them to make better use of machine learning (given all the algorithms in MLlib), and the second part of the post describes how they use Spark on Amazon EMR. The code for the post is available on GitHub.

Apache Kafka is integrated with many real-time big data frameworks, like Spark Streaming, Samza, and Storm. In addition to providing input data, Kafka can also be used for output—making it somewhat akin to the HDFS of a stream data platform. This post describes this role in depth, several use cases for Kafka, and more.

Spring XD 1.2 includes the Spring Integration Kafka Adapter, which adds a number of features on top of the Simple Consumer API (such as an offset manager and a ConnectionFactory). The Spring engineering blog describes some throughput testing that the team did in order to validate their implementation. Using a Rackspace OnMetal cluster with SSDs and 10GbE, they were able to achieve ~2 million events per second using a single thread (which nearly matches the throughput of the Kafka performance testing tools).

Apache Pig 0.15.0 was released last week with enhancements to the Tez integration and support for Hive UDFs. This post describes the Tez enhancements (improved scalability, the Tez UI and local mode, and auto-parallelism) in detail. Regarding support for Hive UDFs, it notes that Hivemall (a machine learning UDF library) has been tested, and the Hivemall wiki includes documentation for getting started.

The Hortonworks blog has the fourth post in a series on diverse workloads in YARN. This post describes preemption in the YARN Capacity Scheduler—from the motivation for using preemption to the low-level preemption API to the main configuration settings.

H2O is an in-memory, parallel machine learning engine. This presentation demonstrates integrating Spark and H2O, which results in so-called "Sparkling Water." Specifically, it shows using MLlib's Word2Vec algorithm with H2O's Gradient Boosting Machine to classify text (Craigslist ads).

One of the best ways to understand big data systems is to read the original research papers that describe the architecture, assumptions, and trade-offs of each system. This post has a list of 100 papers broken down into "architecture layers" like file systems, data stores, coordination, and computational frameworks.

Speaking of research papers, the morning paper blog covered two relevant papers this week: "Spinning Fast Iterative Data Flows," which is the basis of Apache Flink, and "Discretized Streams: Fault-Tolerant Streaming Computation at Scale," which is the basis of Spark Streaming. The Apache Tez paper from the recent SIGMOD conference was also published this week.

It can be really frustrating to get started with a new data system when all you want to do is query a simple dataset. In many cases this is complicated by the need to set up multiple distributed systems, but Apache Drill makes it easier by providing first-class support for JSON data stored on the local file system (there are still plenty of gotchas, though, as this post describes). Taking advantage of this feature, this tutorial describes how to load JSON data into Drill and integrate with Tableau for visualization.


MapR has announced three new Quick Start Solutions built on Spark. Each Quick Start includes a software license, professional services, training and certification, and pre-built templates. The new solutions are Real-time Security Log Analytics, Time Series Analytics, and Genome Sequencing.

This is a good article summarizing some of the announcements and themes of the recent Spark Summit and Hadoop Summit. It also discusses the relationship between Hadoop and Spark (the latter won't replace the former) and the state (maturity, community) of the two projects.

Mesosphere and Typesafe have announced that Typesafe will be providing enterprise-class support for Spark on Mesosphere's Datacenter Operating System.

A guest post by MapR on the Databricks blog describes some of the lessons they have recently learned while working with customers to build applications on Spark. These include choices of language (most developers want to use Java although data scientists like Python) and common use cases (batch processing, ETL, OLAP cubes, operational analytics).

IBM is throwing its weight behind Spark in a big way. At Spark Summit, they announced that they're committing 3,500 developers to work on projects related to Spark, contributing machine-learning libraries to Spark, working with partners to teach Spark to 1 million engineers and data scientists, and more.

If you missed Spark Summit, the Databricks blog has a recap of the key announcements, keynotes, community talks, and more.


Mario is a new tool for constructing data pipelines. Written in Scala and integrated with Spark, Mario provides nice tools to compose various steps in a pipeline to provide concurrency and isolation. The introductory post describes the architecture and API and compares Mario to Luigi, Cascalog, and Spark's ML pipelines.

Apache Hama, the high-performance bulk synchronous parallel (BSP) engine, released version 0.7 this week. The new version adds support for running Hama applications on Mesos and YARN, adds implementations of max-flow, k-core, ANN, and more, and improves performance of queueing and messaging.

Databricks, the hosted Spark+more data processing platform, is now generally available. Alongside general availability, Databricks adds support for Spark 1.4, notebooks for Spark Streaming, and improved commenting.

Apache Storm 0.10.0-beta was released this week. New features of the release include a secure, multi-tenant deployment (Kerberos, SSL support, user isolation, and more), the first steps towards rolling upgrades, a new Flux framework for defining and deploying topologies using a YAML DSL, an improved logging framework, streaming ingest to Hive, Redis support, and much more. The release notes also mention the Twitter Heron project in the context of future work on the project.

Amazon EMR has added support for Apache Spark. In addition to supporting all the features of Spark, Spark on EMR adds support for EMRFS for consistent access to data in S3.

Version 1.1.0 of the Kite SDK was released with several new features including automatic schema and partition detection of existing data, a "compact" command for merging small files, and support for datasets stored in S3.

Version 7.1 of Bright Cluster Manager for Apache Hadoop adds full support for Apache Spark, including installation and monitoring. The software is vendor-independent and can deploy Spark with or without HDFS.


Curated by Datadog



Advanced Cassandra: Intro to PySpark (Santa Monica) - Thursday, June 25


Using Existing Math Libraries with Spark (Chicago) - Wednesday, June 24


Apache Spark & Oracle Big Data Strategy (Farmington Hills) - Monday, June 22


Apache Drill: Self-Service Data Exploration and Nested Data Analytics on Hadoop (Atlanta) - Thursday, June 25


Spark Deep Dive (Laurel) - Tuesday, June 23

Accelerating Hadoop Projects with the Cask Data Application Platform (Laurel) - Wednesday, June 24

New Jersey

Ted Dunning: Real-Time Recommendation Engine (Princeton) - Thursday, June 25

New York

Hortonworks Data Platform Roundtable (New York) - Tuesday, June 23

IRELAND

An Introduction to Spark & Red Sqirl (Dublin) - Wednesday, June 24


Deenar Toraskar: Applying the Lambda Architecture with Spark/Spark Streaming (London) - Tuesday, June 23


Intro to Spark and Data Science Using Spark (Oslo) - Thursday, June 25


Introduction to Spark DataFrames (Lyon) - Monday, June 22


Flink Meetup: Data Flow vs Procedural Programming (Berlin) - Tuesday, June 23


From Big Data to Fast Data: An Introduction to Apache Spark (Milano) - Thursday, June 25


Apache HBase Operations: Practices & Considerations (Tel-Aviv) - Wednesday, June 24