Data Eng Weekly

Hadoop Weekly Issue #88

21 September 2014

There are a number of posts covering the recently released Apache Spark 1.1, Apache Drill 0.5.0-incubating, and Apache Tez 0.5.0. In addition, there’s a look at Hadoop in the healthcare industry, a look at ORCFile for non-Hive workloads, instructions for building a Hadoop setup on Mac, and more. The amount of content this week shows that we’re past the summer lull, and I expect to see lots more great content this fall.


The first of several posts on Apache Spark 1.1.0, this one covers Spark Streaming. New features of Spark Streaming in the 1.1.0 release include integration with Amazon Kinesis and high availability for Apache Flume. The post gives an intro to Spark Streaming, the new features, and talks through several example use cases.

Spark 1.1 also includes improvements to PySpark, which provides Python access to Spark. The main improvement is API support for arbitrary Hadoop InputFormats. This post includes an example of using SequenceFiles from PySpark and a walk-through of the new Converter trait, which is used to map custom types to POJOs.

This tutorial has instructions for setting up a single-node Hadoop cluster including HDFS, YARN, HBase, and Flume on a Mac. The instructions are for CDH but most of the content should be useful for any distribution.

This post is an introduction to and overview of stream processing frameworks. It introduces use cases (including a closer look at fraud detection), gives an overview of stream processing architecture, and introduces several systems (Apache Storm, Apache Spark, IBM InfoSphere Streams, and TIBCO StreamBase). The article also has a section on integrating with Hadoop and other data warehouses.

The Cloudera blog has a post summarizing a research project out of Zurich University to evaluate Cloudera Impala for mixed workloads. The post describes the use case, the report conclusions (namely that Impala scales linearly with more users), and includes a link to the full evaluation.

This post looks at using Apache Spark in two different ways. First, it shows how to compute summary statistics and other aggregations on a data stream. Second, it explores generating Parquet files from Spark. After a few false starts (trying to use Scala libraries for Avro, Thrift, and Protobuf), the post shows how to use Java Protobufs as the write support for Parquet.
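As a rough illustration of what computing summary statistics on a stream involves, here is a minimal sketch in plain Python. It is not the code from the post and does not use Spark; the `RunningStats` class is a hypothetical name, and the update rule is Welford's online algorithm, swapped in here as an assumption about one reasonable way to keep a running mean and variance without storing the whole stream.

```python
class RunningStats:
    """Running count, mean, and population variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far (0.0 if empty)."""
        return self._m2 / self.count if self.count else 0.0


# Hypothetical stream of measurements, consumed one value at a time.
stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Each update needs only constant memory, which is why this style of aggregation suits data that arrives continuously.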

Apache Samza (incubating) is a stream processing framework that integrates well with Apache Kafka. Samza apps run on Apache YARN. This post on the LinkedIn blog describes how they use Kafka and Samza as a platform for distributed tracing (looking at the interaction between services in a service-oriented architecture). They describe a number of improvements to Samza that have been made to scale distributed tracing. The post also includes a description of using Samza to implement some foundational operations such as a cogroup.
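For readers unfamiliar with the term, a cogroup collects the values from two keyed datasets under each shared key. The sketch below is plain Python over in-memory lists, not Samza's API and not LinkedIn's implementation; the `cogroup` function and the sample trace events are hypothetical and only illustrate the semantics of the operation.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two iterables of (key, value) pairs by key.

    Returns a dict mapping each key to a pair of lists:
    (values seen in left, values seen in right).
    """
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


# Hypothetical trace events keyed by request id.
calls = [
    ("req-1", "frontend->search"),
    ("req-2", "frontend->profile"),
    ("req-1", "search->index"),
]
errors = [("req-2", "timeout")]

result = cogroup(calls, errors)
```

In a streaming system the inputs are unbounded, so a real implementation also has to decide how long to buffer each key before emitting its group; this sketch ignores that concern.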

These slides are from a presentation walking through the evolution of the data pipeline at Tapad. They first describe Tapad’s data challenges and then walk through the data pipeline, which eventually converged on Avro and Kafka as the core components. The slides include details about how Tapad moves data from Kafka to HDFS (rewriting as Parquet along the way) and uses Scalding to build MapReduce jobs.

ORCFile is a columnar storage format that ships with Hive. The format can also be used with other systems, such as Cascading and Apache Crunch. This article introduces ORCFile, gives examples of using it with Cascading and Crunch, and includes some benchmarks demonstrating impressive performance improvements.

Altiscale, as a Hadoop as a Service provider, has seen customers write and deploy Spark applications. They’ve put together some guidelines for when it’s worth considering Spark instead of MapReduce. Among them: consider Spark when you need one of its machine learning or graph algorithm implementations, or when existing MapReduce jobs are slow or implement iterative algorithms.


A post on SiliconANGLE looks at a number of surveys and reports about Hadoop adoption and usage. It tries to answer questions about who is using Hadoop, how it’s being used (e.g. SQL vs search), and in which industries Hadoop adoption is the strongest.

Apache Spark has gained a lot of momentum over the past year, and a lot of folks see it as an evolutionary replacement of MapReduce. A post on Datanami suggests three areas that Spark needs to improve in order for that to happen. The areas are high-end scalability (thousands or tens of thousands of nodes), publication of successful case studies, and short- and long-term backwards compatibility (learning from Hadoop’s mistakes).

Databricks and O’Reilly have announced a new certification program for Apache Spark developers. The program includes an exam that validates technical expertise in Spark. The first set of certifications will be done around Strata NY + Hadoop World in October.

This article looks at the role that Hadoop plays in the healthcare industry. Notably, while centralizing data in Hadoop has a lot of advantages for analysis and generating insight, it also adds security risk (since all your data is in one place). The article includes a brief look into how the Children’s Hospital Los Angeles is using Hadoop.

The Amplify Partners Data & Analytics Fellowship covers the cost of travel and registration to the upcoming Strata Conference. Applicants should “demonstrate passion and potential to meaningfully contribute to the field.” Amplify is giving extra consideration to individuals who do not typically have access to these types of opportunities. Applications are due on September 30th.

Cloudera and Dell announced that Dell has joined the Cloudera systems integrator program. More details on the program and Dell’s offering are in the press release.


Apache Tez 0.5.0 was recently released. It includes a number of improvements across APIs, documentation, security, and more. More details on the improvements of the release can be found in the original release announcement on the Tez mailing list as well as in a post on the Hortonworks blog.

Adding to recent buzz around Apache Tez, SequenceIQ has announced a new Docker image to run Tez. It builds on the Ambari Docker image, and there’s a script to deploy a multi-node cluster.

A new beta release of Apache Drill, version 0.5.0, was announced. Drill is a SQL system for Hadoop (and for data stored in other places). The new version uses Hadoop 2.4.1, improves sorting when data doesn’t fit in memory, and includes several other enhancements. The release resolves over 100 issues, and the Drill team is aiming to do monthly releases as they march towards GA.

After the recent release of Apache Spark 1.1.0, the folks at SequenceIQ have published a new Docker image for that project. Running Spark in a Docker container can be a great way of getting started, and the post has a few examples of running Spark jobs with it.

MapR announced that they’re including the recently released Apache Drill 0.5 in their distribution. An introductory post on the MapR blog provides a number of reasons why you might want to adopt Drill. At the top of the list is Drill’s ANSI SQL support, which eases integration with other systems. It also highlights several features that (as a whole) differentiate Drill, such as query without centralized schema, support for nested data, and out-of-the-box ease of use.

In addition to adding support for Apache Drill, the 4.0.1 release of the MapR Distribution includes updates to several ecosystem projects. Updates include Hadoop core 2.4.1, Spark 1.0.2, and HBase 0.98.4. Also, Apache Storm is certified with MapR 4.0.1 and Tez is in a developer preview.

Amazon Web Services announced a new EMR File System, which replicates metadata about the state of data in S3 to DynamoDB. S3 on its own is eventually consistent (particularly when doing prefix listings); by tracking metadata in DynamoDB, the EMR File System can provide a consistent view of data in S3. The post has some details on getting started with the EMR File System, which requires an initial sync command to initialize the data in DynamoDB.

EPIC is a new product from BlueData for provisioning Hadoop. It bundles KVM, RHEL, and cloud management from OpenStack. EPIC is certified with the Hortonworks distribution (HDP) and also supports Cloudera’s CDH. EPIC One, a single node version, is available now. The enterprise edition is still in beta but expected in Q4.

SequenceIQ has also released new Docker images for Apache Hadoop 2.5.1. Their post features some examples for interacting with a container running the image.


Curated by Mortar Data



Intro to Hadoop: Hype or Reality? You Decide, with Kevin Crocker (Palo Alto) - Tuesday, September 23

Women in Analytics September Event: Hadoop and Other DB Technology (San Francisco) - Thursday, September 25

How to Offload the ELT SQL in Your Data Warehouse into Hadoop Automatically (Mountain View) - Thursday, September 25

Discussion of Hadoop Use Cases vs. Runtime Environments, by Tom Phelan of BlueData (Los Angeles) - Thursday, September 25

Hadoop: Where Did It Come from and What's Next? by Eric Baldeschwieler (Pasadena) - Thursday, September 25


Getting Started with SQL on Hadoop (Seattle) - Tuesday, September 23

Seattle Scalability Meetup: Google, HWX, Zulily (Seattle) - Wednesday, September 24


EMR, S3, and Hadoop Use Cases (South Jordan) - Thursday, September 25


Large-Scale Analytics with Apache Spark (Saint Paul) - Monday, September 22


Tech Talk with Eddie Garcia, Info Security Architect at Cloudera (Southfield) - Tuesday, September 23


NKU Big Data Series: Hadoop 101 (Cincinnati) - Thursday, September 25


Jeff Holoman Presents on Cloudera Distribution of Apache Hadoop (Huntsville) - Wednesday, September 24


Centralized Logging: Industry First Approach to HBase Fans (Jacksonville) - Thursday, September 25

North Carolina

Making Business Decisions with SAS & Hadoop (Charlotte) - Wednesday, September 24


YARN (Pittsburgh) - Tuesday, September 23

New Jersey

Tableau Deep Dive: Big Data Visualization (Hamilton Township) - Tuesday, September 23


Full-Day Hadoop MapReduce Hands-On Workshop (Cambridge) - Friday, September 26


Hadoop User Group: YARN, Falcon, HBase... (Paris) - Monday, September 22


Third Spark Barcelona Meeting (CSIC) (Barcelona) - Monday, September 22


Real-Time Analytics Using Indexed MapReduce (London) - Thursday, September 25


Intro to Lambda Architectures & Development (Toronto) - Friday, September 26


SolrCloud, Solr + Hadoop 2 & Nutch Integration (Bangalore) - Saturday, September 27


HadoopKitchen (Moscow) - Saturday, September 27