Data Eng Weekly

Hadoop Weekly Issue #128

05 July 2015

It was a long weekend in the US this week, but there's still quite a bit of great content. In addition to several technical posts on Spark and Drill, there are some great news articles and several releases. In particular, the interview with AMPLab's Michael Franklin and the post on Hadoop ecosystem projects are both highly recommended.

Technical

The MapR blog has an introduction to Spark DataFrames that uses two sample data sets (eBay auctions and the SFPD Crime Incident Reporting system) to illustrate them. The demo shows two strategies for building a DataFrame from a text file—the first using the RDD.toDF() method and the second using the spark-csv library. It also demonstrates how to use explain() to see the physical plan for materializing a DataFrame.

This post entitled "The Tragedy of Tez" describes the weakness (which is also the strength) of Tez. Namely, Tez is close to the MapReduce paradigm, which makes it easy to integrate into existing projects but also doesn't offer as big a leap as other projects like Spark. There's a lot of background on the project in the post and the comments have some more discussion.

This post is a good introduction to Apache HBase. It describes the data model, how HBase scales horizontally, and more. There are useful visualizations to help understand how HBase compares to an RDBMS and how data is stored in HBase across rows, column families, and columns.
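The logical layout the post describes can be sketched with plain Python dictionaries: a table maps a row key to column families, each family maps column qualifiers to timestamped versions, and reads return the newest version by default. This is a toy model for illustration only (all names are invented); a real client would talk to a region server.

```python
# Toy model of HBase's logical layout:
# table -> row key -> column family -> column qualifier -> {timestamp: value}.
# Illustrative only; it ignores regions, HFiles, and everything physical.

table = {}

def put(row_key, family, qualifier, value, ts):
    """Insert a versioned cell, mirroring HBase's sparse, multi-version cells."""
    row = table.setdefault(row_key, {})
    fam = row.setdefault(family, {})
    cell = fam.setdefault(qualifier, {})
    cell[ts] = value

def get_latest(row_key, family, qualifier):
    """HBase reads return the newest version by default."""
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]

put("user#42", "info", "name", "Ada", ts=1)
put("user#42", "info", "name", "Ada L.", ts=2)   # a newer version of the same cell
put("user#42", "stats", "logins", "7", ts=1)     # a different, sparse column family

print(get_latest("user#42", "info", "name"))  # newest version wins
```

Note how rows are sparse: each row carries only the columns it actually has, which is one of the key contrasts with a fixed-schema RDBMS table.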

The Duchess France blog has an introduction to and collection of exercises for getting started with Apache Spark. The exercises are available both in Scala and Java 8, and they cover the core Spark API, Spark Streaming (consuming the Twitter firehose), and the Spark DataFrame API.

The MapR blog also has two tutorials for Apache Drill. The first demonstrates how to use Drill to process delimited data in a file on disk, convert it to Parquet, and query it using the embedded Drill engine. The second looks at how to access data via Drill's ODBC drivers from Python, R, and Perl.

The JW Player blog has a post on their data platform, and how they've overcome massive skew in their main MapReduce processing. JW Player collects data via nginx, loads it into Kafka, re-encodes data as Avro, aggregates using MapReduce, and writes output to archival storage. The aggregation step collects statistics at many different layers, such as geo/video/device, and the makeup of high-level networks resulted in major skew. The post describes their current solution for this problem.
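One common remedy for this kind of hot-key skew (not necessarily the one JW Player chose — the post has their details) is key salting: append a random suffix so a hot key fans out across many reducers, then merge the partial aggregates in a second pass. A minimal stdlib sketch, with invented names and numbers:

```python
import random
from collections import Counter, defaultdict

# Key salting for a skewed aggregation: one "hot" key dominates the
# traffic, so we shard each key across SALT_BUCKETS partial keys
# (each of which can land on a different reducer) and merge afterwards.

SALT_BUCKETS = 4

def salted_key(key):
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

events = ["hot-network"] * 1000 + ["small-network"] * 10

# "Map" phase: count per salted key.
partials = Counter(salted_key(k) for k in events)

# Second pass: strip the salt and merge partial counts.
totals = defaultdict(int)
for skey, count in partials.items():
    key, _, _ = skey.rpartition("#")
    totals[key] += count

print(dict(totals))  # the merged totals match the unsalted counts
```

The cost of the technique is the extra merge pass; the benefit is that no single reducer has to absorb the whole hot key.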

The Cloudera blog has a post with several practical suggestions for deploying Kafka. It covers topics like SSDs, encryption, the role of ZooKeeper in storing offsets, cross-data center replication, suggested configurations for compression, brokers, and consumers, and more.
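A few of the areas the post discusses map onto Kafka configuration keys; the fragment below is an illustrative sketch (the values are placeholders to adapt, not recommendations from the post):

```properties
# server.properties (broker) -- illustrative values
log.dirs=/data/kafka-logs                # dedicated disks; the post weighs SSDs vs spinning disks
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
offsets.topic.replication.factor=3       # replication for the Kafka-backed offsets topic

# producer config -- compress on the producer for end-to-end savings
compression.type=snappy

# consumer config (0.8.2+) -- store offsets in Kafka rather than ZooKeeper
offsets.storage=kafka
```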

Apache Phoenix 4.4.0 added a phoenix-spark module for reading and writing data stored in Phoenix via Spark. This walkthrough shows an example of using the functionality by computing PageRank with Spark's GraphX on data in a Phoenix table (and writing the results back).

The Amazon Web Services blog has a guest post describing how Expedia integrated AWS Lambda, DynamoDB, EMR, and S3 for data processing. As data arrives in S3, a Lambda function updates state in DynamoDB and potentially triggers EMR jobs. There's a good overview of the Lambda JavaScript function and how to deploy the project. This seems like a compelling pattern for teams already running on AWS.
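The coordination pattern itself is easy to sketch: each S3 object-created event invokes a handler that records arrival state and, once all expected inputs are present, launches the downstream job. The sketch below is plain Python with in-memory stand-ins for DynamoDB and EMR (Expedia's actual routine is JavaScript; all names here are invented):

```python
# Event-driven coordination sketch: S3 object-created events update a
# state table, and once a batch is complete an EMR job is "launched".
# The dict and list below stand in for DynamoDB and the EMR API.

state_table = {}    # stand-in for DynamoDB: batch_id -> set of arrived keys
launched_jobs = []  # stand-in for EMR job submissions

EXPECTED_PARTS = {"part-0", "part-1", "part-2"}

def handle_s3_event(batch_id, key):
    """Called once per S3 object-created event, as a Lambda would be."""
    arrived = state_table.setdefault(batch_id, set())
    arrived.add(key)
    if arrived >= EXPECTED_PARTS:  # all parts present: trigger the EMR step
        launched_jobs.append(batch_id)

for part in ["part-0", "part-1", "part-2"]:
    handle_s3_event("2015-07-05", part)

print(launched_jobs)  # the job fires only after the final part arrives
```

A production version would also need to handle duplicate events and use a conditional write so concurrent Lambda invocations don't launch the job twice.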

One of the novel capabilities of MapReduce is the flexibility to process data in many formats, since input is read from arbitrary files. A traditional RDBMS, by contrast, requires table definitions before any data is loaded. There are advantages to both strategies (schema on read and schema on write), and this post describes and contrasts them in detail.
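The contrast fits in a few lines: schema-on-write validates and converts records against a declared schema at load time, so bad rows fail fast; schema-on-read stores raw text and lets each query impose whatever structure it needs. A simplified stdlib sketch (names are illustrative):

```python
import csv
import io

raw = "1,alice,2015-07-05\n2,bob,2015-07-06\n"

# Schema on write: validate and convert while loading; the schema is
# fixed up front, and malformed rows are rejected at load time.
def load_with_schema(text):
    table = []
    for row in csv.reader(io.StringIO(text)):
        table.append({"id": int(row[0]), "name": row[1], "date": row[2]})
    return table

# Schema on read: keep the raw lines as-is; each query applies its own
# (possibly different) structure when it runs.
def query_names(text):
    return [line.split(",")[1] for line in text.splitlines()]

users = load_with_schema(raw)
print(users[0]["id"], query_names(raw))
```

The trade-off mirrors the post: schema-on-write buys early validation and faster typed queries, while schema-on-read buys flexibility when formats vary or evolve.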

News

The IBM Big Data blog has a recap of a recent #SparkInsight CrowdChat. The post collects popular answers to questions ranging from what Spark is (and which parts are important) to how mature Spark is and how it's likely to evolve.

S C A L E has a two-part interview with AMPLab co-creator and UC Berkeley professor Michael Franklin. In the article, Michael discusses the origins of AMPLab, what has positioned it to spin off successful projects and companies, why he thinks Spark has taken off so well, plans for making machine learning easy to use, how database architecture has changed in recent years, the importance of SQL, and much more.

The Hortonworks Gallery is a collection of Ambari views/extensions, big data tutorials, and sample big data applications. The gallery is open source and powered by GitHub, so users are encouraged to contribute via pull requests. Some initial entries include an Ambari extension for deploying OpenTSDB, a tutorial for Apache Spark, and a real-time data ingestion sample application.

Gartner is trying to help pin down the definition of Hadoop (which is something that I struggle with a lot for content in this newsletter). To that end, this post describes the expansion of the number of projects in the ecosystem, and it contains a matrix of which 39(!) projects/products are supported by six of the major vendors.

The agenda for the upcoming Strata + Hadoop World NYC, which takes place September 29 through October 1, has been posted. There are three days of training and two days of keynotes and sessions.

Releases

Apache Hive 1.2.1 was released this week. The new version contains a number of bug fixes and some performance improvements.

Corc is a new open-source project for reading and writing data in the Apache ORC file format from within Cascading. The implementation supports all ORC types and optimizations like column projection and predicate pushdown, and it can read data from Hive's ACID datasets.

Apache Falcon, the feed management and processing system, released version 0.6.1, the first release since Falcon became a top-level project. The Hortonworks blog summarizes the key improvements: a web-based user interface for building feeds (rather than hand-writing XML), Hive replication that preserves metadata (like views and annotations), and a new UI for Hive/HDFS replication.

Apache Tajo, the data warehouse system for Hadoop (and more), released version 0.10.1 this week. The release includes a number of bug fixes and improvements.

Apache Accumulo 1.5.3, the latest bug-fix release for the 1.5.x branch, was announced this week. Key changes include disabling of SSLv3 to secure against POODLE and several stability-related bug fixes.

Cloudera Enterprise 5.4.3 was released, with fixes for YARN rolling upgrades, a potential data loss bug, and a speedup for NameNode startup. The release also includes fixes to Cloudera Manager and Navigator.

GridGain announced GridGain Enterprise Edition v7.1 and GridGain Community Edition v1.1. The in-memory compute framework is powered by Apache Ignite (incubating), and the new version adds several new features. These include a mechanism to share state in-memory across Spark jobs, an integration with Mesos and YARN, and an integration with Apache Zeppelin (incubating).

Events

Curated by Datadog


California

Spark Trends and Spark Analytics (Los Angeles) - Wednesday, July 8

Semantic Indexing of 4 Million Documents with Apache Spark (San Francisco) - Thursday, July 9

SF / East Bay Area Stream Processing Meetup (Emeryville) - Thursday, July 9

DataFrame: Spark's New Abstraction for Data Science, by Reynold Xin of Databricks (Redondo Beach) - Thursday, July 9

Texas

An Introduction to Apache Drill (Addison) - Monday, July 6

Michigan

Big Data Discovery: Leveraging Oracle GraphX DB and Spark! (Grand Rapids) - Wednesday, July 8

Georgia

A Java Developer’s Companion to the Hadoop World (Atlanta) - Thursday, July 9

United Kingdom

Hadoop: Big Data or Big Deal? (London) - Monday, July 6

Distributed Stream Processing (London) - Thursday, July 9