Data Eng Weekly

Hadoop Weekly Issue #112

15 March 2015

There were lots of releases this week, including new versions of Apache Hive, Apache Kafka, and Apache Spark. Speaking of Spark, it is the focus of many articles this week covering topics such as tuning Spark jobs, how well Spark scales, and the growing popularity of Spark.


Technical

In the early days, Spark had a reputation for not scaling to large clusters. This presentation addresses those concerns by mentioning some of the largest Spark deployments and describing how those companies are using Spark. Specifically, one company is running on 8000+ nodes and ingesting 1PB+/day.

This post looks at the anatomy of a Spark job and describes several best practices for writing performant jobs. Specifically, it illustrates how to choose the right operator during a shuffle or join (groupByKey, reduceByKey, cogroup, secondary sort, etc.), how to avoid shuffles, and when more shuffles can actually make a job faster.
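The shuffle-volume difference between groupByKey and reduceByKey can be sketched without Spark at all. The following pure-Python toy (not actual Spark code; the word-count data and partition layout are invented for illustration) counts how many records would cross the shuffle boundary under each strategy:

```python
from collections import defaultdict

# Toy word-count records split across two "map-side" partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def group_by_key(parts):
    # groupByKey-style: every record crosses the shuffle boundary,
    # and values are only combined after the shuffle.
    shuffled = [kv for part in parts for kv in part]
    grouped = defaultdict(list)
    for k, v in shuffled:
        grouped[k].append(v)
    return {k: sum(vs) for k, vs in grouped.items()}, len(shuffled)

def reduce_by_key(parts):
    # reduceByKey-style: combine locally within each partition first,
    # so at most one record per key per partition is shuffled.
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for k, v in part:
            local[k] += v
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for k, v in shuffled:
        totals[k] += v
    return dict(totals), len(shuffled)

totals_g, moved_g = group_by_key(partitions)
totals_r, moved_r = reduce_by_key(partitions)
assert totals_g == totals_r  # same answer either way
print(moved_g, moved_r)      # 7 records shuffled vs. 4
```

Both paths produce identical totals; the map-side combine is purely a data-movement optimization, which is why it matters most on wide keys with many repeated values.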

Apache Ranger is a security system for Hadoop that enables fine-grained access control and auditing for Hive, HBase, HDFS, and more. A post on the Hortonworks blog describes three use-cases involving Ranger and Hive—HiveServer2 with HDFS files, Pig and MapReduce via HiveServer2, and the Hive CLI. The post has several screenshots of the Ranger UI that demonstrate both the policy editor (for access controls) and the audit history.

The Cloudera blog has the second post in a series on HDFS recovery designs. The first part covered lease and block recovery, and this part covers pipeline recovery. Pipeline recovery refers to recovering from failures while writing data to HDFS, which involves streaming data through multiple DataNodes (a pipelined write) for durability. The post looks at three failure scenarios—during pipeline setup, during data streaming, and while closing the pipeline.

There have been a few posts in recent issues describing how to use Apache Kafka as a distributed event log. This post describes how to implement one of those ideas: a materialized view built from the contents of a change log written to Kafka. The post describes a simple HTTP API and server written in Clojure and talking to Redis. In addition to interfacing with Redis, all operations emit records to Kafka. The post has an example consumer that replays the operations to build the materialized view.
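The replay idea is simple enough to sketch in a few lines of pure Python (the post's implementation is in Clojure; the record shape and key names below are invented stand-ins for messages consumed from a Kafka topic):

```python
# Hypothetical change-log records as (offset, operation, key, value) tuples —
# stand-ins for messages read in order from a Kafka topic partition.
changelog = [
    (0, "set", "user:1", {"name": "Ada"}),
    (1, "set", "user:2", {"name": "Grace"}),
    (2, "set", "user:1", {"name": "Ada Lovelace"}),
    (3, "delete", "user:2", None),
]

def replay(records):
    """Rebuild the materialized view by replaying the log from offset 0.

    Later records win, so the view always reflects the latest state of
    each key — the same result no matter when the consumer starts, as
    long as it replays from the beginning.
    """
    view = {}
    for _offset, op, key, value in records:
        if op == "set":
            view[key] = value
        elif op == "delete":
            view.pop(key, None)
    return view

view = replay(changelog)
print(view)  # {'user:1': {'name': 'Ada Lovelace'}}
```

Because the log is the source of truth, the Redis-backed view can be dropped and rebuilt at any time by re-consuming the topic from the earliest offset.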

Apache Spark 1.3.0 was released this week (more details below). One of the main features of the release is an improved integration between Spark Streaming and Kafka that uses a write-ahead log to avoid data loss under failure. The docs accompanying the 1.3.0 release have a great overview of the new feature, including how to configure, build, and deploy jobs using the integration.

This post describes how to provide passwords to a Sqoop job without specifying them directly as part of the command. Solutions include a wrapper script that forwards to stdin, a password file, and a custom PasswordLoader implementation.

In the second part of a three-part series on anomaly detection in healthcare data, this post describes how to compute similarity graphs using Apache Pig. After cleaning up the data with some Python scripts, the Pig script finds pairs of healthcare providers that have a cosine similarity above 85%. The code for the post is available on GitHub.
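The similarity computation itself is compact. Here is a plain-Python sketch (the provider names and per-procedure count vectors are invented, and the real pipeline does this in Pig at scale) that keeps only the pairs above the post's 85% cosine-similarity threshold:

```python
import math
from itertools import combinations

# Hypothetical per-provider procedure-count vectors; the post derives
# these from healthcare claims data, here they are hard-coded.
providers = {
    "prov_a": [3, 0, 1, 4],
    "prov_b": [3, 0, 1, 5],
    "prov_c": [0, 7, 0, 1],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Edges of the similarity graph: provider pairs above the 85% threshold.
edges = [
    (p, q, cosine(providers[p], providers[q]))
    for p, q in combinations(sorted(providers), 2)
    if cosine(providers[p], providers[q]) > 0.85
]
print(edges)
```

With this toy data, only prov_a and prov_b end up connected; prov_c's activity profile points in a different direction, so its similarity to the others falls well below the threshold.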


News

This post describes eight features of Apache Spark that have helped it gain traction in the big data ecosystem. Notably, there are a number of points around how Spark works well outside of Hadoop—folks are using Spark on data stored in the Lustre and GPFS file systems instead of HDFS, and Spark works well with Mesos and other workload management systems.

This post gives a good tour of big data technologies in the Hadoop ecosystem. It's a follow-up to a similar post from last year, and it highlights several new projects that have gained traction. These include Phoenix (SQL on HBase), Kafka, Falcon, Mesos, Docker, and more.

Data governance is a key feature in many enterprise deployments of data systems. This post on the Cloudera blog discusses some of the challenges related to data governance in Hadoop. In particular, Hadoop clusters often have data in many different formats, allow data access from many different systems, and have a very large scale.

This article discusses the results of recent surveys on Hadoop, which suggest that Hadoop isn't helping companies generate the value or savings that they set out to create. With that said, the author is bullish that Hadoop is the future since companies ahead of the curve in terms of data volumes are using similar technologies.

MapR has announced a new free, on-demand training course for HBase. The course is called "HBase Data Model and Architecture" and includes lectures, hands-on labs, and more.

Given HBase's recent 1.0 release, this is a timely post on the state of HBase (based on conversations with folks at Cloudera). The post describes HBase adoption among Cloudera customers, gives some historical perspective on HBase development, and discusses several items on (and not on) the HBase roadmap. In addition, there's a brief discussion of Phoenix and Trafodion, both of which add SQL atop HBase.

This post describes best practices and tools for automating Hadoop deployments, securing Hadoop, and building a search index atop data in HDFS.

I don't think anyone would dispute that Spark has grown into a key part of big data processing technology. This post helps put that growth into context by gathering metrics from several sources—posts on Stack Overflow, posts on Hacker News, and Google Trends. There's also a look into how Spark is being used—SQL, Python, and Scala top the list among Databricks Cloud users.


Releases

This project provides base classes for writing tests for Spark (version 1.1+) jobs. It instantiates a local-mode Spark context before each test and tears it down afterward.

Apache Hive 1.1.0, which contains over 300 fixed issues, was released this week. Highlights include improvements for Parquet (nested types, memory manager), Java 8 support, and improvements to Avro support.

Apache Lens (incubating) is an analytics tool which integrates with many backends—from traditional data warehouses to Hadoop. This week, version 2.0.1-beta-incubating was released, which is the first release as an incubator project.

Apache Tajo, the SQL-on-Hadoop project, released version 0.10.0 this week. Major features of the release are: direct JSON file support, HBase integration, an improved JDBC driver, and improved support for Amazon S3.

Pinterest has open-sourced their workflow management tool, Pinball. The tool is written in Python, provides a web UI, integrates with Hadoop as well as non-Hadoop jobs, and follows a master-worker paradigm.

A new version of Apache Kafka was released this week to address four critical issues in the previous release. The Confluent blog has a detailed description of the fixes.

Apache Spark 1.3.0 was released this week. The release contains a number of core improvements, the new DataFrames API, graduation of Spark SQL from alpha, several new algorithms for Spark MLlib, improved integration with Kafka for Spark Streaming, and more. The Spark release notes have a great overview of the new features.

ImpalaToGo is a fork of Impala that aims to remove its dependency on Hadoop, targeting data stored in Amazon S3 or the Tachyon file system.


Curated by Datadog



Events

California

Machine Learning at American Express (Palo Alto) - Wednesday, March 18

Apache Kylin: Extreme OLAP Engine for Big Data (San Jose) - Wednesday, March 18

New Workflows for Building Data Pipelines (San Francisco) - Wednesday, March 18

Spark Notebook and Rapture Workshops (San Francisco) - Thursday, March 19


Colorado

Data Streaming Technology Overview (Denver) - Wednesday, March 18


Kentucky

Introduction to the Advantages of an EMC Data Lake (Louisville) - Thursday, March 19


Tennessee

Nashville Big Data & Hadoop Night (Nashville) - Thursday, March 19

North Carolina

Triad Hadoop Users Group (Winston Salem) - Thursday, March 19


Ohio

Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, March 16

New York

Scalding @ Tapad: Gaining Consumer Insight from Billions of Data Points (New York) - Monday, March 16

Spark Summit East 2015 Warmup (New York) - Tuesday, March 17

Data Driven NYC (New York) - Tuesday, March 17

Nathan Marz: The Inherent Complexity of Stream Processing (New York) - Wednesday, March 18

Real-time Analytics with es-hadoop / Running Elasticsearch at iHeartRadio (New York) - Wednesday, March 18

Spark DataFrames and ML Pipelines for Large-Scale Data Science (New York) - Wednesday, March 18

Spark DataFrames + Spark on Google's GCP (New York) - Thursday, March 19

Spark MLlib: Making Practical Machine Learning Easy and Scalable (New York) - Thursday, March 19


Canada

Lambda-ssandra: Lambda Architecture Backed by Cassandra (Toronto) - Tuesday, March 17

Lessons Learned in Building a Spark Distribution (Montréal) - Thursday, March 19


United Kingdom

Apache Spark: Living the Post-MapReduce World (London) - Tuesday, March 17


Sweden

Druid at Criteo and History of Hadoop at Spotify (Stockholm) - Wednesday, March 18


Norway

Intro to Hadoop (Tønsberg) - Wednesday, March 18


Finland

Hadoop at Eficode (Helsinki) - Tuesday, March 17

Putting Apache Spark to Life (Espoo) - Friday, March 20


Czech Republic

First Hadoop Meetup (Prague) - Thursday, March 19


Australia

Bright Spark + Oracle Big Data Discovery (Sydney) - Thursday, March 19

Hadoop+Strata 2015 San Jose Recap (Melbourne) - Monday, March 16

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit