Data Eng Weekly

Hadoop Weekly Issue #109

22 February 2015

There was quite a bit of news this week with the announcement of the Open Data Platform, Pivotal open-sourcing several systems, and announcements related to Strata+Hadoop World. I've highlighted a few major announcements (there were too many to cover all in-depth), and I've also found a number of interesting technical articles covering Spark, Kafka, Cascalog, and more.


This post provides one of the best descriptions of a Data Lake that I've seen. It also talks about several common problems with, misconceptions of, and best practices for productionizing a data lake.

The O'Reilly Radar blog has a post describing several compute frameworks for Hadoop--everything from SQL to machine learning to real-time. The post describes the key considerations for choosing a framework and gives some guidance as to when to use each.

Apache Spark is adding a new DataFrames API, which is inspired by data frames in R and Pandas (Python). DataFrames are like a table in an RDBMS, but contain additional optimizations. In particular, materialization of DataFrames uses the Spark SQL optimizer and code generation framework. There are more details on the API, which is planned for Spark 1.3, in the introductory post.

The blog has a walkthrough of a new feature in Kite 0.18.0, which allows importing of data using custom InputFormats.

Answers is a near real-time mobile app analytics system built by Crashlytics/Twitter. The Twitter blog has a post describing the architecture of the system, which ingests billions of events per day. The system implements the Lambda architecture, using Kafka as the messaging layer, Storm for the speed layer, and EMR with Cascading for batch computation.
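For readers new to the pattern, here is a minimal sketch of the Lambda architecture idea in plain Python. It is not how Answers is implemented: the event data, the function names, and the in-memory counters are all assumptions made for illustration, standing in for the batch layer, the speed layer, and the merge that happens at query time.

```python
from collections import Counter

# Hypothetical event stream of (timestamp, app_id) pairs; purely illustrative.
EVENTS = [
    (1, "app_a"), (2, "app_b"), (3, "app_a"),
    (4, "app_a"), (5, "app_b"),
]


def batch_view(events, batch_cutoff):
    """Batch layer: recompute counts from every event up to the last batch run."""
    return Counter(app for ts, app in events if ts <= batch_cutoff)


def speed_view(events, batch_cutoff):
    """Speed layer: count only the events that arrived after the last batch run."""
    return Counter(app for ts, app in events if ts > batch_cutoff)


def query(app, events, batch_cutoff):
    """Serving side: merge the batch and speed views when a query arrives."""
    batch = batch_view(events, batch_cutoff)
    speed = speed_view(events, batch_cutoff)
    return batch[app] + speed[app]
```

Calling `query("app_a", EVENTS, batch_cutoff=3)` adds the two events covered by the batch view to the one newer event seen only by the speed view. In a real deployment the two views would live in separate systems, and the speed view would be discarded once the next batch run catches up.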

In last week's newsletter, there was mention of separating Spark from Hadoop. This week, Pinterest has written about just that--they're using Spark streaming with MemSQL for real-time analytics. The prototype system uses Spark streaming to take data from a Kafka topic, join it with dimensional data, and send the data to MemSQL.
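The shape of that pipeline (read events, join with dimensional data, write to a store) can be sketched without any of the real systems. The snippet below is an assumption-laden toy: a Python list stands in for the Kafka topic, a dict for the dimension table, and another list for the MemSQL sink. The field names (`item_id`, `category`) are hypothetical and do not come from the Pinterest post.

```python
# Hypothetical dimension table keyed by item id; stands in for dimensional data.
DIMENSIONS = {
    "i1": {"category": "food"},
    "i2": {"category": "travel"},
}


def enrich(events, dimensions):
    """Join each incoming event with its dimension row, if one exists."""
    for event in events:
        dim = dimensions.get(event["item_id"], {})
        # Merge the event fields with the matching dimension fields.
        yield {**event, **dim}


def run_pipeline(source, dimensions, sink):
    """Read from the source, enrich each event, and append it to the sink."""
    for row in enrich(source, dimensions):
        sink.append(row)
    return sink
```

Events with no matching dimension row pass through unchanged here; a production job would need to decide whether to drop, buffer, or flag them.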

The MSDN blog has a post about tuning performance of Sqoop jobs on Azure HDInsight. The suggestions are mostly distribution-independent (e.g. tuning number of map tasks, sizing the cluster and db properly), so it's a useful read if you're working with Sqoop.

The MongoDB blog has a tutorial on integrating MongoDB and Hive. The post describes how to use the MongoStorageHandler for Hive to query a Mongo-backed table.

This post describes how the components of the MapReduce API fit together and the role of each. Topics covered include InputFormats, RecordReaders, and OutputCommitters.
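As a rough illustration of how those roles divide up, here is a toy word count in plain Python. It is not the Hadoop API: the class and method names below are simplified stand-ins chosen for this sketch, and the real Java interfaces have more responsibilities (split locations, task versus job commit, and so on).

```python
from collections import defaultdict


class LineInputFormat:
    """Toy stand-in for an InputFormat: decides how input is divided into splits."""

    def __init__(self, lines, split_size):
        self.lines = lines
        self.split_size = split_size

    def get_splits(self):
        return [self.lines[i:i + self.split_size]
                for i in range(0, len(self.lines), self.split_size)]

    def create_record_reader(self, split):
        return LineRecordReader(split)


class LineRecordReader:
    """Toy stand-in for a RecordReader: turns one split into key/value records."""

    def __init__(self, split):
        self.split = split

    def __iter__(self):
        for index, line in enumerate(self.split):
            yield index, line


class ListOutputCommitter:
    """Toy stand-in for an OutputCommitter: output is visible only after commit."""

    def __init__(self):
        self.pending = {}
        self.committed = {}

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.committed.update(self.pending)
        self.pending = {}


def word_count(lines, split_size=2):
    input_format = LineInputFormat(lines, split_size)
    grouped = defaultdict(list)
    for split in input_format.get_splits():  # conceptually one map task per split
        for _, line in input_format.create_record_reader(split):
            for word in line.split():
                grouped[word].append(1)  # map output, grouped by key (the shuffle)
    committer = ListOutputCommitter()
    for word, ones in grouped.items():  # reduce phase
        committer.write(word, sum(ones))
    committer.commit()  # results are promoted only once the work succeeds
    return committer.committed
```

The point of the separation is that each piece can be swapped independently: a different input format changes how data is split and parsed without touching the map or reduce logic, and the committer controls when output becomes visible.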

Netflix recently announced the Surus project, which is an open-source library of analysis tools for Pig and Hive. This week, they added the second function to the library: Robust Anomaly Detection (RAD). The Netflix blog has an overview of the goals of the tool, the algorithm it implements, and how it can be used via Apache Pig.

This presentation describes best practices for building a data architecture. It contains ideas like using Kafka as a data bus, directory layouts for datasets in HDFS, using Spark streaming, and schema management. Lots of tips for building a reliable and consistent system.

Cascalog, the Clojure library for Cascading, has recently added support for custom Hadoop counters (on master). This post describes how to update counters as part of a Cascalog job and how to access the counters programmatically afterwards.


The Strata+Hadoop World conference was this week in San Jose. Videos of the keynotes and select interviews have been published on YouTube. Included in the list is a keynote by President Obama and the U.S. Chief Data Scientist, Dr. DJ Patil.

TechTarget has an overview of the benefits of a Hadoop-powered data lake. The article looks at Allstate and Solutionary Inc, who have both recently created data lakes. Example benefits include the ability to look at country-level data (at Allstate) for the first time and using large-scale machine learning to identify when home inspections aren't necessary for a homeowners insurance policy.

Hortonworks, Pivotal, IBM, GE, Verizon, and others announced the "Open Data Platform" (ODP) this week. The goal is to standardize Hadoop ecosystem components and versions to ease interoperability across distributions. Companies such as Cloudera, which didn't join the ODP, have responded negatively to the announcement. There have been a number of articles about this topic, but I find the Gartner blog has one of the best takes on both sides of the argument.

Related to the ODP announcement, Pivotal and Hortonworks announced that they'll be "aligning efforts around Hadoop." As part of this, customers can choose to use either Pivotal HD or the Hortonworks Data Platform, and Hortonworks will provide advanced support for enterprise customers of both distributions.

Pivotal made another announcement this week which is easy to overlook given all the discussion around the Open Data Platform. The company is open-sourcing its Greenplum, HAWQ, and GemFire database products (and still offering licenses and support). Greenplum is the company's analytics data warehouse, HAWQ is the SQL engine for Hadoop, and GemFire is an in-memory distributed database.

Cloudera released information on company revenue and growth. They achieved ~100% year-over-year growth and over $100 million in revenue across 525 customers.

Datanami reports that Hadoop's lack of enterprise security features, including fine-grained access control, is limiting and sometimes preventing enterprise adoption. The post mentions some companies that are selling products to add security features.

Databricks and Intel announced a partnership to optimize Spark for Intel architecture. Intel's work on core Hadoop helped bring encryption-at-rest and other important features to the platform, so it should be interesting to see what comes of this partnership.

This post provides a recap of several themes that emerged at this week's Strata+Hadoop World. These include continued infatuation with Spark, security for Kafka, and a discussion around Spark streaming vs. Storm for stream processing.


Apache Cassandra 2.1.3 was released this week. The release contains over 100 fixes and improvements.

IBM announced several new modules for their BigInsights distribution. These include BigInsights Analyst (for integrating spreadsheets and visualizations with their SQL-on-Hadoop engine), BigInsights Data Scientist (for machine learning on large datasets), and BigInsights Enterprise Management (for managing resources and optimizing workflows).

Cloudera announced that Apache Kafka has graduated from Cloudera Labs and is now fully supported as part of Cloudera Enterprise. A technical post on the Cloudera blog describes how to deploy Kafka using CDH and includes some guidance for choosing hardware and sizing a cluster. It also describes various details of the architecture, such as replication, partitioning, and how to guarantee message delivery.

Microsoft announced availability of HDP 2.2, which includes Apache Storm, as part of their Azure HDInsight Hadoop-as-a-Service platform. They also announced a preview of HDInsight on Linux, which uses Apache Ambari for deployment.

Hadoop-as-a-Service company Altiscale announced two new features this week. First, Apache Spark has been fully integrated into their platform. Second, they're now offering secure-mode for Hadoop using Kerberos.

Qubole has also added support for Apache Spark to their Qubole Data Services platform.

Tableau announced support for Spark SQL as part of the 8.3.3 release of Tableau. The connector is certified by Databricks.

MapR announced version 4.1 of their distribution. Key features include bi-directional data replication between MapR-DB clusters in separate data centers, a POSIX client for loading data into MapR FS, and a new C API for MapR-DB.

Cloudera has released version 1.1 of Cloudera Director, their tool for provisioning CDH clusters in AWS. This release includes support for dynamically resizing a cluster and an integration with Amazon's RDS (database-as-a-service). The Cloudera blog has more details and enumerates features planned for the future.

Apache Gora is an in-memory data model and persistence framework for Apache HBase, Apache Cassandra, and several other data stores (both k/v and RDBMS). This week, version 0.6 was released. The release updates the versions of several of the dependencies (HBase, Avro, Hadoop, and more) that it supports.

Druid, the time-series database open-sourced by Metamarkets, recently switched from the GPL to the Apache license.


Curated by Datadog



Going from Hadoop to Spark: A Case Study (San Jose) - Monday, February 23

PredictionIO DASE Architecture with Spark MLlib (San Francisco) - Tuesday, February 24

The Lambda Architecture (Sunnyvale) - Wednesday, February 25

Hadoop Multi-Tenancy Panel Discussion (Sunnyvale) - Wednesday, February 25

Hadoop RDBMS (San Ramon) - Wednesday, February 25

Apache Drill: A Schema-free SQL Query Engine for Hadoop and NoSQL (Oakland) - Wednesday, February 25

What the Spark!? Intro and Use Cases (Mountain View) - Thursday, February 26

Introduction to Hadoop Security, with Roman Shaposhnik (San Francisco) - Thursday, February 26


Intro to Apache Spark (Portland) - Wednesday, February 25


Apache Storm Tech & Usecase (Troy) - Monday, February 23

Hadoop Usergroup Kickoff Meeting (Lansing) - Tuesday, February 24

North Carolina

Modern Data Integration: Paradigm Shift (Charlotte) - Wednesday, February 25


Rapid Prototyping in PySparkStreaming (Arlington) - Tuesday, February 24

Spark (Richmond) - Tuesday, February 24

Let's Talk Hadoop Operations (Dulles) - Wednesday, February 25

Big Data Security Analytics with Apache Spark and GraphX (Vienna) - Thursday, February 26

Apache Spark & Real-Time Analytics (McLean) - Thursday, February 26


Apache Spark and Amazon Workshop (Hanover) - Tuesday, February 24

New York

An In-Memory RDBMS as an Alternative to Storm (New York) - Wednesday, February 25


3 Spark Talks (Cambridge) - Monday, February 23

Spark 0 to Prod in 30 days; Leverage Hadoop 2.0 and YARN with Native Tools (Boston) - Tuesday, February 24


February Meetup: Open Presentation Sessions (Toronto) - Monday, February 23

IRELAND

Hadoop Introduction, Use Cases, Case Studies & Distributions (Dublin) - Monday, February 23

ENGLAND

Self-Service Data Exploration with Apache Drill (Manchester) - Wednesday, February 25


Spark Coding Dojo: Scala (Barcelona) - Thursday, February 26


Apache Kafka + Zookeeper = 2 Million Writes per Second (Hyderabad) - Saturday, February 28

Session on MapReduce with Python and Amazon EMR (Pune) - Saturday, February 28


High Performance Analytics on Top of Hadoop (Sydney) - Tuesday, February 24