Data Eng Weekly

Hadoop Weekly Issue #149

13 December 2015

The two themes for this week seem to be comparisons and compatibility. On the former, there are articles comparing Spark to Drill and Hadoop as well as Hive to MySQL. Regarding compatibility, MapR announced MapR Streams, which is API-compatible with Kafka, and there's a post on Flink's recently announced support for Storm topologies. In addition, Confluent Platform 2.0 is out, and this week's issue has the first (probably of many) articles looking back on the year of Hadoop. Finally, SCALE has an interview with Doug Cutting, which ties many of these topics together in the context of open-source and the evolution of the Hadoop ecosystem.


While I wouldn't consider Amazon Redshift a part of the Hadoop ecosystem per se, it's often used in conjunction with Amazon EMR or other cloud-based Hadoop solutions. With that in mind, here are some useful tips for tuning a Redshift cluster for better performance.

The MapR blog has a comparison of Apache Drill and Apache Spark. It notes that the major difference is that Drill is SQL-first, while Spark supports several query mechanisms, of which SQL is one. Following from that difference, Drill goes further on SQL: it supports ANSI SQL, keywords for nested and array data (which are useful for querying JSON), and views.

This post describes the Hive architecture, schema-on-read, schema-on-write, and some recommendations on when to use Hive and when to use MySQL.
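For readers who want a concrete picture of the two schema models, here is a minimal sketch in plain Python. It is not taken from the post: SQLite stands in for a schema-on-write database such as MySQL, a generator over raw JSON lines stands in for a schema-on-read engine such as Hive, and the record fields (`user`, `clicks`, `country`) are hypothetical.

```python
import json
import sqlite3

# Raw records as they might land in a file; the field names are hypothetical.
RAW_LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": "7", "country": "NZ"}',
]


def schema_on_write(lines):
    """Validate and type each record while loading it into a fixed table."""
    conn = sqlite3.connect(":memory:")  # SQLite is a stand-in for MySQL here
    conn.execute(
        "CREATE TABLE clicks (user TEXT NOT NULL, clicks INTEGER NOT NULL)"
    )
    for line in lines:
        rec = json.loads(line)
        # Fields outside the declared schema (e.g. "country") are dropped at load time.
        conn.execute(
            "INSERT INTO clicks VALUES (?, ?)",
            (rec["user"], int(rec["clicks"])),
        )
    return conn


def schema_on_read(lines, columns):
    """Leave the raw lines untouched and apply a schema only at query time."""
    for line in lines:
        rec = json.loads(line)
        # Missing fields surface as None instead of failing the load.
        yield tuple(rec.get(col) for col in columns)


conn = schema_on_write(RAW_LINES)
total = conn.execute("SELECT SUM(clicks) FROM clicks").fetchone()[0]
countries = list(schema_on_read(RAW_LINES, ["user", "country"]))
```

In this sketch the schema-on-write path pays the validation cost once at load time and loses the `country` field, while the schema-on-read path keeps the raw data and can project a column that nobody planned for when the data was written.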

Apache Flink 0.10 added beta support for compatibility with Apache Storm. Using this support, a Storm topology can run on Flink nearly as-is (it must be converted to a Flink topology, which requires changing a few lines of code). In addition, existing Storm Spouts and Bolts can be embedded inside a Flink topology. This post describes the integration and gives examples of both features.

This presentation describes how the team at Magnetic has scaled Spark. The slides are somewhat sparse, but they mention how Magnetic is using AWS (they're slowly migrating there from a colo) with details on instance types and auto-scaling.

This presentation describes how Treasure Data does data analytics. As a Ruby shop, Treasure Data uses a mix of languages for their platform. For collecting data, they use fluentd and embulk, and they use Hive and Presto for much of their processing. The presentation describes how they coordinate processing (e.g. PerfectSched and PerfectQueue) and describes several other tools they use (such as MessagePack).

Cloudera CDH 5.5 has support for Apache HTrace (incubating), which can provide granular details about timings of HDFS operations. This post describes how to set up HTrace and htraced (from Cloudera Labs) to record this information and view it with the included web front-end.


ReadWrite web has an article arguing that Hadoop and Spark will continue to coexist for the foreseeable future. Reasons include Spark's lack of a file system (Hadoop provides HDFS) and Hadoop YARN, which can provide a platform for other compute frameworks including various SQL-on-Hadoop systems.

SCALE has an interview with Hadoop creator Doug Cutting. Topics covered include defining Hadoop as a collection of projects, the addition of Kafka and Spark to this core collection, the rise of Spark across several industries (and with many companies behind it), monetizing open-source big data projects, and the comparison between open-source and proprietary technology from Google.

PCWorld has an article comparing Hadoop and Spark, which reiterates that the two complement each other well. The article also describes some of the ways they differ (Spark is faster, and the two have different failure recovery modes) and notes that they can be used independently.

Apache Kylin, the OLAP big data system for Hadoop, has graduated from the Apache Incubator. The release notes that Kylin is used by several big companies, such as eBay and Meituan.

The MSDN blog has a list of resources about using Azure for data science. In addition to several articles and tools (such as HDInsight for running Hadoop), the post highlights the Azure for Research Award program under which academic and research institutions can apply for research awards of Azure resources.

This is the first of (likely) many posts reviewing 2015 and looking ahead to 2016. It highlights the rise of Spark, the shift towards SQL (and a couple of new SQL-on-Hadoop engines), the rise of highly scalable machine learning libraries, and more. Looking ahead, the author predicts that appliances and cloud will drive Hadoop adoption, integration of machine learning into analytics tools will improve, and data lakes will start to grow in number.


Hortonworks has announced support for Apache Spark 1.5.2 for their distribution. The 1.5.x release line has big speedups for the DataFrame/SQL system, several improvements for Machine Learning APIs, improvements to Spark Streaming, and more.

Confluent has announced version 2.0 of the Confluent Platform, which packages Apache Kafka 0.9. The new version includes improvements to security, new Kafka connectors (for streaming data into and out of Kafka to/from sources like HDFS and JDBC), new and improved clients, and more.

MapR has announced MapR Streams, a new streaming product that's integrated with MapR's existing data platform. MapR Streams provides the Kafka API and is compatible with Spark Streaming, Storm, Flink, and Apex.


Curated by Datadog



Using Spark, GraphX, and Zeppelin to Analyze Clickstream Data (San Francisco) - Monday, December 14

Faster Than Parquet! A Deep Dive into Kudu (San Francisco) - Tuesday, December 15

#OCBigData Holiday Party 2015 (Irvine) - Wednesday, December 16

In-Memory Computing with Apache Ignite (Sunnyvale) - Wednesday, December 16


Going from Hadoop to Spark, Kept Simple (Houston) - Thursday, December 17

Houston's 1st Spark Meetup (Houston) - Thursday, December 17


Interactive Data Analytics with Flink and Zeppelin (Chicago) - Tuesday, December 15


Modern Data Management Practices (Alpharetta) - Wednesday, December 16

North Carolina

Kudu: New Hadoop Storage for Fast Analytics on Fast Data (Charlotte) - Wednesday, December 16

New York

Data Driven NYC #42 (New York) - Monday, December 14

Hadoop & Spark Panel Discussion (New York) - Monday, December 14

Greenplum Database: The First Open Source Data Warehouse (New York) - Wednesday, December 16


Toronto Apache Spark #4 (Toronto) - Monday, December 14

IRELAND

Storm, Spark Streaming + Prometheus Monitoring + Spark/Akka for Data Generation (Dublin) - Monday, December 14


London Big Data Meetup - Dec2015 (London) - Monday, December 14


Flink Meetup #12 (Berlin) - Wednesday, December 16


Apache Spark in Theory and Practice (Belgrade) - Friday, December 18


Apache Spark Workshop (Cluj-Napoca) - Wednesday, December 16


Spark on Mesos: "The Road Less Travelled" & Profiling Users Using Spark (Tel Aviv-Yafo) - Tuesday, December 15


Introduction to Apache Spark (Hyderabad) - Saturday, December 19


Introduction to Apache Spark: Lightning-Fast Cluster Computing (Kathmandu) - Saturday, December 19