Data Eng Weekly

Hadoop Weekly Issue #131

26 July 2015

This week's theme is new projects: Cloudera announced Ibis for Hadoop-scale data science with Python, Spree provides an alternative web UI for Spark, and Astro adds HBase support to Spark SQL. In addition, Hortonworks released HDP 2.3, and Cassandra 2.2 adds a number of interesting new features. There are also several articles that provide excellent guidance for folks starting to build out systems in Spark, Kafka, and Flink.


This post describes using HyperLogLog (HLL) for estimating things like click-through rates for ads. To run at scale, the post describes how to use HLL from Spark, including how to build a custom Spark aggregation function. Finally, it describes a system that uses a pre-built/cached SparkContext to power a REST API server to run Spark queries with 1-2s response times across 100GB of data.
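
For readers new to HLL, here is a minimal sketch of the idea in plain Python. It is not the code from the post: the class name, the SHA-1-based hash, the default precision, and the `estimate_ctr` helper are all assumptions made for illustration. The merge step (element-wise max of registers) is the property that makes HLL a natural fit for a distributed aggregation function.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from SHA-1 (an assumption; any well-mixed hash works).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches of the same size combine by taking the per-register maximum.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def estimate_ctr(clicks, impressions):
    """Hypothetical helper: ratio of distinct clickers to distinct viewers."""
    viewers = impressions.count()
    return clicks.count() / viewers if viewers else 0.0
```

With `p=12` the sketch keeps 4,096 small registers and has a typical relative error of roughly 1.6%, regardless of how many distinct items it has seen.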

This post describes the tradeoff between throughput and latency in the Kafka Producer and how to tweak the various producer settings to achieve desired performance.
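
As a rough illustration of that tradeoff, the sketch below contrasts two producer profiles. The keys (`acks`, `linger.ms`, `batch.size`, `compression.type`) are documented settings of the Java producer, but the values and the `to_properties` helper are assumptions chosen for illustration, not recommendations taken from the post.

```python
# Favor latency: send as soon as a record is available (linger.ms=0 and
# batch.size=16384 are the producer defaults).
LOW_LATENCY = {
    "acks": "1",
    "linger.ms": 0,
    "batch.size": 16384,
    "compression.type": "none",
}

# Favor throughput: wait briefly so that larger, compressed batches can form.
# These values are illustrative assumptions; tune them against real traffic.
HIGH_THROUGHPUT = {
    "acks": "1",
    "linger.ms": 50,
    "batch.size": 262144,
    "compression.type": "snappy",
}


def to_properties(config):
    """Hypothetical helper: render a config dict as .properties lines."""
    return "\n".join("%s=%s" % (key, config[key]) for key in sorted(config))


print(to_properties(HIGH_THROUGHPUT))
```

A larger `linger.ms` lets more records accumulate per request, which tends to raise throughput while adding up to that many milliseconds of delay before a record is sent.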

The MapR blog has a great introduction to Hive's new support for transactions. The post describes some use cases, the supported semantics, how to enable transaction support, and a brief overview of how it works (including how compactions clean up delta files).

Cloudera has announced a new project called Ibis, which aims to marry Python data analysis/science libraries with Cloudera Impala to build a fast, native distributed framework. The project will include LLVM integration for code generation from Python code. The Cloudera blog has two posts about it: the first announces the project and describes its goals, and the second includes getting-started instructions and more on contributing.

With the caveat (as always) that benchmarks should be taken with a grain of salt, Pivotal has posted part 1 of a comparison of HAWQ (their SQL-on-Hadoop) to Hive and Impala. The tests were performed on a 15-node cluster using the 30TB TPC-DS dataset. Among the highlights, HAWQ shows speed improvements over Hive and Impala, and it has full support for all of the TPC-DS queries (whereas Hive and Impala do not).

This post from the MapR blog compares MapReduce and Spark. It looks at the difference between the execution models, the differences in expressiveness (the Spark API having many more high-level operations like join and group by), and the library support (Spark includes machine learning, graph programming, and SQL as part of the core release).

This interview with some folks from dataArtisans about Apache Flink has some interesting details about the project. Topics include how it compares to Spark/Samza/Storm, Flink's approach to iterative processing, data streaming in Flink, and the Flink roadmap.

This post describes several parameters for making the best use of resources on a YARN cluster, including how to debug them. In addition to the basic memory settings, the post covers the virtual and physical memory checker, several common exceptions (and their solutions), and a few MapR-specific settings.
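
The memory checkers come down to simple arithmetic, sketched below as a simplified model. The function names are hypothetical. The 2.1 default is the documented default of `yarn.nodemanager.vmem-pmem-ratio`, and the two flags mirror `yarn.nodemanager.pmem-check-enabled` and `yarn.nodemanager.vmem-check-enabled`. The real NodeManager monitors the whole process tree of a container, which this sketch does not attempt to reproduce.

```python
def vmem_limit_mb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory ceiling for a container of the given physical size."""
    return container_mb * vmem_pmem_ratio


def exceeds_limits(container_mb, pmem_used_mb, vmem_used_mb,
                   vmem_pmem_ratio=2.1, pmem_check=True, vmem_check=True):
    """Return True if this simplified model would flag the container."""
    if pmem_check and pmem_used_mb > container_mb:
        return True
    if vmem_check and vmem_used_mb > vmem_limit_mb(container_mb, vmem_pmem_ratio):
        return True
    return False


# A 1024 MB container gets a virtual memory ceiling of about 2150 MB.
print(vmem_limit_mb(1024))
# Within physical memory but over the virtual ceiling: flagged.
print(exceeds_limits(1024, pmem_used_mb=900, vmem_used_mb=2400))
```

This is why a container can be killed for virtual memory even when its physical usage is well under its allocation. Raising the ratio or disabling the virtual memory check are the usual knobs.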


In comparison to the Spark vs. MapReduce article above, this post on Hadoop and Spark is aimed at a broader, non-technical audience. It describes the key differences between the two compute frameworks without getting too far into the implementation details.

The Wrangle Conference is a new event for data scientists taking place in San Francisco on October 22nd.

With another caveat about vendor benchmarks, ZDNet has a summary of a recent VMware benchmark that shows Hadoop in VMware is just as fast as bare metal, even when running 4 VMs per instance.

InfoWorld has an article about Hadoop at Yahoo. Topics include scale (43K servers in 20 YARN clusters, 600PB of storage, and 33 million jobs per month) and which systems they use (HBase, Spark, Oozie, Pig, Storm, Tez, Hive). Yahoo's Hadoop usage is over half Pig and only a small percentage (1%) Spark. Interestingly, though, Yahoo expects to be off of MapReduce by the end of the year (in favor of Tez or Spark).


Apache Samza 0.9.1 was released on the 13th. This is a bug-fix release that resolves 7 tickets.

Apache Cassandra 2.2.0 was released this week. This release adds Windows as a supported platform, support for roles in authentication and authorization APIs, off-heap row-cache, and much more. The NEWS.txt in the announcement includes upgrade instructions.

Version 2.3 of the Hortonworks Data Platform was released this week. The release contains new versions of nearly all components and adds Atlas and Cloudbreak to the list of supported projects. The Hortonworks blog has highlights of the new features in Spark, Hive, stream processing, and more.

Astro is an open-source project from Huawei that adds support for HBase to Spark SQL. Astro requires Spark 1.4.0.

Spree is a new project that provides an alternative, auto-updating web UI for Spark. A custom DAGScheduler listener forwards information about individual tasks from Spark to slim, a Node.js server that writes to Mongo. Spree uses the Meteor web framework and React to auto-update the UI.


Curated by Datadog



Real-time Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames (San Francisco) - Tuesday, July 28

Spark and Verizon (San Jose) - Tuesday, July 28

Apache Ambari and Its Role in the Open Data Platform (Palo Alto) - Wednesday, July 29

Twitter Heron: Stream Processing at Scale (San Francisco) - Thursday, July 30

The Future of Data, with Doug Cutting (Sunnyvale) - Thursday, July 30


Spark 101 (Saint Paul) - Thursday, July 30


Reactive Stream Processing with Kafka-Rx (Chicago) - Tuesday, July 28


Hands-on Demonstration of Apache Drill (Madison) - Tuesday, July 28


Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, July 27


Apache Spark: What Is All the Hype About? (Saint Petersburg) - Wednesday, July 29

North Carolina

F#/Analytics: Much Ado about Hadoop (Cary) - Tuesday, July 28

Best Practices on Building a Hadoop Data Lake Solution (Charlotte) - Wednesday, July 29

New York

Double Feature: Hadoop & Java 8 Stream Debugging (New York) - Monday, July 27


Data Crunching with Spark + Data Warehouses in Business (Bogota) - Thursday, July 30


Real-time Stream Processing with Batch Analytics (Manchester) - Wednesday, July 29


Flink 0.10 & Comparing Flink to Other Streaming Systems (Berlin) - Wednesday, July 29


Hadoop Ecosystem (Haifa) - Tuesday, July 28


Apache Spark: An Open-Source Revolution Comes to Big Data Analytics (Bangalore) - Friday, July 31

A Deep Dive into Spark Dataframe API (Bangalore) - Saturday, August 1


Spark 1.4 Announcement + Spark Summit Recap + Tableau Spark Driver (Sydney) - Monday, July 27