Data Eng Weekly

Hadoop Weekly Issue #129

12 July 2015

One of the themes of this week's issue is new (at least to this newsletter) libraries and projects. These include a Rust implementation of timely dataflow, PipelineDB, YCSB, GrepPage, Dask, and ZKTraffic. Spark is also a major topic of coverage this week: there are several technical articles, the new SparkHub, and an announcement by Microsoft that Spark support for Azure HDInsight is in public preview. Finally, there are some high-profile releases, including Apache Drill 1.1 and Apache Hadoop 2.7.1, and a promotion from O'Reilly for Strata + Hadoop World.


This post aims to fill in the details of running Spark on Amazon EMR. It covers versioning, job submission, Spark executor configuration, and a complete Spark submission example using the AWS CLI client.

InfoQ has an interview with Martin Kleppmann, Apache Samza committer and author of "Designing Data-Intensive Applications." The interview covers topics like logs, databases, Samza, consensus, and more. The video and the transcript of the interview are both available.

PMML is a way to represent machine learning models using XML for cross-library compatibility. Apache Spark 1.4 introduced support for PMML for linear regression and k-means clustering. The Databricks blog has more details and an example of using PMML via Spark.
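
For readers who haven't seen PMML, here is a rough sketch of what such a document looks like for a linear regression. It is hand-built with Python's standard library rather than exported from Spark, the helper name `linear_model_to_pmml` and the field names are hypothetical, and the output is a simplified illustration that has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET


def linear_model_to_pmml(intercept, coefficients, target="y"):
    """Serialize a linear regression as a minimal PMML-style XML string.

    Hypothetical helper for illustration; not Spark's export code.
    """
    root = ET.Element("PMML", version="4.2", xmlns="http://www.dmg.org/PMML-4_2")
    ET.SubElement(root, "Header", description="linear regression (illustrative)")

    # Declare every input field plus the target as a continuous double.
    data_dict = ET.SubElement(root, "DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(data_dict, "DataField", name=name,
                      optype="continuous", dataType="double")

    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name)
    ET.SubElement(schema, "MiningField", name=target, usageType="target")

    # One RegressionTable holds the intercept and one predictor per feature.
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=str(coef))

    return ET.tostring(root, encoding="unicode")


# Made-up model parameters, for illustration only.
pmml = linear_model_to_pmml(1.5, {"x1": 2.5, "x2": -0.75})
print(pmml)
```

Because the model is plain XML, any library that reads PMML can in principle score with it, which is the cross-library compatibility the format is after.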

The Databricks blog has an overview of the new visualization features added for Spark Streaming in Spark 1.4.0. These include timeline visualizations of events/sec, histograms of the number of events per batch, information on scheduling/processing time, more detail on individual batches, and visualization of a stream DAG.

An article on InfoWorld has some solid, practical advice for choosing a computation framework for Hadoop. It covers Spark, Storm, Tez, MapReduce, and Flink. It argues that Spark should be the default choice, Storm or Flink are most appropriate for latency-sensitive or per-event computations, Tez/MapReduce (or Cascading) are useful when Spark doesn't scale, and Flink can satisfy some niche use cases.

This post introduces a Spark tool for evaluating the quality of data in a DataFrame. The code evaluates several criteria on the DataFrame, like the number of null/empty values, the number of unique values, and the top-N most frequently appearing values. The post walks through how each of these is calculated with the DataFrame API, which serves as a good non-trivial introduction to DataFrames.
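
To make the three criteria concrete, here is a minimal sketch of the same checks on a single column in plain Python. It is not the tool from the post and does not use the DataFrame API; the function name `profile_column` and the sample data are made up for illustration.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Summarize one column: missing count, distinct count, and top-N values."""

    # Treat None and empty or whitespace-only strings as missing.
    def is_missing(v):
        return v is None or (isinstance(v, str) and v.strip() == "")

    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "unique": len(counts),
        "top": counts.most_common(top_n),
    }


# Hypothetical sample column, invented for illustration.
cities = ["NYC", "SF", None, "NYC", "", "LA", "NYC", "SF"]
report = profile_column(cities, top_n=2)
print(report)
```

On a cluster, the same three numbers would come from aggregations over the DataFrame rather than from an in-memory list, but the definitions of "missing", "unique", and "top-N" carry over.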

This tutorial walks through configuring a Spark cluster in EC2 with support for RStudio. Spark 1.4.0 includes SparkR, which can be loaded as a library within RStudio after a few configuration changes to Spark.

This post explores the space between two recent papers, which argued that 1) most distributed computations are CPU-bound rather than I/O-bound and 2) in many cases, a single-threaded implementation performs better than a distributed system. Specifically, it looks at computing PageRank using a Rust implementation of timely dataflow. The results replicate the findings of 1) and 2) for Spark's GraphX, but show that the Rust implementation sees significant benefits when switching from a 1G to a 10G network.
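
For context on what a single-threaded baseline computes, here is a minimal sketch of PageRank by power iteration in plain Python. It is not the Rust timely dataflow code from the post; the damping factor of 0.85, the iteration count, and the tiny example graph are assumptions chosen for illustration.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Single-threaded PageRank by power iteration over an edge list."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    ranks = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_ranks = [base] * num_nodes
        # Each edge passes an equal share of its source's rank to its target.
        for src, dst in edges:
            new_ranks[dst] += damping * ranks[src] / out_degree[src]
        ranks = new_ranks
    return ranks


# Hypothetical 4-node graph: a cycle 0 -> 1 -> 2 -> 0, plus node 3 pointing at 0.
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = pagerank(edges, num_nodes=4)
print(ranks)
```

Each iteration is one pass over the edge list, which is why a tight single-threaded loop is a useful yardstick for distributed systems that pay coordination and network costs for the same pass.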

The MapR blog has a post about Drill and Parquet's support for nested data structures (lists, structs, etc.) that aren't supported by a traditional RDBMS. The post shows that this support can shrink the number of tables, simplify queries (by eliminating joins), and improve interoperability with NoSQL databases like HBase.
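
The point about eliminating joins can be sketched without Drill at all. The plain-Python example below computes the same total from a two-table layout and from a single nested layout; the order data and function names are made up for illustration, and neither Drill SQL nor Parquet is involved.

```python
# Relational layout: orders and line items live in separate tables.
orders_flat = [
    {"order_id": 1, "customer": "ana"},
    {"order_id": 2, "customer": "bo"},
]
items_flat = [
    {"order_id": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "sku": "B", "qty": 1},
    {"order_id": 2, "sku": "A", "qty": 5},
]

# Nested layout: each order carries its line items as a list of structs.
orders_nested = [
    {"order_id": 1, "customer": "ana",
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "customer": "bo",
     "items": [{"sku": "A", "qty": 5}]},
]


def qty_by_customer_join(orders, items):
    """Two tables: look up each item's order to find its customer (a join)."""
    customer_of = {o["order_id"]: o["customer"] for o in orders}
    totals = {}
    for item in items:
        customer = customer_of[item["order_id"]]
        totals[customer] = totals.get(customer, 0) + item["qty"]
    return totals


def qty_by_customer_nested(orders):
    """One table: the items are already attached to their order, so no join."""
    totals = {}
    for order in orders:
        qty = sum(item["qty"] for item in order["items"])
        totals[order["customer"]] = totals.get(order["customer"], 0) + qty
    return totals


joined = qty_by_customer_join(orders_flat, items_flat)
nested = qty_by_customer_nested(orders_nested)
print(joined, nested)
```

Both functions return the same totals; the nested version simply never needs the lookup table that stands in for the join.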

The LA Big Data User Group recently hosted Reynold Xin of Databricks for a presentation on Spark's DataFrame API. The slides and video, which cover Spark DataFrame basics, Python/R support, and more, have been posted.


Pivotal is running the Ambitious Apps at Amazing Scale Hackathon for the Apache Geode project (which is the open source version of Pivotal GemFire). Geode is a distributed, in-memory database, and it features integrations with Spark and HDFS.

This post discusses several ways that Hadoop can be deployed for multi-tenancy within an organization. It argues that shared infrastructure is the best solution (for cost and efficiency reasons), discusses several of the security and governance challenges for a shared cluster, and proposes that virtualization can be useful when building a multi-tenant cluster.

In a post on the Index Ventures blog that announces their lead investment in Confluent's Series B financing, Mike Volpi argues that Kafka is at the heart of big data plumbing. To this end, it compares the tools that Cisco built to provide plumbing for the network to what Confluent is building.

Fortune has a summary of a recently released report by Ovum, which claims that big data software is expected to grow 50% by 2019. The article discusses how this doesn't quite match up with what some other analysts have been saying about the industry.

The DBMS2 blog has an inside look at Zoomdata, who build a visual analytics tool for big data. The post contains some interesting details of how they interact with several data sources and tie everything together with Spark.

Teradata has announced the next version of their Hadoop appliance, which adds support for Cloudera's CDH in addition to Hortonworks' HDP.

SparkHub is a new repository of talks, articles, packages, and events related to Spark.

A new book on Accumulo is available for digital download and ships later this month. The book covers Accumulo's architecture, APIs, server-side functionality, internals, administration, and more.


O'Reilly is offering readers of Hadoop Weekly a 20% discount on any pass to the upcoming Strata + Hadoop World with discount code HADOOPW. The conference takes place September 29 - October 1st in New York. See the link below for the agenda and speaker lineup.

In addition, Hadoop Weekly subscribers can enter a raffle for a free Bronze pass to Strata+Hadoop World. This pass gives access to all sessions, all keynotes, and more. To enter, visit the link below by July 22nd and provide your email address.


PipelineDB is a new open-source, streaming SQL database. PipelineDB's core is built on PostgreSQL, and thus PostgreSQL-compatible clients can be used to insert and query data.

Apache Drill 1.1 was released last week with support for automatically partitioning Parquet files, SQL window (aggregate and ranking) functions, enhancements to the Hive storage plugin, improved JDBC compatibility, and more.

MapR has released an Amazon EMR bootstrap action for provisioning Apache Drill as part of an EMR cluster. After adding the action, the Drill console is available on the master node and any node can act as a Drill CLI client.

Apache Accumulo 1.6.3 was released this week. The release contains bug fixes (4 of which were severe), performance improvements, and additional testing. More details of the release are on the Accumulo website.

The Yahoo! Cloud Serving Benchmark (aka YCSB) released version 0.2.0 this week. This is the first release in over 3 years of the project, which aims to help evaluate key-value and other serving stores. It includes verified bindings for Accumulo, Cassandra, HBase, MongoDB, and Tarantool. There are also untested bindings for Couchbase, DynamoDB, ElasticSearch, Gemfire, HyperTable, Infinispan, JDBC, OrientDB, and Redis. There are a lot of changes in this release, so be sure to check out the full release notes.

Apache Hadoop 2.7.1, the first stable version in the 2.7.x line, was released this week. The new version resolves 131 issues since the 2.7.0 release.

GrepPage is a new website providing a search engine for many common commands for Hadoop and Hive. Type in a partial command and get back an example of the full command.

Dask is a Python tool for parallel computing on multi-core or distributed systems. Like many other tools, it provides a dataframe abstraction. Dask was originally written for parallelizing workflows on a single machine, so it might work better than, e.g., PySpark for data that fits on a single machine (while still offering the flexibility to scale to multiple instances if needed).

Microsoft has announced a public preview for Apache Spark on Azure HDInsight (their Hadoop-as-a-Service offering). Spark on HDInsight is integrated with Zeppelin and Jupyter for notebook-style development, and it's also integrated with Power BI, Microsoft's cloud-based visualization tool.

Cloudera has announced CDH 5.4.4, which fixes critical bugs with Hue and HiveServer2.

ZKTraffic is a tool for analyzing ZooKeeper traffic (like iptraf, trop) and gathering stats. ZKTraffic can also be run as a daemon, and it provides a JSON HTTP endpoint with stats.


Curated by Datadog



Stream Processing at Scale (Palo Alto) - Thursday, July 16


Breaking ETL Barrier with Spark Streaming (Bellevue) - Wednesday, July 15

Accelerating Hadoop Projects with the Cask Data Application Platform (Bellevue) - Wednesday, July 15


GraphX Tutorial, Algorithms, and Applications (Denver) - Tuesday, July 14


Data Mining with Apache Spark (Salt Lake City) - Wednesday, July 15


Apache Ambari by Ravi Mutyala (Houston) - Tuesday, July 14

New York

Spotify’s Music Recommendations Lambda Architecture (New York) - Monday, July 13

Elastic Analytics with Spark, Mesos, and Docker (New York) - Tuesday, July 14


Big Data Analytics with Apache Solr (Boston) - Tuesday, July 14


HBase, Meet BigTable (London) - Tuesday, July 14


Introduction Into Apache Spark (Den Haag) - Monday, July 13


Apache Flink (Hamburg) - Thursday, July 16


New ML Frameworks for Large Scale Data Science in Spark (Bangalore) - Saturday, July 18


Spark Meetup (Shanghai) - Saturday, July 18


Spark Summit Recap + R on Spark + Tableau Spark Driver Demo (Melbourne) - Monday, July 13