Data Eng Weekly

Hadoop Weekly Issue #135

23 August 2015

As more companies adopt Spark, several are using it without a Hadoop cluster. An article in this week's issue describes how Grammarly runs Spark on EC2 (with S3 for storage), and Mesosphere has announced a new product that marries Spark, Mesos, Cassandra, and more (but no HDFS). On the topic of Mesos, there are also articles this week on Chronos (scheduler for Mesos) and running Mesos on a CDH cluster. In addition to these, there are plenty of other great links this week, such as the slides describing the Hadoop shell script rewrite and the BigTop 1.0.0 release announcement.


The latest of the Call me Maybe series looks at Chronos, which is a task scheduler for Mesos clusters. The series evaluates distributed systems in the presence of failure using the Jepsen framework. Even for folks without a lot of background knowledge of Mesos or Chronos, it's easy to follow the post and understand the results.

The team at Grammarly recently switched from Amazon EMR with Pig to Apache Spark (using the spark-ec2 scripts). This post describes their migration and a number of fixes to issues that they ran into while deploying Spark on EC2. Several of the fixes relate to the S3 Hadoop FileSystem implementation, but there are also a few related to memory management. Overall, it's a fantastic overview for anyone running Spark... but particularly so if you're on EC2.

This post is a quick introduction to Apache Gora, which is a framework to serialize in-memory data models for storage in systems like HBase, Cassandra, and MapReduce. The post covers Gora's main components, Gora's features, and the datastores supported by Gora.

This week, the morning paper covered articles on Google Cloud Dataflow, Apache Flink streaming, and Google's Millwheel stream processing system. As usual, each post has an extended summary of a paper, which provides enough background to understand the key contributions of the paper.

Hive has a few different types of User Defined Functions, which have been covered in a series on the Beekeeper Data blog. This is the third post in the series, and it covers the User Defined Aggregate Function (UDAF) version of the API. For the walkthrough, the post implements a method to sum and count the total number of letters in a column of strings. The are a few subtleties of the API, which are explained alongside example code.

This post describes a number of important algorithms and concepts that are often used in streaming algorithms. The topics covered are hashing and sketching, hyper log log, count min sketch, streaming k-means, and quantiles via t-digest. Each of the concepts is explained both in writing and graphically.

The Cloudera blog has a guest post from Big Industries that details how to run Apache Mesos using Cloudera Manager. To achieve this, Big Industries has built a Custom Service Descriptor and published a parcel repository. After configuring Cloudera Manager for these, it's fairly easy to launch an application from a docker image via the Mesos/Marathon integration. All of the source code is available on github.

Spotify has been moving a number of their big data applications from python to Scala. This presentation describes the tools and libraries (such as scalding, annoy, and bloom filters) behind recommendations and other big data applications at Spotify.

If you've ever battled with the Hadoop shell code, you're probably familiar with how difficult and obtuse it can be (even getting the help text has been famously difficult). This presentation describes the rewrite of the Hadoop shell code which was committed to trunk just over a year ago (but hasn't yet been part of a release). The rewrite unifies a lot of functionality/scripts, improves usability and operability, and adds a number of new features like .hadooprc for client config, the ability to customize internal functions (like log rotation), and support for starting secure daemons as non-root users.

The Hortonworks blog has a post on fault tolerance for Nimbus, which is the daemon responsible for managing topologies in an Apache Storm cluster. The post describes the active-standby Nimbus architecture across the areas of leader election, state storage, and leader discovery.

With Apache Drill in standalone mode, it's incredibly easy to convert a CSV file to another format like Apache Parquet. This walkthrough describes how to do just that—it's a simple query that scales with the number of columns in the input CSV dataset.

Hue is a web interface for Hadoop, and it's recently gained support for Spark. The Spark application includes interactive querying with Scala, Python, and R using the Livy job submission server. The Hue team has published slides about the architecture of the new integration, which is in its third iteration.


This post has an interesting visualization of all of Hadoop releases over time. From the graphics, it's easy to see a few pattern. For example, release velocity slowed in 2009, and there were some big lapses in between releases (e.g. 0.21.0 to

This post articulates concerns about the community of some Hadoop-related projects, which are dominated by a single vendor.

MapR has been available on Amazon Web Services via Amazon EMR for some time, but this week MapR added the ability to deploy via the AWS Marketplace. Using the Marketplace application, it's possible to start a cluster from pre-built Amazon Machine Images and take advantages of features not supported in EMR.

Forbes has a look at Confluent, the commercial company behind Apache Kafka. The article talks about the origins of the technology at LinkedIn, some of the companies using Kafka, several Confluent customers, and the next release of the Confluent Platform (scheduled for October).

An article on ZDNet highlights "five open-source big data projects to watch." There's a quick blurb on each of them: Apache Flink, Apache Samza, Ibis, Apache Twill (incubating), and Apache Mahout.

Spark Summit EU is October 27-29 in Amsterdam. Day 1 is training, and days 2 and 3 are presentations.


Apache Bigtop 1.0.0 was released this week. The release is based on Hadoop 2.6.0, includes HBase 0.98.12, Ignite 1.2.0, Spark 1.3.1, and much more.

Mesosphere Infinity is a new product focussed at stream processing and big data applications. It's powered by Mesos, Spark, Kafka, Cassandra and Akka. Notably, the system uses Cassandra for persistence rather than HDFS.

Cloudera Enterprise 5.4.5 was released this week. It includes fixes for Impala, Parquet support, MapReduce, and more. There are also several fixes to Cloudera Manager.


Curated by Datadog ( )



Why Understanding How Spark Is Built Helps the Community (San Francisco) - Tuesday, August 25

Spark and Hadoop in Production: Real Life Stories (San Francisco) - Wednesday, August 26

Apache Ignite: In-Memory Data Fabric (Oakland) - Wednesday, August 26

Come Learn about PigPen: Data Engineering with Clojure and Hadoop (Los Gatos) - Thursday, August 27

Distributed Stream and Graph Processing with Apache Flink (San Jose) - Thursday, August 27


What's New with Hortonworks 2.3, Product Review and IoT Demo (Arlington) - Tuesday, August 25


Apache Drill (Saint Paul) - Thursday, August 27

North Carolina

Overview of the Upcoming Apache Atlas Project (Charlotte) - Wednesday, August 26


Is Apache Spark Ready for the Cloud? (Vienna) - Thursday, August 27


Hadoop in Industry + Example of Akka and Spark (Bogota) - Thursday, August 27


Flink Meetup #10: Cluster Deployment on Docker, Flink/Gelly, Community Update (Berlin) - Wednesday, August 26


End of Chaos in the Hadoop Cluster (Warsaw) - Wednesday, August 26


Hands on with Apache Spark (Cairo) - Saturday, August 29


Spark Meetup (Shanghai) - Saturday, August 29

Data Technology Meetup #2 (Nanjing) - Sunday, August 30


MLDM Monday: Big Data Series (Taipei) - Monday, August 24


Apache Spark 101 and Smart Cities and Big Data (Sydney) - Monday, August 24