Data Eng Weekly

Hadoop Weekly Issue #134

16 August 2015

This week's newsletter is quite short (folks in the northern hemisphere must be enjoying their summer!), but there are a couple of great articles. Specifically, two technical posts give practical advice based on real-world experience. Also, there are a few releases, including a new Gradle plugin for Hadoop that was open-sourced this week by LinkedIn.


The Cloudera blog has a guest post from Barclays about how they moved from SQL to Spark and Scala to improve the computational speed and development workflow for their Insights Engine. The post describes the problem and the solution, and provides a number of tips for working with Scala and Spark: an introduction to functional programming, understanding the resource constraints in Spark, suggestions for efficient memory representations, and more.

This tutorial describes how to migrate data from MySQL to Cassandra using PySpark and the Spark Cassandra connector. In addition to the code required for the migration, the post discusses schema design in Cassandra and explains how to denormalize one of the tables.
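To make the denormalization idea concrete, here is a minimal plain-Python sketch. It is not the tutorial's code and does not use PySpark or the connector; the `users` and `orders` tables and their columns are hypothetical. Because Cassandra has no joins, rows that MySQL would join at read time are pre-joined into one wide row at write time.

```python
# Hypothetical normalized tables, as they might be read out of MySQL.
users = [
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Lin"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 2, "total": 40.0},
    {"order_id": 12, "user_id": 1, "total": 5.5},
]


def denormalize(orders, users):
    """Copy each user's name onto that user's orders (an inner join)."""
    users_by_id = {u["user_id"]: u for u in users}
    wide_rows = []
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            continue  # skip orders with no matching user
        wide_rows.append({**order, "user_name": user["name"]})
    return wide_rows


# Each wide row can be written as-is to a single (hypothetical) Cassandra table.
orders_by_user = denormalize(orders, users)
```

The trade-off is duplicated data (the user's name is repeated on every order) in exchange for reads that touch a single table.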

The SparkOnHBase code, previously part of Cloudera Labs, has been integrated into (an unreleased version of) Apache HBase. This post describes the implementation and API of the new module and discusses some areas of future work.

The Qubole blog has a guest post which describes a recent evaluation of several SQL engines for Hadoop. Unlike many other benchmarks, this one focusses on Hadoop in the cloud. Specifically, they looked at Spark SQL and Presto on four different file formats. As with all benchmarks, it's best to try things out yourself, but in this case they found Spark SQL was the best fit. The post describes the evaluation criteria (which include a few notes specific to Amazon S3) and also why Pearson is using Qubole.

This post gives an overview of Apache Spark DataFrames with example translations from Pandas DataFrames. Regardless of your familiarity with Pandas, the post is a good overview of column projection, adding columns, filtering, aggregation, and windowing operations.
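For readers unfamiliar with either library, the five operations can be illustrated on a tiny in-memory table. The sketch below is plain Python, not the Spark or Pandas API, and the `dept`, `name`, and `sales` columns are made up for illustration; it only shows what each operation computes.

```python
from collections import defaultdict

# A tiny hypothetical table: one dict per row.
rows = [
    {"dept": "a", "name": "x", "sales": 10},
    {"dept": "a", "name": "y", "sales": 30},
    {"dept": "b", "name": "z", "sales": 20},
]

# Column projection: keep only some columns.
projected = [{"dept": r["dept"], "sales": r["sales"]} for r in rows]

# Adding a column: derive a new value from existing ones.
with_doubled = [{**r, "sales_doubled": r["sales"] * 2} for r in rows]

# Filtering: keep only rows that match a predicate.
big_sales = [r for r in rows if r["sales"] >= 20]

# Aggregation: one output value per group.
totals = defaultdict(int)
for r in rows:
    totals[r["dept"]] += r["sales"]
totals = dict(totals)

# Windowing: one output value per row, computed over that row's group.
# Here: a running total of sales within each dept, ordered by sales.
running = defaultdict(int)
windowed = []
for r in sorted(rows, key=lambda r: (r["dept"], r["sales"])):
    running[r["dept"]] += r["sales"]
    windowed.append({**r, "running_total": running[r["dept"]]})
```

The key distinction is between the last two: aggregation collapses each group to a single row, while a window function keeps every row and attaches a per-group result to it.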


Databricks has introduced the Databricks Academic Partners program, which provides free access to the Databricks platform for teaching and research.

InfoWorld has an article describing several common projects for which companies are using Hadoop and Spark. These include specialized analysis, Hadoop as a service, streaming analytics, complex event processing, and streaming ETL.

In a good complement to the previous post, this post describes several concrete examples of real-time applications powered by Spark. These include fraud detection, network security, ad processing, and medical applications.

The HadoopSphere Virtual Conclave, a virtual conference covering Hadoop, Spark, and Tajo, takes place on August 27, 2015.

This post looks at the big data stack at WebTrends, which has adopted a number of the key technologies that have gained momentum over the past year. Specifically, they're running Spark on YARN in the cloud, which has helped them keep costs down and improve performance. The post also talks about some of the security-related features of Spark.


A new release of HP Vertica and the Haven Big Data Platform includes enhanced support for Apache Hadoop and an integration with Apache Kafka. Specifically, the system can run SQL queries directly against data stored in ORCFiles in HDFS and supports ingestion from Kafka for real-time analysis.

Cloudera Director, the system for running Hadoop in the cloud, released version 1.5 this week. The new release adds support for the Google Cloud Platform (and a plugin interface to support additional providers), improved security and customization, and more.

The Google Cloud Dataflow and Cloud Pub/Sub systems are now out of beta and are generally available. Dataflow is a fully managed system for streaming and batch analysis, and Cloud Pub/Sub provides a mechanism to link various services and APIs (including Dataflow).

LinkedIn has open-sourced their Gradle plugin for Hadoop. The plugin and accompanying DSL are useful for developing Hadoop workflows with jobs in various frameworks.

WANdisco Fusion 2.6 was released this week. The new version includes support for network shaping and prioritization for replication across data centers.


O'Reilly is offering readers of Hadoop Weekly a 20% discount on any pass to the upcoming Strata + Hadoop World with discount code HADOOPW. The conference takes place September 29 to October 1 in New York. See the link below for the agenda and speaker lineup.


Curated by Datadog



Large Scale Distributed ML on Spark (Santa Clara) - Thursday, August 20

Spark Streaming & Kafka: The Future of Stream Processing (Santa Monica) - Thursday, August 20


Self-Service Data Exploration and Nested Data Analytics: Introduction to Drill (Denver) - Wednesday, August 19


SOLR and Cloudera Search (St. Louis) - Tuesday, August 18


Practical Tips on Running Spark on Hadoop & Machine Learning in the Wild (Ann Arbor) - Thursday, August 20


Experiences with Spark 1.4 and R (Mason) - Wednesday, August 19


Document Classification on Apache Spark (Atlanta) - Wednesday, August 19


Spark Jeopardy at Zoomdata! (Reston) - Tuesday, August 18


Using Numerical Libraries on Spark (London) - Tuesday, August 18


Introduction Into Apache Spark (Leidschendam) - Tuesday, August 18


A Deep Dive Into Apache Spark Internals (Hyderabad) - Saturday, August 22


Shanghai Big Data Streaming 1st Meetup (Shanghai) - Saturday, August 22

Apache Spark Startup (Xian) - Saturday, August 22


Data-Intensive Applications with Hadoop and Spark (Sydney) - Thursday, August 20