Data Eng Weekly

Hadoop Weekly Issue #143

25 October 2015

There are several articles this week showing Spark's versatility for data analysis—from geospatial indexing with Magellan to using Spark with OCR libraries to Spark with Redshift. And speaking of Spark, Spark Summit Europe is this week in Amsterdam. Also, there's an interview with Arun Murthy about the future of Hadoop, and Luigi hit version 2.0.0.


InMobi has capped off a four-part blog post series about Real-Time Stream Processing. Throughout the series, which focuses on comparing Storm and Spark Streaming, they've described the use cases at InMobi, given an overview of the two platforms, described various evaluation criteria, and presented their findings and recommendation. Citing Storm's maturity (and many other differences), they have chosen Storm for their internal systems.

The Cloudera blog has a post describing how to build a Spark job to index scanned documents and store results in HBase. The actual Spark job is short and easy to understand, so the majority of the post is devoted to background, installation details (OCR libraries require some native system libraries), and how to configure Cloudera Search to index extracted data.

Hortonworks has a tutorial for Magellan, which is a Spark package for geospatial analytics. To demonstrate some of Magellan's features, the tutorial joins data about Uber trips with neighborhood data (after some transformations) to find which neighborhoods are most popular for trips.

To demonstrate Spark ML pipelines and their Spark Extensions Library, Collective has written a guest post on the Databricks blog about Audience Modeling. Much of the post is devoted to how they use the Google S2 Geometry Library for building geo features, and it wraps up by showing how to use the ML pipelines API to train a model.

This post on Scalding is the latest in a series describing how to implement an outer join in various Hadoop frameworks. It serves as a good introduction to using Scalding for a non-trivial task and how to test a Scalding job.

This post describes how to use the Apache Ambari REST API to fetch and update the configuration for a cluster, which is a great way to automate configuration changes.
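As a rough illustration of that fetch-then-update flow, here is a minimal Python sketch. It only builds the URLs and the JSON body and does not contact a cluster. The helper names, the example host, the cluster name, and the property values are all made up for illustration, and the endpoint paths and payload shape reflect my understanding of the Ambari v1 REST API rather than anything taken from the post, so check them against your Ambari version.

```python
import time


def desired_configs_url(base_url, cluster):
    """URL listing the (type, tag) pairs currently applied to the cluster."""
    return f"{base_url}/api/v1/clusters/{cluster}?fields=Clusters/desired_configs"


def config_url(base_url, cluster, config_type, tag):
    """URL returning the properties stored under one config type and tag."""
    return (f"{base_url}/api/v1/clusters/{cluster}/configurations"
            f"?type={config_type}&tag={tag}")


def build_update(config_type, current_properties, changes, tag=None):
    """Build the body for a PUT to /api/v1/clusters/<cluster>.

    The full property set is sent under a new tag, so the fetched
    properties are merged with the changes rather than sending only
    the changed keys.
    """
    properties = dict(current_properties)
    properties.update(changes)
    new_tag = tag or f"version{int(time.time() * 1000)}"  # hypothetical tag scheme
    return {
        "Clusters": {
            "desired_config": {
                "type": config_type,
                "tag": new_tag,
                "properties": properties,
            }
        }
    }


# Example with made-up values.
base = "http://ambari.example.com:8080"
current = {"dfs.replication": "3", "dfs.blocksize": "134217728"}
payload = build_update("hdfs-site", current, {"dfs.replication": "2"}, tag="version2")
```

In a real script the body would be sent with HTTP basic auth and an `X-Requested-By` header, which Ambari expects on modifying requests.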

The Livy Job Server provides a REST API for interacting with Spark. Livy supports running a job by specifying a jar in HDFS, a main-class, and job arguments in an HTTP POST. After submitting, output can be fetched using the API. The overview also describes how to run Spark Streaming and Python jobs using this API.
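A minimal sketch of what such a submission could look like from Python, using only the standard library. It builds the request without sending it. The `/batches` endpoint and the `file`, `className`, and `args` field names are my understanding of Livy's batch API and may differ between versions; the host, port, jar path, and class name are invented for the example.

```python
import json
import urllib.request


def build_batch_request(livy_url, jar_path, main_class, args):
    """Build (but do not send) a POST that asks Livy to run a jar from HDFS."""
    body = {
        "file": jar_path,          # jar location in HDFS
        "className": main_class,   # main class inside the jar
        "args": list(args),        # job arguments
    }
    return urllib.request.Request(
        url=f"{livy_url}/batches",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def batch_log_url(livy_url, batch_id):
    """URL for fetching the output of a submitted batch (assumed endpoint)."""
    return f"{livy_url}/batches/{batch_id}/log"


# Example with made-up values.
request = build_batch_request(
    "http://livy.example.com:8998",
    "hdfs:///jobs/wordcount.jar",
    "com.example.WordCount",
    ["/data/in", "/data/out"],
)
# To actually submit: urllib.request.urlopen(request)
```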

The MapR blog has a post introducing the new analytic and window functions in the Apache Drill 1.2 release. The post uses Yelp review data to illustrate OVER, FIRST_VALUE, LAST_VALUE, LAG, LEAD, and more.
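For readers new to window functions, here is a small sketch of what OVER, LAG, LEAD, and FIRST_VALUE compute. It runs against Python's built-in SQLite (3.25 or newer) rather than Drill, and the table and rows are invented stand-ins for review data, so treat it as an illustration of the SQL semantics only, not of the post's queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (business TEXT, review_date TEXT, stars INTEGER)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("cafe", "2015-01-01", 3),
        ("cafe", "2015-02-01", 5),
        ("cafe", "2015-03-01", 4),
        ("diner", "2015-01-15", 2),
        ("diner", "2015-02-15", 4),
    ],
)

# Each function looks at other rows in the same business, ordered by date:
# LAG = previous row, LEAD = next row, FIRST_VALUE = first row of the window.
query = """
SELECT business, review_date, stars,
       LAG(stars)         OVER (PARTITION BY business ORDER BY review_date) AS previous_stars,
       LEAD(stars)        OVER (PARTITION BY business ORDER BY review_date) AS next_stars,
       FIRST_VALUE(stars) OVER (PARTITION BY business ORDER BY review_date) AS first_stars
FROM reviews
ORDER BY business, review_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Rows at the edge of a partition have no previous or next row, so LAG and LEAD return NULL there (`None` in Python).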

The IBM Hadoop Dev blog has a post on the HDFS trash. It describes the important settings, checkpoints, how to empty the trash, and best practices for a production cluster.


Spark Summit Europe is this week in Amsterdam. The full schedule is available online, and there is a live stream on Wednesday and Thursday.

SD Times has an interview with Arun Murthy of Hortonworks on the future of Hadoop. The whole article is worth reading, but a few things stood out—Arun predicts that YARN and HDFS will become "POSIX for the data world" and he is thinking a lot about how to improve the process of writing apps for YARN (both Docker and Apache Slider are mentioned in this context).

The Hortonworks blog has a recap of the recent Apache Ambari Hackfest. The winning projects were an Ambari Cassandra Service, a Catalog Service for Ambari, and an Ambari Service Deployer. The post has links with more details on these and other projects from the event.


spark-redshift is a new package that provides a Spark DataSource for Amazon Redshift. For large data sets, the package offers better performance than the JDBC drivers, particularly when loading data into Redshift (by using the Redshift COPY command rather than JDBC INSERTs).

Splice Machine has announced version 1.5 of their Hadoop RDBMS. The new version improves SQL compliance and compatibility with BI tools, adds incremental backup support, provides new window functions for analytical queries, and improves performance.

Cloudera Enterprise 5.3.8 is a new bug fix release addressing issues in CDH and Cloudera Manager.

Version 2.0.0 of Luigi, the data workflow system, was released. It includes a new visualizer, a new summary at the end of CLI runs, support for Amazon EC2 Container Service, speedups to scheduling, and more.


Curated by Datadog



Accelerating Hadoop Projects with the Cask Data Application Platform (San Francisco) - Tuesday, October 27

Deep Learning Architecture Using Tachyon and Spark + Tachyon New Features (Sunnyvale) - Wednesday, October 28

Special Event with Kostas and Stephan: Updates on Flink Forward 2015 (Santa Clara) - Wednesday, October 28

An Evening with Jay Kreps, Author of Apache Kafka, Samza, Voldemort & Azkaban (Playa Vista) - Friday, October 30


Cask Data + Azure Data Lake (Seattle) - Wednesday, October 28


IoT with Spark Streaming & Hadoop with Kudu (Austin) - Monday, October 26


Architecture of Flink's Streaming Runtime & Flink at Capital One (Chicago) - Thursday, October 29


NiFi, Zeppelin, HBase (Dublin) - Tuesday, October 27

North Carolina

Hadoop Data Security with Apache Ranger (Durham) - Tuesday, October 27

New York

Mesos @ Bloomberg (New York) - Monday, October 26


Felipe Hoffa on Massive Public Data Sets and Google BigQuery (Cambridge) - Tuesday, October 27


Spark Meetup with Databricks and IBM (Paris) - Monday, October 26

#8 Big Data & Hadoop (Toulouse) - Wednesday, October 28

Apache Flink: A Simple and Practical Introduction! (Paris) - Thursday, October 29


Spark after Dark with Chris Fregly & IBM (Brussels) - Friday, October 30


Pre-Summit Event: Factorization Machines in Spark and TBA (Amsterdam) - Tuesday, October 27

All-Star Cassandra/Spark Night at ING (Amsterdam) - Wednesday, October 28


Spark Introduction (Warsaw) - Wednesday, October 28


Apache Spark In-Memory Data Processing (Perth) - Wednesday, October 28