Data Eng Weekly

Hadoop Weekly Issue #189

02 October 2016

Strata + Hadoop World was this week, so this issue is full of news and releases. Highlights include a new version of the ODPi runtime, a new R interface for Spark, and a new version of the Confluent Platform with more enterprise features. In technical articles, there's a great overview of best practices for long-running Spark Streaming jobs on YARN, and an introduction to a new graph computation framework from the folks at Berkeley AMPLab.


GraphTau is a new programming model for graph computation on changing graphs. Developed by the folks at Berkeley AMPLab, it's built on Spark's GraphX. Using a "pause-shift-resume" pattern that takes advantage of graph snapshots, it can greatly reduce the amount of computation needed when a graph changes.
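To make the "pause-shift-resume" idea concrete, here is a minimal sketch in plain Python. It is not GraphTau's or GraphX's API. The `pagerank` function, the tiny example snapshots, and the tolerance are all assumptions made for illustration. The idea shown: an iterative computation such as PageRank is stopped on one snapshot, moved to the next snapshot, and resumed from the ranks it already has rather than from scratch.

```python
from typing import Dict, List, Optional, Tuple

# A graph snapshot: node -> list of outgoing neighbours.
# Every neighbour must also appear as a key.
Graph = Dict[str, List[str]]


def pagerank(
    graph: Graph,
    start: Optional[Dict[str, float]] = None,
    damping: float = 0.85,
    tol: float = 1e-8,
    max_iter: int = 1000,
) -> Tuple[Dict[str, float], int]:
    """Power-iteration PageRank that can resume from earlier ranks.

    Returns the ranks and the number of iterations that were run.
    """
    nodes = sorted(graph)
    n = len(nodes)
    # "Resume": reuse ranks from a previous snapshot where available;
    # nodes that are new in this snapshot get a uniform share.
    previous = start or {}
    ranks = {v: previous.get(v, 1.0 / n) for v in nodes}
    total = sum(ranks.values())
    ranks = {v: r / total for v, r in ranks.items()}

    for iteration in range(1, max_iter + 1):
        # Rank held by nodes without out-edges is spread over all nodes.
        dangling = sum(ranks[v] for v in nodes if not graph[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new[w] += damping * ranks[v] / len(graph[v])
        delta = sum(abs(new[v] - ranks[v]) for v in nodes)
        ranks = new
        if delta < tol:
            return ranks, iteration
    return ranks, max_iter


# Snapshot 1, then snapshot 2 with one extra edge (c -> b).
snapshot_1: Graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
snapshot_2: Graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": ["a"]}

ranks_1, _ = pagerank(snapshot_1)                          # compute, then pause
warm, warm_iters = pagerank(snapshot_2, start=ranks_1)     # shift and resume
cold, cold_iters = pagerank(snapshot_2)                    # recompute from scratch
```

Both runs on the second snapshot arrive at the same ranks. Comparing `warm_iters` with `cold_iters` shows how much work the resumed run needed for this particular change; the savings depend on how much the graph changed.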

Cloudera has written about the current state of using Apache Kudu as a backend to Apache Impala (incubating), which includes support for CREATE, DROP, INSERT, UPDATE, and more.

This guide provides a walkthrough of building a Kafka cluster on AWS using CloudFormation and writing a Spark Streaming job that runs on Amazon EMR to analyze the data in Kafka.

Apache MADlib (incubating) is a library for SQL-based machine learning that supports Apache HAWQ (incubating), PostgreSQL, and others. Version 1.9.1 was recently released, with support for pivot, sessionization, and prediction metrics. The Pivotal blog has details on how to use these three new features.
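For readers who haven't met the term, sessionization means grouping a user's events into sessions separated by periods of inactivity. The sketch below shows the concept in plain Python. It is not MADlib's SQL interface; the `sessionize` function, the event tuples, and the 600-second gap are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sessionize(
    events: List[Tuple[str, int]], max_gap: int
) -> List[Tuple[str, int, int]]:
    """Assign a per-user session id to each (user, timestamp) event.

    A new session starts whenever the gap between two consecutive
    events of the same user is larger than max_gap seconds.
    """
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    result: List[Tuple[str, int, int]] = []
    for user in sorted(by_user):
        session_id = 0
        previous = None
        for ts in sorted(by_user[user]):
            if previous is None or ts - previous > max_gap:
                session_id += 1  # first event, or idle for too long
            previous = ts
            result.append((user, ts, session_id))
    return result


clicks = [("a", 0), ("a", 100), ("b", 50), ("a", 2000)]
sessions = sessionize(clicks, max_gap=600)
```

With these inputs, user `a` ends up with two sessions (the event at 2000 comes more than 600 seconds after the one at 100) and user `b` with one.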

The IBM developer blog has distilled the process of enabling security for Hadoop web interfaces to a few steps. This post summarizes them and also discusses a couple of other configuration options for this setup.

This post provides a fantastic overview of the practical considerations of using YARN for long-running Spark Streaming jobs. It covers the necessary command-line options for spark-submit to keep a long-running job alive, suggestions for YARN queue configuration, details on configuring Kerberos ticket refresh, logging and monitoring suggestions (and example configs for ELK and Graphite), and details on implementing graceful shutdown.
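As a rough idea of what such a submission can look like, the sketch below assembles a spark-submit command line in Python. The flags and property names are standard Spark-on-YARN settings, but the values, the helper name `long_running_submit_args`, and the example jar, queue, principal, and keytab are assumptions for illustration and are not taken from the post. Check them against your Spark version before relying on them.

```python
from typing import List


def long_running_submit_args(
    app_jar: str, queue: str, principal: str, keytab: str
) -> List[str]:
    """Build a spark-submit argument list for a long-running streaming job."""
    conf = {
        # Allow several application attempts, and forget old failures
        # so a job that runs for weeks is not killed by rare hiccups.
        "spark.yarn.maxAppAttempts": "4",
        "spark.yarn.am.attemptFailuresValidityInterval": "1h",
        # Tolerate more executor and task failures than the defaults.
        "spark.yarn.max.executor.failures": "16",
        "spark.yarn.executor.failuresValidityInterval": "1h",
        "spark.task.maxFailures": "8",
        # Finish in-flight batches when the JVM receives a shutdown signal.
        "spark.streaming.stopGracefullyOnShutdown": "true",
    }
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", queue,
        # Principal and keytab let Spark renew Kerberos tickets itself.
        "--principal", principal,
        "--keytab", keytab,
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


command = long_running_submit_args(
    app_jar="streaming-app.jar",          # hypothetical jar name
    queue="streaming",                    # hypothetical YARN queue
    principal="etl@EXAMPLE.COM",          # hypothetical principal
    keytab="/etc/security/etl.keytab",    # hypothetical keytab path
)
```

The resulting list can be passed to `subprocess.run(command)` on a machine with spark-submit on its path.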


The SAP acquisition of big data-as-a-service vendor Altiscale has officially been announced.

ODPi announced version 2.0 of its runtime specification for Hadoop distributions. Major changes include the addition of a Hadoop Compatible File System spec (the article lists a number of compatible implementations) and the addition of Apache Hive 1.2.

Strata + Hadoop World was this week in New York. ZDNet has a good summary of the announcements from the event, including those from MapR, Cazena, BlueData, and more.

The Cloudera blog has an overview of a new Apache Incubator project, Spot, which comes from the Open Network Insight (ONI) project. The project is a collection of security tools originally developed by Intel.

Akamai has acquired Concord, maker of the Concord stream processing framework built on Apache Mesos.

Among the vendor announcements from this week, Confluent has announced a new release of Confluent Enterprise that's shipping later this month. The highlights of the release are multi-datacenter replication and automatic data balancing. The introductory blog post describes these features in more detail.


Version 2.3.2 of Luigi, the workflow engine written in Python, was released.

StreamSets has announced version 2.0 of the StreamSets Data Collector. Highlights include support for Oracle CDC, MapR 5.2.0 (and MapR Streams), and integration with StreamSets Dataflow Performance Manager.

Apache Phoenix 4.8.1 was released. It resolves 43 issues, most of them bug fixes.

IBM has announced that Big SQL can now run on Hortonworks HDP in addition to its own distribution, IOP.

RStudio has announced a new open-source project, sparklyr, which is an R interface to Spark. It supports dplyr verbs against Spark tables, SQL queries, Spark MLlib and H2O Sparkling Water integration for machine learning, and additional extensions.

Microsoft has announced that Hortonworks HDP 2.5 with Spark 2.0 and Hive LLAP (Live Long and Process) is now generally available on Azure HDInsight. The release also includes security enhancements: integration with Azure Active Directory and support for transparent encryption at rest.


Curated by Datadog



Robust Stream Processing with Apache Flink (San Francisco) - Wednesday, October 5

Using Spark to Accelerate Big Data at Dollar Shave Club (Marina Del Rey) - Thursday, October 6


Apache Kafka, Stream Processing, and Microservices (Austin) - Tuesday, October 4

Dean Wampler: Why Scala Is Great for Data Science and Engineering (Austin) - Wednesday, October 5


Free Spark Training Session (Chicago) - Tuesday, October 4


Data Science with Apache Spark (Milwaukee) - Tuesday, October 4


Scale Out and Optimize Spark 2.0 (Laurel) - Monday, October 3

New Jersey

Introduction to Alluxio (Formerly Tachyon) - Thursday, October 6

IRELAND

Machine Learning and the Serving Layer + Successful Big Data Architecture (Dublin) - Monday, October 3


Spark v2.0 Workshop (Bucharest) - Friday, October 7


Introduction to Machine Learning with Apache Spark & Apache Zeppelin (L'viv) - Friday, October 7


Learn Distributed Tracing and Data Applications at Twitter (Singapore) - Friday, October 7