Data Eng Weekly

Hadoop Weekly Issue #136

30 August 2015

Two projects that exemplify the push in the Hadoop ecosystem to continuously improve the status quo are the covered in this week's issue. First, Apache Flink, a project with similar features to Apache Spark that has been gaining momentum (particularly for use-cases where Spark falls short), is covered in several links. Second, there's a great presentation on Apache Phoenix, which aims to improve the usability HBase through SQL access. In other news, Apache Ignite and Lens graduated from the incubator, and Hortonworks announced the acquisition of Onyara.


One of the advantages of the cloud—short-lived, on-demand clusters—violates assumptions in Hadoop's design, which creates complications. For example, the MapReduce and Spark job history servers disappear when an ephemeral cluster terminates. Qubole describes how they've designed a solution to serve job histories via S3 even when the associated cluster has been terminated.

Gelly is Apache Flink's graph-processing library. It supports iterative graph processing using the vertex-centric and grather-sum-apply computation models. It includes built-in support for PageRank, single source shortest path, connected components, and several other algorithms. The introductory post describes all of these and also shows how to build music profiles via a user-user similarity graph.

This follow-up post to an earlier introduction to Apache Falcon discusses several features and uses of Falcon. These include backup and disaster recovery and late data handling (Falcon can compute preliminary and updated results as late data arrives). The post also discusses some areas where Falcon could use improvement and how it works well in conjunction with HCatalog.

Amazon Kinesis is a hosted service for streaming data with similarities to Apache Kafka. This post describes using the Cascading framework to perform a streaming analysis of microbatched data sourced from Kinesis. The tutorial describes how to deploy the necessary resources to ingest data, process it with Cascading on EMR, and store the results in Amazon Redshift for analysis.

The Confluent blog has an excellent post about distributed consensus with Apache Zookeeper and Kafka. It introduces consensus and atomic broadcast, describes consensus in Zookeeper, describes Kafka's in-sync replica sets, and gives an overview of a couple of failure scenarios in Kafka replication.

These slides are a fantastic overview of tuning HBase, with a focus on the Phoenix SQL-layer, for low-latency applications. The topics covered include optimizing Phoenix queries (and what to look for in EXPLAIN output), HBase regionserver config settings, JVM GC configuration options, schema considerations, HBase timeline consistent reads, OS-level tuning, and future work for Phoenix.

Slides from the recent Bay Area Apache Flink meetup have been posted. Topics covered include the state of the community (e.g. recently added and upcoming features), ongoing research topics being integrated into/with Flink (including streaming machine learning pipelines and streaming graphs), the Gelly Graph library, and an overview of stateful stream processing (with details about how Flink achieves exactly-once semantics).


Ernst & Young and Hortonworks announced a partnership this week. EY is adding data management offerings that use and build on the Hortonworks Data Platform.

Two ecosystem projects graduated from the Apache incubator this week. They are Apache Ignite, the distributed in-memory data fabric, and Apache Lens, the unified analytics platform.

Hortonworks has acquired Onyara, the company formed around Apache NiFi. Alongside the acquisition, they're announcing Hortonworks DataFlow powered by Apache NiFi. Fortune has a good article about how NiFi and Hortonworks DataFlow fits into the Internet of Things and positions Hortonworks within the competitive landscape.

Cloudera Co-Founder and Chief Strategy Officer Mike Olson has posted slides about Women in Big Data. The presentation covers both why it's important to improve diversity and some important steps towards that goal.

It's interesting to reflect on why Spark has gained so much adoption so quickly. This post describes six reasons: built-in libraries for advanced analytics, a simpler programming model, support for multiple languages, speed, vendor-neutrality, and high-growth/momentum.

The schedule for the upcoming Flink Forward conference, which takes place in Berlin on October 12/13, has been posted.

A year ago, the Spark vs. Hadoop conversation was mostly theoretical—companies weren't yet using it for any sort of production workloads. The tide seems to have turned, as is illustrated in this article which interviews several industry folks (from IBM, Zoomdata, Cognitive Scale, and Hortonworks) about how Hadoop and Spark are being used in several different contexts.

BlueData, makers of the EPIC software platform for managing Hadoop clusters, announced a $20 million series C round of funding. In addition, they announced a strategic collaboration with Intel (Intel Capital led the round of financing). Through the effort, BlueData and Intel will focus on areas like virtualization, containers, caching, and security.


The Google Cloud Platform has announced a new version of BigQuery, which adds support user-defined functions written in Javascript, support for querying data stored in Google Cloud Storage, and several other improvements.

Version 0.8 of the OSU High-Performance Big Data Benchmarks (OHB) was released this week. This version of OHB can perform microbenchmarks against Apache Hadoop 1.x/2.x, HDP, and CDH HDFS clusters. The announcements includes instructions for viewing example performance numbers.

Concurrent has announced version 1.3 of Driven, their performance management product for Hadoop applications. The new version brings support for Apache Hive and MapReduce (in addition to Cascading-based workflows), support for Cascading 3.0 and Apache Tez, improved SLA management, and more.

Hue version 3.9 was released this week. The release focussed on stability and UX, and it includes improvements to Spark, Search, and security.

Schedoscope, the scheduling framework for Hadoop, version 0.2 was released this week with support for a new MonthlyParameterization trait.

Version 4.1 of IBM BigInsights on Intel was released this week. The new version updates the base versions of eleven projects and adds Apache Kafka to the distribution. There are also improvements and new features for the proprietary portions of the distribution: Big R, BigSheets, Big SQL, Text Analytics, and Enterprise Module.

Eventsim is a tool for producing simulated web traffic data to either JSON files or an Apache Kafka queue. A post on the Interana blog describes why they built this tool, and several of the various parameters for tweaking the simulation.


Curated by Datadog ( )



Building a Streamlined Data Refinery with Pentaho and Hadoop (Palo Alto) - Wednesday, September 2

HadoopSF September 2015 Meetup (San Francisco) - Wednesday, September 2


Using HBase Co-Processors to Build a Distributed, Transactional RDBMS (Scottsdale) - Wednesday, September 2


Lambda Architecture: Real-Time Machine Learning (Jacksonville) - Tuesday, September 1

North Carolina

Designing Scalable Data Pipelines with Apache NiFi (Durham) - Thursday, September 3

New York

Powering Recommendations at Twitter: Overview and Algorithmic Details (New York) - Tuesday, September 1

Introduction to Apache Drill (New York) - Thursday, September 3


Advanced Data Science on Spark (Toronto) - Monday, August 31


A Tour of Hive (Aarhus) - Monday, August 31


Training: Stream Processing with Apache Flink (Berlin) - Wednesday, September 2


Yarn vs Mesos Deathmatch (Ramat-gan) - Monday, August 31

Apache Spark (Haifa) - Sunday, September 6


Emerging Big Data Technologies/Framew­orks (Hyderabad) - Saturday, September 5