Data Eng Weekly

Hadoop Weekly Issue #154

24 January 2016

Often there isn't a clear theme to a week, but stream processing is the hot topic this issue. Google has submitted the Dataflow SDK to the Apache Incubator, there's a great article on streaming data processing from O'Reilly, and there are several articles about Apache Kafka. In addition, there is some fundraising news for two Hadoop ecosystem companies, there are several releases, and there is a mix of other content.


Datanami has a thorough comparison of SQL-on-Hadoop engines (both vendor-backed and open-source). The post has a useful bucketing of engines into batch-oriented, interactive, and in-memory as well as a discussion of other important considerations (such as supported file formats). It also notes that we'll likely see some consolidation in the near future, which is important to keep in mind as one evaluates tools.

The acmqueue has a great article about immutability in computing. The decreasing cost of storage has enabled systems built on immutable/append-only components such as GFS/HDFS (which are discussed in this post) and Kafka. In addition to these, the article explores several other types of systems (e.g. relational databases, distributed systems), hardware (SSDs), and system patterns (copy-on-write, replication in distributed systems, fault tolerance) that make use of or provide immutable semantics.

O'Reilly has a long, in-depth article about streaming data processing. It's a follow-up to the recent "Streaming 101" post, and it covers topics like event-time vs. processing-time, windowing, watermarks, triggers, and accumulation. The article is full of figures and animations describing these core concepts that make up the what, where, when, and how of data processing.
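
For readers new to these terms, here is a minimal Python sketch of fixed event-time windows with a heuristic watermark. It is not taken from the article or from any SDK: the names (WINDOW_SIZE, allowed_lag, process) and the "max event time seen minus a fixed lag" watermark are assumptions chosen for illustration. Events arrive in processing-time order but are grouped by their event time, and data that shows up after its window has been emitted is treated as late.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed (tumbling) window length


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def process(events, allowed_lag=30):
    """Sum values per event-time window.

    events: iterable of (event_time, value) pairs in arrival order, which
    may differ from event-time order.
    Watermark (an assumed heuristic): max event time seen minus allowed_lag.
    A window is emitted once the watermark reaches its end; data arriving
    for an already-emitted window is recorded as late and dropped.
    """
    open_windows = defaultdict(int)
    emitted = {}
    late = []
    max_seen = None
    for event_time, value in events:
        start = window_start(event_time)
        if start in emitted:
            late.append((event_time, value))
            continue
        open_windows[start] += value
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        watermark = max_seen - allowed_lag
        # Emit every open window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_SIZE <= watermark:
                emitted[s] = open_windows.pop(s)
    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        emitted[s] = open_windows[s]
    return emitted, late
```

With the input [(5, 1), (50, 2), (70, 3), (20, 4), (130, 5), (40, 6)], the out-of-order event at time 20 still lands in the first window because the watermark has not yet passed it, while the event at time 40 arrives after that window was emitted and is reported as late.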

The Databricks blog has a post on the new features of MLlib in Apache Spark 1.6. The post describes (and links to relevant examples of) a few of these—pipeline persistence, new ML algorithms, and improved MLlib integration for SparkR.

This presentation describes how Rocana is building a search system for large-scale (tens of TB/day) data atop Kafka and HDFS. The slides present the reasons for building a custom search solution, the architecture of the system, how events are collected/partitioned, the write path through Kafka to HDFS, and the basics of the query system (which takes advantage of things like HDFS short-circuit reads).

On the heels of the new producer API in Kafka 0.8.1, version 0.9.0 introduced a new Consumer API. The new API removes the distinction between a simple and a high-level client, removes the dependencies on the Scala runtime and ZooKeeper, adds security extensions, and more. The post describes how to get started with the new client via code snippets, demonstrates an example polling client, discusses delivery semantics (which is related to offset management), and more.
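
The link between delivery semantics and offset management can be shown without a broker. The following Python sketch does not use the Kafka client API; the in-memory log, the consume and run helpers, and the crash_at parameter are all hypothetical. It only models the order of two steps, committing an offset and processing a record, and what a restarted consumer sees if a crash lands between them.

```python
def consume(log, committed, processed, commit_first, crash_at=None):
    """Consume records starting at the committed offset.

    commit_first=True  models commit-then-process (at-most-once).
    commit_first=False models process-then-commit (at-least-once).
    crash_at simulates a crash between the two steps for that offset.
    Returns the committed offset a restarted consumer would resume from.
    """
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return committed  # crashed after commit, before processing
            processed.append(log[offset])
        else:
            processed.append(log[offset])
            if offset == crash_at:
                return committed  # crashed after processing, before commit
            committed = offset + 1
        offset += 1
    return committed


def run(log, commit_first):
    """Run once with a crash at offset 1, then restart from the committed offset."""
    processed = []
    committed = consume(log, 0, processed, commit_first, crash_at=1)
    consume(log, committed, processed, commit_first)
    return processed
```

For the log ["a", "b", "c"], committing first loses "b" after the restart, while processing first handles "b" twice. Which trade-off is acceptable depends on whether the downstream processing tolerates duplicates.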

This post explores a gotcha related to the old Kafka Producer API's default support for byte arrays. It's a clear description of a rather subtle issue, and it provides good context on some of the Kafka Producer API internals.


Hadoop Summit Europe is still a couple of months away, but the Hortonworks blog has previews of two of the community choice winners. The first is about Apache Flink at Capital One, and the second discusses machine learning with big data.

Google, along with developers from a number of other companies, has proposed incubating the Google Dataflow SDK at the Apache Incubator. The SDK provides a high-level API for batch and stream processing with a pluggable backend (Spark, Flink, a single-node local runner, and Google's hosted Cloud Dataflow are all supported).

Datanami has a summary of key points from the recent Forrester report on Hadoop distributions. It mentions the distribution leaders (Cloudera, Hortonworks, IBM, and MapR), some of the differentiators among distros, market presence, and more.

Hortonworks announced this week that they're seeking an additional $100 million in funding as part of a secondary share offering. Hortonworks stock was down after the announcement but made back some ground towards the end of the week.

Qubole, makers of the Qubole Data Service, announced that they've secured $30 million in Series C financing. In the post, Qubole notes that customers are processing over 250 petabytes each month using their platform across Amazon Web Services, Google Cloud Platform, and Microsoft Azure.


Version 0.1.0 of kudu-python was recently released. This is a Python API to Apache Kudu (incubating) that uses the C++ Client API.

Apache Apex has announced version 3.3.0-incubating of the Malhar library. Malhar is a library of operators and adapters for real-time streaming applications. The new release contains a number of bug fixes, improvements, and new features such as support for anti and semi joins and support for Kafka 0.9.0's new consumer API.
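
For anyone unfamiliar with the join types mentioned, here is a small batch-style Python sketch of their semantics. It is not Malhar's operator API, and the function names and sample records are hypothetical. A semi join keeps the left-side rows that have at least one match on the right; an anti join keeps the left-side rows that have none. Neither adds columns from the right side or duplicates left rows.

```python
def semi_join(left, right, left_key, right_key):
    """Return left rows whose key appears at least once on the right."""
    right_keys = {right_key(r) for r in right}
    return [row for row in left if left_key(row) in right_keys]


def anti_join(left, right, left_key, right_key):
    """Return left rows whose key never appears on the right."""
    right_keys = {right_key(r) for r in right}
    return [row for row in left if left_key(row) not in right_keys]


# Hypothetical sample data: customers and the orders they placed.
customers = [
    {"id": 1, "name": "ann"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "cy"},
]
orders = [
    {"order": 10, "customer_id": 1},
    {"order": 11, "customer_id": 1},
    {"order": 12, "customer_id": 3},
]

with_orders = semi_join(
    customers, orders, lambda c: c["id"], lambda o: o["customer_id"]
)
without_orders = anti_join(
    customers, orders, lambda c: c["id"], lambda o: o["customer_id"]
)
```

Here with_orders contains ann and cy exactly once each (ann's two orders do not duplicate her row), and without_orders contains only bob.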

Cloudera has announced version 2.0 of Cloudera Director, their tool for managing CDH clusters in the cloud. The new release adds support for spot instances, high availability, Kerberos configuration, automatic job submission, RHEL 7.1, and more. The Cloudera blog has many more details on Cloudera Director.

Spark-TS 0.2.0 is the second version of the Spark time series library from Cloudera. The new release switches to java.time in order to support nanosecond precision, offers a more developed Java API, and more.

Version 3.3.0 of the Cask Data Application Platform was released. Major features of the new release include improvements to CDAP metadata and the Cask Hydrator.


Curated by Datadog



Big Data Application Meetup (Palo Alto) - Wednesday, January 27

Data Ingest at Scale: Lessons from PlanetLabs and Uber (Mountain View) - Wednesday, January 27

Building and Scaling Data Pipelines (San Francisco) - Wednesday, January 27

Big Data for the Enterprise, Part 1 (San Francisco) - Wednesday, January 27

Evening with Martin Odersky! + Spark Approximations + Twitter Algebird (San Francisco) - Thursday, January 28


Seattle Scalability Meetup (Seattle) - Wednesday, January 27


Apache Spark 101: Introduction and What's New (Englewood) - Tuesday, January 26


Hadoop, HBase, and Spark by John Leach (Houston) - Thursday, January 28


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, January 25


SPARKling Analytics by Ravi Nair (Jacksonville) - Tuesday, January 26

Apache NiFi: Joe Witt of Hortonworks (Orlando) - Tuesday, January 26


Keeping Cool Under Pressure with Apache NiFi (Atlanta) - Thursday, January 28


Interactive Visualization + Leveraging Spark in a Hybrid OLTP/OLAP (Reston) - Tuesday, January 26


DataPhilly January 2016 (Philadelphia) - Wednesday, January 27

New York

Real-Time Big Data (New York) - Wednesday, January 27


Toronto Apache Spark #5 (Toronto) - Wednesday, January 27


The Data Pub January 2016 (Mexico City) - Monday, January 25


Building Your First Spark Streaming Application (Bath) - Thursday, January 28


Big Data, No Fluff: Let’s Get Started with Hadoop #5 (Oslo) - Thursday, January 28


Spark and the Combination of Different Modules (Madrid) - Wednesday, January 27


Establishment of a Hadoop Big Data/Mesos Infrastructure (Paris) - Wednesday, January 27


Data Processing Using Amazon Web Services: A Panel Discussion (Antwerpen) - Tuesday, January 26

Kafka and HortonWorks Use Cases (Brussels) - Tuesday, January 26


Apache Flink Meetup Berlin #13: Roadmap 2016/Implementing BigPetStore (Berlin) - Tuesday, January 26

Python & Spark by Thorsten Greiner (Dusseldorf) - Wednesday, January 27

Big Data, Berlin (Berlin) - Thursday, January 28


Getting the Most Out of HBase! Transactions and Advanced Caching (Tel Aviv-Yafo) - Wednesday, January 27

From Legacy DWH to State-of-the-Art Hadoop & Vertica Data Platform by AOL (Tel Aviv-Yafo) - Sunday, January 31


Exploring the Goodness of MapReduce, Hive & Spark (Gurgaon) - Thursday, January 28

Interactive Analytics Using Apache Spark (Bangalore) - Saturday, January 30

Spark Streaming and MLlib (Hyderabad) - Saturday, January 30