Data Eng Weekly

Hadoop Weekly Issue #184

28 August 2016

This week's issue has several tutorials, an update on Flume/Kafka integration, and a great overview of flavors of distributed databases. In news, Altiscale is being acquired, and there's a post exploring the role of analytics RDBMSes. Finally, there were several releases this week including Hadoop and Kudu.


Apache Gearpump (incubating) is a streaming engine built on Akka. This presentation describes Gearpump's architecture by looking at several "hard parts" (e.g. exactly once, out-of-order) of stream processing that it solves.

This tutorial describes how to consume tweets using Twitter's streaming API and process them with Kafka Streams. The code to accompany the post is available on GitHub.

Apache Flume 1.7 will be switching to the new Apache Kafka client APIs. This post describes some of the resulting changes, for example support for the Flume Avro Event, different configuration settings for producers/consumers, and a one-time manual migration of offsets (from ZooKeeper to Kafka).

This post makes a compelling argument that the hardest and most impactful feature of distributed systems is multi-tenancy and not scalability. It also describes recent work and plans for multi-tenancy in Kafka.

The Syncsort blog has an interview with Hortonworks' Owen O'Malley. The first part covers the history of Hadoop and the ORC file format, and the second part covers Spark, Tez, and Hive's LLAP (Live Long and Process).

When storing data in an external table, new partitions must be added manually to the Hive metastore before their data appears in query results. This post describes how to use AWS Lambda to trigger the addition of the new partitions when data arrives in S3.
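As a rough illustration of the idea (not the code from the linked post), the sketch below shows how a Lambda handler might turn an S3 object-created notification into a Hive `ALTER TABLE ... ADD PARTITION` statement. It assumes Hive-style keys such as `events/dt=2016-08-28/part-0000.gz` and partition values that contain no quote characters. The table name `logs` and the function names are hypothetical, and submitting the statement to Hive is left out because it depends on the deployment.

```python
from urllib.parse import unquote_plus


def partition_ddl(table, bucket, key):
    """Build a Hive statement registering the partition that contains `key`.

    Assumes a Hive-style key layout: <prefix>/<col>=<value>/.../<file>.
    """
    # S3 event notifications URL-encode object keys, so decode first.
    parts = unquote_plus(key).split("/")[:-1]  # drop the file name
    spec = [p.split("=", 1) for p in parts if "=" in p]
    if not spec:
        raise ValueError("no partition directories in key: " + key)
    columns = ", ".join("{}='{}'".format(col, val) for col, val in spec)
    location = "s3://{}/{}/".format(bucket, "/".join(parts))
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'"
        .format(table, columns, location)
    )


def handler(event, context):
    """Lambda entry point for S3 ObjectCreated notifications."""
    statements = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # "logs" is a hypothetical table name used only for illustration.
        statements.append(partition_ddl("logs", bucket, key))
    # Submitting the statements to Hive is deployment-specific and omitted.
    return statements
```

Using `ADD IF NOT EXISTS` keeps the statement safe to repeat, which matters because several files usually land in the same partition and each one triggers its own notification.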

This tutorial describes how to use the XML Spark package to read data stored in XML and convert it into JSON.

This post provides a thorough overview of the types of distributed databases, the various trade-offs (and the relation to the CAP theorem), and the features of several major projects. Using this information, the post presents a decision tree for picking a database given application-specific constraints.


Videos from Big Data Day LA 2016 are available on YouTube. There are talks covering several Hadoop ecosystem projects, including Apache Beam, Apache Spark, and Apache Kudu.

Altiscale, the big data as a service vendor, is being acquired by SAP for $125 million.

This post asks whether analytic RDBMSes are still useful and argues that they are, in limited use cases (notably "hard-core business intelligence"). The post has some additional observations about this landscape, notably around Hadoop/Spark and open source.


Cask announced version 3.5 of the Cask Data Application Platform. This release adds fine-grained authorization, secured impersonation, a GUI-based Spark streaming pipeline builder, and more.

Apache Geode released version 1.0.0-incubating.M3. Geode is an in-memory, distributed storage system with SQL-like capabilities. It powers the commercial product Pivotal GemFire.

Apache Kudu 0.10.0 was announced. This release improves stability of the high availability implementation, improves Spark integration, and more. Another post on the Kudu blog describes new range partitioning features of the release.

Version 1.7.0 of Apache Ignite has been released. Ignite is an in-memory data fabric, and it includes support for SQL queries, data grids, and more. This release includes a number of bug fixes and improvements, including the ability to join non-collocated data.

Apache Hadoop 2.7.3 was released. It resolves over 200 issues, including support for writing metrics to Graphite, support for extended attributes, improvements to the NFS gateway, modernized web UIs, and more.

WePay has recently written some interesting posts about how they use Kafka and BigQuery. This week, they open-sourced their connector between those systems (built with Kafka Connect), so it's even easier to replicate their setup. The introductory post has many details about the implementation.


Curated by Datadog



Introduction to Apache SparkR by Databricks (San Francisco) - Wednesday, August 31

Heron at Twitter (Santa Clara) - Wednesday, August 31

Implementing Industry Use Cases with Apache Spark: Live Demos! (Palo Alto) - Wednesday, August 31


August Edition of MOHUG (Dublin) - Tuesday, August 30


Building Big Data Solutions in Azure Data Platform (Laurel) - Tuesday, August 30


Building Big Data Solutions in Azure Data Platform (Ashburn) - Wednesday, August 31


Toronto Apache Spark #12 (Toronto) - Wednesday, August 31


Scala Manchester (Manchester) - Wednesday, August 31


From Excel via SQL to MapReduce and SparkSQL by Carsten Langer (Dusseldorf) - Wednesday, August 31


Sustaining the Future of Data (Budapest) - Monday, August 29

Machine Learning with H2O (Budapest) - Friday, September 2


Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, September 3

Data Processing at Scale (Bangalore) - Saturday, September 3


Digging into Big Data with Google's BigQuery and Apache Spark (Colombo) - Wednesday, August 31


The Evolution of Apache Hive and an Introduction to Apache Zeppelin (Sydney) - Monday, August 29

Meet the Founders: Alan Gates and Apache Hive (Melbourne) - Tuesday, August 30