Data Eng Weekly

Hadoop Weekly Issue #215

07 May 2017

Kafka Summit NYC is this week—if you're attending, please send any relevant presentations/posts my way! Relatedly, there are several posts this week on Kafka and distributed logs as well as posts on Spark (genomics analysis, as a replacement for HPC models, and integration with Kubernetes). There are also quite a few releases this week—including the 2.0 of Apache Kylin, which has some impressive new features for the OLAP system.


Corfu, which is also implemented as an open-source project on github, is a distributed, shared-log. It's different from other systems in that there is no leader—clients write directly to the log. The paper describes a number of optimizations (such as a sequencer node for handing out offsets into the log), and the vCorfu paper adds materialized streams to Corfu. The Morning Paper has a great overview and explanation of the important concepts for both Corfu and vCorfu.

The Cloudera blog has an overview of Hail, which is a tool for doing genomics analysis with Apache Spark. Hail has a simple and powerful programming model, which is demonstrated with example runs to compute quality of the samples and to perform a simple genome-wide association study.

The Confluent blog has an article that proposes using Apache Kafka streaming's KTables as a way to perform windowing of data as alternative to the Apache Beam-model, which is based on triggers and watermarks. The authors argue that this is simpler to reason about, and it is easy to tune (tweaking output buffering and commit latency).

Uber has written about their experimentation platform, which is used to roll out new features to Uber applications. The post describes the statistical tests behind their platform, as well as the architecture which is composed of Hadoop, Vertica, and Scala.

The Morning Paper also covered the CherryPick paper this week. Using a bayesian model, it finds a good cloud config (e.g. instance type, cluster size) to keep cost down. I find this interesting for at least two reasons: 1) the variance the authors find in runtime demonstrates why its important to be careful with vendor benchmarks and 2) it sure would be great if tools started to do these types of optimizations when running on a cloud platform.

This post describes a real-world example of moving an HPC computation to Apache Spark. Interestingly, the existing C++ code is used and wrapped with PySpark (the .so's distributed using Spark's addFiles API).

The Hortonworks blog has a look at some of the recent improvements to the integration between Apache Zeppelin and the Livy job server for running Spark jobs.

Amazon EMR supports running HBase with S3 as the data store instead of HDFS. This is a relatively new feature, though, so a lot of HBase clusters in AWS are still backed by HDFS. This post provides a number of options for migrating an HBase cluster to S3 from HDFS.

Kubernetes has a lot of momentum for cluster orchestration, and the tooling is pretty great for getting a cluster going. Piggybacking on that can actually be a great way to get other distributed systems, like Spark up and running. That's exactly what this post (and companion github repo) describes.

This post explores some of the options for joining streams with Kafka Streams APIs. It describes a local join with a GlobalKTable (when data fits in memory) and a distributed left outer join. The outer join is tricky, so there are a bunch of diagrams to describe what can go wrong and how to fix it.


Software Engineering Daily had an interview this week with Martin Kleppmann, the author of "Data Intensive Applications." If you're a big podcaster, you should queue this one up.

Apache CarbonData, which is a columnar file format for big data, has graduating from the Apache incubator. CarbonData seems to have flown somewhat under the radar (with big names backing Apache Parquet and ORCfile), but it has some impressive features like indexing for random access and support for data update/delete.

The newly branded DataWorks Summit is in just over a month in San Jose. Follow the link below for a discount code.


Apache Flink 1.2.1 was recently released with some critical fixes over the 1.2.0 release.

Apache Mahout 0.13.0 was released this week. The new version has features for running GPU-accelerated computations (including deep learning) and the integration of the ViennaCL linear algebra library.

Apache HBase 1.1.10 was released with several bug fixes, including a zombie-Master fix.

Apache Kylin, which is a OLAP system for big data, has announced their 2.0.0 release. Major new features include Spark Cubing and support for Snowflake schemas.

The 2.1.0-incubating release of Apache Trafodion includes a new python installer, integration with Apache Ambari, performance enhancements, and more (over 300 resolved issues). Trafodion is a transactional SQL-on-Hadoop engine, with support for OLTP workloads.


Curated by Datadog ( )



Big Data Apps Meetup: EDW Optimization, Apache Beam and Apache Nifi! (Palo Alto) - Wednesday, May 10

Data Science in the Enterprise by Amr Awadallah, CTO Cloudera (Palo Alto) - Thursday, May 11


An Elephant in Your Lap? Run a Hadoop Cluster as Docker Containers! (Fairway) - Thursday, May 11


From Hadoop to Spark: Opportunities and Challenges in Making the Switch (Chicago) - Tuesday, May 9


Real-Time Data Analysis with MapR! (Grand Rapids) - Wednesday, May 10

North Carolina

Kafka for .NET Developers (Morrisville) - Wednesday, May 10


Apache Hive 2.0 and Beyond (Arlington) - Tuesday, May 9


The Future of Spark, Spark Intro, and Spark with RapidMiner (Boston) - Thursday, May 11


Big Data Day (Sao Paulo) - Friday, May 12


Apache Flink Meets Apache Ignite (London) - Wednesday, May 10


Big Data, No Fluff: Let’s Get Started with Hadoop #12 (Oslo) - Thursday, May 11


Meetup #6 (Lille) - Thursday, May 11


Spark Meetup May @Google (Sydney) - Tuesday, May 9

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit