Data Eng Weekly

Hadoop Weekly Issue #165

10 April 2016

This week, there were a number of big releases, including new open source projects from LinkedIn and Airbnb. There's quite a bit of technical content covering stream processing: Spark, Flink, Kafka, and more. In news, the conference programs for both Spark Summit and HBaseCon have been released.


Zalando has published a post about how they chose Apache Flink as their stream processing framework. The post talks about the evaluation criteria and the proofs of concept built on the way to the decision, and it describes the major reasons: consistently low latencies at high throughputs, true stream processing, and developer support.

The Cloudera blog has a post from developers of, where they describe their real-time infrastructure built on Kafka, HBase, Drools, and Spark. In addition to describing the flow of data, they describe how they optimized HBase lookups and serialization, data locality between HBase and Spark, and Spark computation.

InfoQ has a presentation and video about streaming at scale with the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack. Among the topics discussed, the presentation describes why a stack like this solves the same problems as the Lambda Architecture much more simply.

The Confluent "Log Compaction" blog series has an update on what's happened with the Kafka project in March. There are a number of interesting developments, including progress on rack awareness, Kerberos support, and time-based indexes in Kafka. Lots of great content if you (like me) don't have time to keep up with the latest development efforts.

Apache Flink 1.0 introduced a new complex event processing (CEP) library. For those who aren't familiar, CEP offers a way to (among other things) detect patterns of events. This post introduces Flink's CEP Pattern APIs through a potential use case of anomaly detection based on sensor readings from servers in a data center.
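
For readers new to the idea, here is a minimal sketch of event-pattern detection in plain Python. It does not use Flink or its CEP Pattern API. The `Reading` type, the `detect_overheating` function, and the threshold, count, and window values are all hypothetical, chosen only to illustrate matching a run of hot sensor readings from one server within a time window.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Reading(NamedTuple):
    """A hypothetical sensor reading from a server in a data center."""
    server_id: str
    timestamp: float    # seconds
    temperature: float  # arbitrary units


def detect_overheating(
    readings: Iterable[Reading],
    threshold: float = 100.0,  # assumed value, for illustration only
    count: int = 2,            # how many hot readings make a match
    within: float = 10.0,      # time window in seconds
) -> Iterator[Tuple[Reading, ...]]:
    """Yield a match when `count` consecutive readings from the same
    server exceed `threshold` within `within` seconds of each other."""
    recent: Dict[str, deque] = {}
    for r in readings:
        window = recent.setdefault(r.server_id, deque())
        if r.temperature <= threshold:
            # A normal reading breaks the run of hot readings.
            window.clear()
            continue
        window.append(r)
        # Drop hot readings that fall outside the time window.
        while window and r.timestamp - window[0].timestamp > within:
            window.popleft()
        if len(window) >= count:
            yield tuple(window)
            window.clear()
```

A real CEP library expresses the same idea declaratively and handles event time, state, and fault tolerance, none of which this sketch attempts.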

The Genome Analysis Toolkit (GATK) recently announced that its next release (currently in alpha) will support Apache Spark. This post gives a brief introduction to the toolkit and shows how Spark is leveraged to detect duplicate DNA fragments.

InfoWorld has an overview of the plans for structured streaming, which is part of Spark 2.0. While microbatch will still be around, there are useful new primitives like infinite data frames and first-class support for repeated queries.

The AWS big data blog has a post on loading data into S3 and Redshift using encryption keys stored in the AWS Key Management Service (KMS). In addition to the required steps, the post describes how encryption with KMS keys works for data in AWS S3.

The Confluent blog describes how to use Kafka Connect and Kafka Streams for a non-trivial "hello world" program. Specifically, the example program pulls Wikipedia data from IRC, parses the messages, and computes various statistics. The post has a number of code snippets showing how the entire process is implemented.

This post walks through converting some simple schemas from Postgres to Cassandra, and it describes several of the major differences—replication, data types (no JSON support in Cassandra), primary keys, and eventual consistency.


The ESG blog has a recap of the recent Strata+Hadoop World conference. It notes some themes of the conference, such as building momentum for Spark, machine learning, and cloud services.

InformationWeek also has a recap from Strata, focusing on keynotes from MapR, from Pivotal, on artificial intelligence, and more.

The agenda for Spark Summit 2016, which will be held June 6-8 in San Francisco, has been announced. The conference has two days of sessions spread across five tracks.

Forbes has an interview with Cloudera CEO Tom Reilly, in which he discusses the company's biggest opportunity, the competitive market, plans to take the company public, and more.

Datanami has an article on the rise of Apache Kafka as the backbone for stream processing. It includes an interview with Confluent co-founder and CTO Neha Narkhede in which she discusses the recently launched Kafka Connect and Kafka Streams.

HBaseCon takes place in San Francisco on May 24th, and the agenda has just been announced. There are 20+ sessions across three tracks.


Apache HBase 0.98.18 and 1.1.4 were both recently released. The 1.1.4 release has a number of fixes including nine or so correctness fixes. The 0.98.18 release has just shy of 50 resolved issues (bugs, improvements, and two new features).

Apache Lens, the unified analytics interface with support for execution engines and data stores from the Hadoop ecosystem and beyond, released 2.5.0-beta. This release resolves 87 tickets, with a focus on bug fixes and improvements over new features.

Airbnb has open-sourced Caravel, their data exploration system. Caravel supports a number of features found in commercial products and can be hooked up to any system that supports a SQL dialect (via SQLAlchemy). Notably, it supports Druid for real-time analytics.

MapR has announced support for Apache Drill 1.6 for their distribution. Highlights of the release include a new storage plugin for MapR-DB, new SQL window function support, and end-to-end security. The introduction has some examples of using the MapR-DB API to load data and then querying it with Drill.

Apache Flink has announced a bugfix release for the 1.0.x line. The release resolves 23 issues and is recommended for all users of 1.0.0.

Cloudera Enterprise 5.7 was released with updates to Spark, HBase, Impala, Kafka, and more. Highlights include the promotion of Hive-on-Spark and HBase-Spark from Cloudera Labs, major performance improvements for Impala, and support for the HBase WAL on SSD.

Apache Tajo, the data warehouse system built on Hadoop, released version 0.11.2. The new version adds support for Kerberos, fixes ORC table support for Hive, and more.

LinkedIn has open-sourced Dr. Elephant, their tool for diagnosing performance issues with Hadoop and Spark jobs. Based on metrics collected from the YARN resource manager on completed jobs, Dr. Elephant evaluates heuristics to generate diagnostic reports for things like data skew, GC overhead, and more. LinkedIn reports that it solves around 80 percent of problems.
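
To give a feel for what a heuristic of this kind might look like, here is a minimal sketch in Python. It is not Dr. Elephant's code or its actual data skew heuristic. The function name `data_skew_severity`, the max-to-median ratio, and both thresholds are assumptions made purely for illustration.

```python
from statistics import median
from typing import Sequence


def data_skew_severity(
    task_input_bytes: Sequence[int],
    moderate: float = 2.0,  # assumed threshold, not from Dr. Elephant
    severe: float = 4.0,    # assumed threshold, not from Dr. Elephant
) -> str:
    """Rate skew by comparing the largest task input to the median input.

    A job whose biggest task reads far more data than a typical task is
    likely skewed, since that one task will dominate the runtime.
    """
    if not task_input_bytes:
        return "none"
    typical = median(task_input_bytes)
    if typical == 0:
        return "none"
    ratio = max(task_input_bytes) / typical
    if ratio >= severe:
        return "severe"
    if ratio >= moderate:
        return "moderate"
    return "none"
```

For example, a job with task inputs of 100, 100, 100, and 1000 bytes has a max-to-median ratio of 10, so this sketch rates it as severe.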


Curated by Datadog



Mohammed Guller: Demystifying Big Data and Apache Spark (Redwood City) - Monday, April 11

IOT Big Data Ingestion and Processing in Hadoop by Silver Spring Networks (San Jose) - Thursday, April 14


Seattle Apache Kafka Meetup (Bellevue) - Friday, April 15


Hadoop Operations for Production Systems (Eden Prairie) - Wednesday, April 13


Apache Kafka and the Confluent Platform: Overview and Roadmap, with Jay Kreps (Chicago) - Thursday, April 14


Managing Automotive Sensor Big Data Using Hadoop (Ann Arbor) - Tuesday, April 12


Understanding Spark Streaming (Philadelphia) - Thursday, April 14

New Jersey

Real-Time Aggregation, Approximation, Similarities, and Recommendations at Scale (Princeton) - Thursday, April 14

Ireland

OrientDB: Unlock the Value of Document Data Relationships + Apache Spark & GraphX (Dublin) - Monday, April 11

Hadoop Summit Dublin: Distro and ALOJA Big Data Benchmarking (Dublin) - Tuesday, April 12

Data Flow Using Apache NiFi (Dublin) - Tuesday, April 12

Hands-On Introduction to Spark & Zeppelin (Dublin) - Tuesday, April 12

Hadoop and MongoDB Scaling at Datahug and RAFTlike MongoDB Elections (Dublin) - Tuesday, April 12

Hadoop Summit Night (Dublin) - Tuesday, April 12


Real-time Search and Insights with Apache Kafka (London) - Wednesday, April 13


Spark Kick Off Meetup (Munich) - Thursday, April 14


18th Swiss Big Data User Group Meeting (Zurich) - Monday, April 11


Data Processing @SCALE (Tel Aviv-Yafo) - Monday, April 11


Ingesting Unbounded File Data + Streaming Log Analysis Using Apex (Pune) - Wednesday, April 13


Hadoopy Birthday: Hadoop Turns 10, with Doug Cutting, Father of Hadoop (Singapore) - Monday, April 11


First Organizational Meeting (Christchurch) - Thursday, April 14