Data Eng Weekly

Hadoop Weekly Issue #168

01 May 2016

Kafka Summit was this week in San Francisco, so it's no coincidence that there's a lot of Kafka content in this issue. In addition to that, there are great posts on performance in Impala, Kudu, and Druid. In news, Apache Apex has graduated to a top-level project and Qubole has open-sourced StreamX.


This post gives a quick overview of which operations on Spark RDDs preserve the existing partitioning of the data and which do not. In particular, mapValues and filter preserve the partitioning, but map does not.
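To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy model, not the Spark API: `ToyRDD`, `hash_partitioner`, and `misplaced_keys` are hypothetical names introduced only for illustration. The idea it models is that an operation which cannot change keys can safely keep the partitioner, while an arbitrary map may rewrite keys and so has to drop it.

```python
from typing import Any, Callable, List, Optional, Tuple

Pair = Tuple[Any, Any]
Partitioner = Callable[[Any], int]


def hash_partitioner(num_partitions: int) -> Partitioner:
    """Return a function that maps a key to a partition index."""
    return lambda key: hash(key) % num_partitions


class ToyRDD:
    """Toy stand-in for a keyed RDD: a list of partitions plus an optional partitioner."""

    def __init__(self, partitions: List[List[Pair]],
                 partitioner: Optional[Partitioner] = None):
        self.partitions = partitions
        self.partitioner = partitioner

    @classmethod
    def partition_by(cls, pairs: List[Pair], num_partitions: int) -> "ToyRDD":
        part = hash_partitioner(num_partitions)
        buckets: List[List[Pair]] = [[] for _ in range(num_partitions)]
        for key, value in pairs:
            buckets[part(key)].append((key, value))
        return cls(buckets, part)

    def map_values(self, f: Callable[[Any], Any]) -> "ToyRDD":
        # Keys are untouched, so every record is still in the right partition.
        return ToyRDD([[(k, f(v)) for k, v in p] for p in self.partitions],
                      self.partitioner)

    def filter(self, pred: Callable[[Pair], bool]) -> "ToyRDD":
        # Records are only dropped, never moved or re-keyed.
        return ToyRDD([[kv for kv in p if pred(kv)] for p in self.partitions],
                      self.partitioner)

    def map(self, f: Callable[[Pair], Pair]) -> "ToyRDD":
        # f may change the key, so the old partitioner can no longer be trusted.
        return ToyRDD([[f(kv) for kv in p] for p in self.partitions], None)


def misplaced_keys(rdd: ToyRDD, partitioner: Partitioner) -> List[Any]:
    """Keys whose current partition differs from where the partitioner would put them."""
    return sorted(k for i, p in enumerate(rdd.partitions)
                  for k, _ in p if partitioner(k) != i)


base = ToyRDD.partition_by([(i, i * 10) for i in range(6)], 3)
scaled = base.map_values(lambda v: v + 1)
evens = base.filter(lambda kv: kv[0] % 2 == 0)
shifted = base.map(lambda kv: (kv[0] + 1, kv[1]))  # rewrites every key

print(scaled.partitioner is base.partitioner)   # partitioner kept
print(evens.partitioner is base.partitioner)    # partitioner kept
print(shifted.partitioner is None)              # partitioner dropped
print(misplaced_keys(shifted, base.partitioner))
```

In this model, every key in `shifted` ends up in a partition that the original partitioner would not have chosen, which is why a later key-based operation on such a dataset would need to shuffle again.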

This post describes how to use Conda to build a self-contained Python environment (with add-ons like pandas) that can be shipped to nodes in the cluster as part of a Spark job. This allows for running PySpark jobs without having any Python libraries natively installed on the host operating system. The solution can also be used with SparkR.

The Datadog blog has a three-part series about monitoring Kafka. The first post has a detailed overview of key metrics for brokers, producers, consumers, and ZooKeeper. The second post describes how to view metrics exposed via JMX using JConsole and other tools, and the third part describes the Datadog integration.

Salesforce has written about the growth of Kafka within their organization. Originally powering analytics of operational metrics, the system is now transitioning to a platform that'll power a number of systems. Salesforce runs Kafka across many data centers and uses MirrorMaker for replication and aggregation between clusters.

The Metamarkets blog has an interesting post about optimizing large-scale distributed systems. Druid, their distributed data store, recently added a "first-in-first-out" query mode, which they've tested across large clusters under heavy load. The team hypothesized what would happen and collected interesting metrics to test those hypotheses.

The Google Cloud Big Data blog has a post about BigQuery's internal storage format, Capacitor, and other optimizations the system makes to efficiently store data.

The Apache Kudu (incubating) blog has an overview of some recent performance analysis and tuning done as a result of using the YCSB tool to test the system.

Impala 2.5 has a number of performance improvements that show significant speedups across TPC benchmarks and other areas. The improvements include runtime filters, faster query startup, improved cardinality estimates, LLVM codegen support for SORT and DECIMAL, faster metadata-only queries, and more.

This post describes how to configure MariaDB for the Hive Metastore in order to support high availability.

The Altiscale blog has a post describing the process of finding a bug related to the NodeGroup implementation (this is a follow-up to a post from March). If you've ever been frustrated by not finding the root cause of a Hadoop (or any distributed system) bug, this is a good example of how hard it can be even for developers working at a company selling Hadoop services.

Netflix is currently operating over 4,000 Kafka brokers across 36 clusters. Running Kafka in the cloud requires some tradeoffs, and the team has worked to balance cost and data loss (daily data loss is <0.01%). The blog post shares the team's experience running Kafka in AWS: typical issues (outliers), deployment strategy (small clusters, isolated ZooKeeper clusters), cluster-level failover, support for AWS availability zones, a Kafka UI visualizer, and more.

The Amazon big data blog has a post about working with encrypted data stored in S3 from Amazon EMR. This integration supports both client-side and server-side encryption (with the help of AWS KMS).

TubeMogul has written about the history of their big data platform, which analyzes data from over a trillion requests per month. The team added Amazon EMR early on, introduced Storm for real-time processing, and eventually landed on Qubole for big data as a service.

Caffe, the deep learning framework, has a Spark integration—CaffeOnSpark. MapR has written about running it on a MapR YARN cluster, including some of the performance tuning they performed.


Apache Apex, the big data streaming and batch system, is now a top-level project at the Apache Software Foundation. Apex entered the incubator last August.

Heroku Kafka is a new managed service for Kafka from the folks at Heroku. The product is currently in closed beta.

A post on the MapR blog explains why gender diversity is important and mentions the Women in Big Data Forum, which aims to encourage women to enter the industry. The Women in Big Data Forum is holding a workshop at MapR in San Jose this week.


StreamX is a new open-source project from Qubole for copying data from Kafka to object stores like Amazon S3. Qubole is offering StreamX as a managed service.

SnappyData is a new platform (and company) for OLAP and OLTP queries on streaming data. SnappyData is powered by Apache Spark and GemFire's in-memory store (which is related to Apache Geode).

Apache Geode (incubating), the distributed data platform targeting high performance and low latency, has released version 1.0.0-incubating.M2. The new version has several new features including site-site WAN connectivity.

Version 0.9.0 of Apache Knox, the REST API gateway for Hadoop, was released. The new version adds UI support for Ranger and Ambari among other improvements and bug fixes.


Curated by Datadog



Apache Hadoop YARN Committers/Contributors Meetup #1 Redux (Santa Clara) - Tuesday, May 3

May 2016 Meetup (San Francisco) - Tuesday, May 3

Technical Workshop on Apache Spark and Apache Drill (San Jose) - Wednesday, May 4

Fault-Tolerant HDFS R/W with Apache Apex + Apex Benchmarks (San Jose) - Wednesday, May 4

Addressing Big Data Security and Governance Requirements with Apache Drill (San Jose) - Thursday, May 5


Data Wrangling & Analysis (Denver) - Tuesday, May 3


Speaker Series: Advanced Spark with Chris Fregly (Salt Lake City) - Monday, May 2


St. Louis Hadoop Users Group Meetup (Saint Louis) - Wednesday, May 4


Real-Time Processing with Flume, Kafka, and Storm (Atlanta) - Wednesday, May 4


May the 4th Be with You: Apache Metron Intro and CodeLab CyberSecurity Analytics (Vienna) - Wednesday, May 4

New Jersey

Flink 1.0: Continuous Streaming Application Framework (Hamilton) - Tuesday, May 3


Spark in the Cloud (Vancouver) - Tuesday, May 3


Batch Processing, New Tech, and Lightning Talks (London) - Wednesday, May 4


Spark Meetup Auckland May 2016 (Auckland) - Tuesday, May 3