Data Eng Weekly

Hadoop Weekly Issue #166

17 April 2016

Hortonworks had a number of announcements this week at Hadoop Summit Europe, and there's coverage of these throughout the newsletter. Apache Storm hit version 1.0.0 with some impressive new features. In technical news, there are multiple articles about scaling Kafka/services built on Kafka and testing of distributed systems. And if you missed Hadoop Summit, videos of many of the presentations have already been posted.


Smyte has written about their infrastructure for realtime spam and fraud detection based on a stream of event data. The initial event processing system was built with Kafka, Redis, Secor, and S3. To scale further and cheaper, they moved to a disk-based solution built using the Redis protocol with RocksDB and Kafka for replication.

This post describes combining rsyslog, Kafka, and AWS with the ELK stack (ElasticSearch, Logstash and Kibana) to address back-pressure, scaling problems, and maintenance issues. The post covers the rsyslog integration with Kafka and schema tricks with ryslog as well as how to run Kafka, Zookeeper, and others in AWS auto scaling groups.

Hortonworks has a post about the upcoming data governance features of Apache Atlas and Apache Range. Among these are classification-based access controls, data expiry-based policies, location-specific policies, prohibition of dataset combinations, and cross component lineage (e.g. tracking data from Kafka to Storm to Hive).

Apache HAWQ (incubating) is the SQL engine based on Greenplum for querying data on HDFS. This post discusses the classic design of and upcoming improvements to HAWQ, including how it differs from Spark and MapReduce, some of the challenges of a classic MPP design with Hadoop, and how HAWQ's new design combines techniques from MPP and batch to get the best of both worlds.

The Cloudera blog has a post describing tools they use for fault injection and network partitioning to test distributed systems like Hadoop. Their AgenTEST tool can inject network problems (e.g. dropping packets), saturate resources (e.g. CPU, IO, disk space), and more. When testing network partitions, they evaluate circular partitioning, bridge partitioning, and more.

The Hortonworks blog has a look ahead to HDP 2.4.2 which will include new versions of Spark and Zeppelin. Looking past that release, it has a preview of Spark 2.0 and upcoming features of Zeppelin.

Cask has written about how they do long-running tests to evaluate correctness of distributed systems before and after infrequent events like region compaction in HBase.

This article describes how to use SparkR with Amazon EMR to do geospatial analysis. Using SparkR's Hive integration, the first step is to define a Hive external table based on data in S3. From there, data can be collected into memory and analyzed directly in R making it easy to produce high-quality visualizations.

MapR has a tutorial of analyzing team-level Major League Baseball statistics using Pig and Hive. Pig is used for initial data munging, and Hive is used to provide SQL-based data querying. Using the Hive ODBC driver and the Hive server, Microsoft Excel can be used to fetch and analyze the data.

SignalFX pushes 70+ billion messages/day through a 27-node Kafka cluster. Based on their experience scaling Kafka to such high volumes, they've shared a number of tips around instrumenting Kafka, when to alert (i.e. when the log flush latency increases and their are under-replicated partitions), and scaling out Kafka.

The dataArtisan's blog has an article about Flink's ability to count streams of data efficiently, at low-latency, and correctly. To demonstrate efficiency, the team has run the recent Yahoo! streaming benchmark at higher throughputs (and a few other changes). In terms of correctness, the post highlights Flink's ability to differentiate event and processing time (using Star Wars movie chronology as an analogy). Finally, the post describes Flink's upcoming ability to query in-memory state of live jobs in an upcoming version.

This tutorial shows how to convert a stream of text data from a tcp socket into a Spark Streaming source.

This post describes how to prevent inadvertent addition of AWS credentials to a patch or git commit when building Hadoop. In addition to some Hadoop-specific recommendations, the post suggests using the git-secrets tool for preventing accidental commits of your access/secret key. If you're using the Hadoop S3 bindings, there's also a call to help evaluate new patches.

Big Data and Brews has interviews with Ted Dunning of MapR and Jacques Nadeau of MapR. Apache Arrow is a topic of both interviews, and Ted also talks about MapR, Drill, and more.


DataEngConf was recently held in San Francisco. This post has an overview of talks from Uber, Stripe, Microsoft, Instacart, and Jawbone. It also describes a major theme of the conference: "Data Science in real world is a product design and engineering discipline."

Hortonworks made several announcements at Hadoop Summit Europe, which took place last week in Dublin. ZDNet has coverage of the highlights, which include an extended partnership with Pivotal (who will now resell HDP), a reseller agreement with Syncosrt, and tech previews of Atlas, Ranger, Zeppelin, and Metron. The article also discusses some of the differences in offerings between Hortonworks, Cloudera, and MapR.

Flink Forward 2016 will be held in Berlin, Germany in September. The call for submissions is open until the end of June.

Videos of presentations from Hadoop Summit Dublin are available on YouTube. As one would expect, there are a presentations covering all different parts of the Hadoop ecosystem.


Metascope is a new tool that provides metadata management alongside of Schedoscope for Hadoop clusters. It provides lots of insight into data, such as the data lineage, via a web interface. It also supports search, inline documentation, a REST API, and more.

Apache HBase 1.2.1, which resolves 27 issues since the original 1.2.0 release, was announced this week. The release announcement highlights four priority fixes.

Version 0.12.0 of Apache Mahout, the machine learning library, has been released. With this release, Mahout's "Samsara" math environment now supports Apache Flink and is platform independent. The release announcement shares more about the Flink integration, known issues, and the project roadmap.

Apache Storm 1.0.0 was released this week. Highlights of the release include improved performance (~3x for most use cases), a new distributed cache API, high availability for the nimbus node, automatic backpressure, dynamic worker profiling, and more.

Apache Kudu (incubating) version 0.8.0 was released this week. This release adds an Apache Flume sink, implements several improvements, and fixes a handful of bugs.

Cloudbreak, the system for provisioning Hadoop clusters on cloud infrastructure using Docker, released version 1.2 this week. New features include support for OpenStack and a new recipe feature for running custom server provisioning scripts.

Cloudera announced Cloudera Enterprise 5.4.10, which has fixes for Flume, Hadoop, HBase, Hive, Impala, and more.

Presto Accumulo is a new project providing a Presto connector for reading/writing data from/to Accumulo.


Curated by Datadog ( )



Open Source Roundtable (Mountain View) - Tuesday, April 19

SDBigData Meetup #14 (San Diego) - Wednesday, April 20

52nd Bay Area Hadoop User GroupMeetup (Sunnyvale) - Wednesday, April 20

High Performance Spark +Spark Committers +Internals +Perf Tuning +Profiling (San Francisco) - Thursday, April 21

Data in Motion: Simplifying Security & Building Custom Integrations (Palo Alto) - Thursday, April 21

Hands-On Introduction to Spark & Zeppelin (Santa Clara) - Thursday, April 21


CassieQ: A Distributed Queue Built on Cassandra (Bellevue) - Monday, April 18

GraphFrames, Survival Analysis, and SnappyData + Spark (Bellevue) - Wednesday, April 20

Going Beyond Apache HBase (Bellevue) - Wednesday, April 20

Putting Apache Drill to Use: Best Practices for Production Deployments (Seattle) - Thursday, April 21


ETL to ML: Use Apache Spark as an End-to-End Tool for Advanced Analytics (Chicago) - Monday, April 18

Apache Flink 1.0: A New Era for Real-Time Streaming Analytics (Chicago) - Tuesday, April 19

Spark & H2O: Sparkling Water + Cybersecurity with Machine Learning (Chicago) - Tuesday, April 19


How HealthTrust Is Using Cloudera (Franklin) - Wednesday, April 20


Real-Time Data Processing Using ZooKeeper, Kafka, and Storm (Huntsville) - Tuesday, April 19


Apache Flink 1.0 (McLean) - Wednesday, April 20


Socialize and Discuss the Roadmap of Azure Services (Chevy Chase) - Monday, April 18

District of Columbia

Hilton Presents Their Hadoop Journey & Apache Calcite Introduction (Washington) - Wednesday, April 20

New Jersey

Hands-On Workshop: Scala + Spark (Hamilton) - Tuesday, April 19

New York

SnappyData: Create Robust Analytic Applications with Spark Streaming (New York) - Wednesday, April 20

Apache Flink 1.0 (New York) - Thursday, April 21