Data Eng Weekly


Hadoop Weekly Issue #208

12 March 2017

Strata+Hadoop World is this week in San Jose. Please keep an eye out for talks and announcements to send my way (some have already snuck in the week before). But in the meantime, there are two great papers this week covered by The Morning Paper, several articles on YARN, a post about the history of the HDFS audit log, and a discussion of HBase optimizations at Alibaba.

Technical

The StreamSets Data Collector can be used to do change data capture from MongoDB and write to a configurable destination system. This post shows how you can mirror your data to MapR-DB as JSON.

https://streamsets.com/blog/json-mapr-db-streamsets-data-collector/

The Morning Paper has a summary of the HopFS paper that was recently presented at the USENIX Conference on File and Storage Technologies (FAST). Based on Apache Hadoop DFS version 2.0.4, HopFS replaces the NameNode/JournalNode/ZooKeeper cluster with stateless namenodes that store metadata in a MySQL Cluster for high availability. The summary has some interesting notes on the more subtle design considerations for locking/transactions. The authors analyzed performance based on real-world recordings of metadata operations on a large Hadoop cluster at Spotify.

https://blog.acolyer.org/2017/03/06/hopfs-scaling-hierarchical-file-system-metadata-using-newsql-databases/

Hortonworks has two posts on the future of YARN. The first introduces the notion of "multi-colored" YARN, which is the promise that YARN can take on even more types of applications and workloads, especially driven by work to add Docker support. The second article describes one of these workloads, a GPU-heavy TensorFlow application, and provides a tutorial and sample code for getting started with TensorFlow on YARN.

https://hortonworks.com/blog/data-lake-3-0-part-2-multi-colored-yarn/
https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/

Using the Attach API extension to the JVM, the application described in this post monitors the heap values in a JMX MBean to detect when an out-of-memory situation is about to happen.

https://developer.ibm.com/hadoop/2017/03/06/monitor-memory-usage-notify-email/

Automated testing with a distributed system is usually pretty tricky. This post describes a setup involving Docker, GitLab CI, and Druid for testing via Python.

http://blog.godatadriven.com/continuous-integration-with-druid

Another paper from FAST looks at what happens to several popular distributed systems (including Kafka, ZooKeeper, and Cassandra) in the face of data corruption or local file system errors. Of the systems tested, every one experienced some kind of failure due to assumptions about the underlying storage.

https://blog.acolyer.org/2017/03/08/redundancy-does-not-imply-fault-tolerance-analysis-of-distributed-storage-reactions-to-single-errors-and-corruptions/

The HDFS audit log is a great tool for tracking and debugging what is going on in your HDFS cluster. This post describes how it came out of a need at Yahoo! (and includes some interesting history about the early days of Hadoop).

https://effectivemachines.com/2017/03/08/unofficial-history-of-the-hdfs-audit-log/
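
As a rough illustration of the kind of debugging the audit log enables (this is not taken from the post), here is a minimal Python sketch that counts commands per user. It assumes the common layout in which each entry ends with tab-separated key=value pairs after "audit: ". The sample lines and the helper names parse_audit_line and commands_by_user are made up for the example, and real entries vary by Hadoop version and log4j configuration.

```python
from collections import Counter

# Hypothetical sample entries; the exact fields depend on your Hadoop version.
SAMPLE_LINES = [
    "2017-03-08 10:15:00,123 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\tsrc=/data/a.txt\tdst=null\tperm=null",
    "2017-03-08 10:15:02,456 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\tsrc=/data/b.txt\tdst=null\tperm=null",
    "2017-03-08 10:16:40,789 INFO FSNamesystem.audit: allowed=false\tugi=bob (auth:SIMPLE)\tip=/10.0.0.9\tcmd=delete\tsrc=/data/a.txt\tdst=null\tperm=null",
]


def parse_audit_line(line):
    """Return the key=value fields of one audit entry as a dict."""
    # Everything after "audit: " is assumed to be tab-separated key=value pairs.
    _, _, payload = line.partition("audit: ")
    fields = {}
    for pair in payload.strip().split("\t"):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that are not key=value
            fields[key] = value
    return fields


def commands_by_user(lines):
    """Count (user, command) pairs across audit entries."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if "ugi" in fields and "cmd" in fields:
            # ugi looks like "alice (auth:SIMPLE)"; keep only the user name.
            user = fields["ugi"].split(" ")[0]
            counts[(user, fields["cmd"])] += 1
    return counts


if __name__ == "__main__":
    for (user, cmd), n in sorted(commands_by_user(SAMPLE_LINES).items()):
        print(user, cmd, n)
```

Filtering on allowed=false in the same loop is an easy way to surface denied operations.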

The Apache Software Foundation blog has a post from Alibaba about their usage of HBase, which powers a number of different applications in their e-commerce search infrastructure. Recently, they undertook an initiative to backport a patch to improve throughput and reduce garbage collection by avoiding the heap when reading data from the HBase Level 2 Bucket Cache. The post has more details on the internals of HBase caching, and it shares the type of improvement they saw from the implementation.

https://blogs.apache.org/hbase/entry/offheap-read-path-in-production

The AWS Big Data blog has a post that puts together Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight to analyze VPC Flow Logs. If you've ever tried to analyze these logs by hand like I have, then you're probably excited by the potential of this setup.

https://aws.amazon.com/blogs/big-data/analyzing-vpc-flow-logs-with-amazon-kinesis-firehose-amazon-athena-and-amazon-quicksight/
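
For a sense of what analyzing these logs by hand involves, here is a minimal Python sketch (not from the AWS post) that parses records in the default space-separated version 2 flow log format and counts rejected connections per source address. The sample records are illustrative, and the helper names parse_record and rejected_by_source are hypothetical.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Log record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

# Illustrative sample records, not real traffic.
SAMPLE_RECORDS = [
    "2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-abc123de 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-abc123de - - - - - - - 1431280876 1431280934 - NODATA",
]


def parse_record(line):
    """Split one flow log record into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_by_source(lines):
    """Count REJECT records per source address."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        # NODATA and SKIPDATA records carry "-" as the action and are ignored.
        if record["action"] == "REJECT":
            counts[record["srcaddr"]] += 1
    return counts


if __name__ == "__main__":
    for srcaddr, n in rejected_by_source(SAMPLE_RECORDS).most_common():
        print(srcaddr, n)
```

This works for a handful of records, but it is exactly the kind of ad hoc scripting that a managed query setup like the one in the post replaces.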

This post has an in-depth look at YARN node labels. It describes how they work at a high level, how they relate to the capacity scheduler's job queues, the settings that can be used to adjust how labels work with queues, and more.

https://developer.ibm.com/hadoop/2017/03/10/yarn-node-labels/

News

Rumors are swirling that Cloudera is about to announce their IPO.

https://www.bloomberg.com/news/articles/2017-03-09/cloudera-said-to-tap-morgan-stanley-jpmorgan-bofa-for-ipo

Confluent announced that they've raised a $50 million series C. The post by Confluent CEO Jay Kreps describes what Confluent has been up to over the past year.

https://www.confluent.io/blog/the-arrival-of-streaming-confluent-raises-50-million-series-c/

Releases

Apache Ignite, the in-memory data grid, released version 1.9 with support for Kubernetes, better API support for .NET and C++, improved compatibility with Spark, and more.

https://blogs.apache.org/ignite/entry/apache-ignite-1-9-released

Unravel has announced a new version of their platform which includes an Application Performance Monitoring component. They're touting it as the first such system for big data.

http://sdtimes.com/unravel-data-launches-version-4-0-industry-first-intelligent-application-performance-management-platform-big-data/

Pepperdata also announced a new application profiler tool for Hadoop based on the open-source Dr. Elephant.

http://www.pepperdata.com/newsroom/pr_030717/

The new version of BlueData EPIC supports hybrid clouds where some data/clusters are in AWS and some are on-premises.

https://www.bluedata.com/blog/2017/03/big-data-analytics-hybrid-cloud/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

SF Spark+SF Hadoop Joint Meetup at Strata: Apache Spot and Kudu (San Francisco) - Monday, March 13
https://www.meetup.com/hadoopsf/events/237145958/

Airflow Meetup (San Jose) - Tuesday, March 14
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/237412864/

Apache Kafka Meetup at Strata San Jose (San Jose) - Tuesday, March 14
https://www.meetup.com/http-kafka-apache-org/events/237763007/

Big Data Science @ Strata + Hadoop World (San Jose) - Tuesday, March 14
https://www.meetup.com/Big-Data-Science/events/234127850/

Hadoop on Bare Metal + Data Prep with Spark (Irvine) - Wednesday, March 15
https://www.meetup.com/OCBigData/events/236400777/

Dr. Elephant Meetup: Strata + Hadoop World Edition (Sunnyvale) - Thursday, March 16
https://www.meetup.com/Dr-Elephant-Meetup/events/238059233/

100% Streaming!!! Google Cloud + Apache Beam + Flink (San Francisco) - Thursday, March 16
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/233979047/

Ohio

Spark Meetup 2017 #1 (Mason) - Wednesday, March 15
https://www.meetup.com/Cincinnati-Apache-Spark-Meetup/events/238027342/

New Jersey

NiFi: Flow Programming and Financial Trade Routing (Hamilton) - Thursday, March 16
https://www.meetup.com/nj-hadoop/events/238211454/

CANADA

Spark on AWS (Montreal) - Wednesday, March 15
https://www.meetup.com/Montreal-Apache-Spark-Meetup/events/237863945/

COLOMBIA

Hadoop Deep Dive Part I: Motivation and Fundamentals (Medellin) - Tuesday, March 14
https://www.meetup.com/Arquitectura-Big-Data-Medellin/events/237844572/

FRANCE

Hadoop Meetup with Cloudera and Synaltic (Paris) - Wednesday, March 15
https://www.meetup.com/Hadoop-User-Group-France/events/237938710/

Apache Flink: Use Cases (Neuilly-Sur-Seine) - Wednesday, March 15
https://www.meetup.com/Paris-Apache-Flink-Meetup/events/237577105/

ROMANIA

10th BigData/DataScience Meetup (Cluj-Napoca) - Wednesday, March 15
https://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/237930652/

SINGAPORE

Spark Runbook: Learnings From 5 Years of Spark for Analytics (Singapore) - Monday, March 13
https://www.meetup.com/Spark-Singapore/events/237891148/