Data Eng Weekly


Hadoop Weekly Issue #200

15 January 2017

This week's issue is #200 and also marks nearly four years of Hadoop Weekly. I certainly couldn't have predicted where the Hadoop ecosystem would be today, but it's clear that Hadoop has staying power. Who knows where things will be in another 200 issues... but chances are we'll still be talking about YARN, Kafka, and Spark (among others), all of which are covered in this week's content.

Technical

This article provides an in-depth tutorial walking through the steps of configuring an Apache Hadoop cluster using Apache Ambari. Using LXC and password-less SSH makes for an easy way to do a proof-of-concept.

http://www.themiddlewareshop.com/2017/01/05/creating-a-hadoop-cluster-using-ambari/

The morning paper covered the 2013 publication on YARN from the ACM Symposium on Cloud Computing. If you're unfamiliar with the history/motivation of YARN and some of the core concepts, this is a great recap of those pieces.

https://blog.acolyer.org/2017/01/09/apache-hadoop-yarn-yet-another-resource-negotiator/

The MapR blog has a two-part series on integrating complex event processing into a streaming architecture using the open-source Drools engine. As an example use-case, the tutorial has a script for generating synthetic sensor data related to road traffic. Data is ingested using StreamSets and can be visualized using Kibana.

https://www.mapr.com/blog/better-complex-event-processing-scale-using-microservices-based-streaming-architecture-part-1
https://www.mapr.com/blog/real-time-smart-city-traffic-monitoring-using-microservices-based-streaming-architecture-part-2

Shasta is the name of Google's system for OLAP and OLTP atop their RDBMS (called F1) and other data systems. A key part of Shasta is the Relational View Language (RVL), which is much more flexible than SQL (supporting variables, aliases, and more). The morning paper has a summary of the Shasta paper, including the goals of Shasta and RVL as well as how the framework compares to other systems.

https://blog.acolyer.org/2017/01/10/shasta-interactive-reporting-at-scale/

Stream processing system Apache Apex recently added support for SQL by integrating with Apache Calcite. This post describes the integration, explains how to use it, and provides an example code snippet that runs a basic SQL statement.

https://www.datatorrent.com/blog/sql-apache-apex/

In a meta getting started guide for Apache Kafka, this post points to several different pieces of documentation for someone who wants to learn Kafka. There are collections of resources for software engineers, sys admins, data engineers, and more. The post includes some useful experiments to test how well you've understood the getting started content.

https://www.confluent.io/blog/apache-kafka-getting-started/

News

"Practical Data Science with Hadoop and Spark" was published towards the end of last year and is now available on Amazon, Barnes & Noble, and more.

http://www.clustermonkey.net/practical-data-science-with-hadoop-and-spark/

Hortonworks has posted their 2016 year in review, which covers their vendor partners, product updates, cloud initiatives, and business numbers. Of note, they now have over 1,000 employees and over 1,800 companies that have joined Partnerworks.

http://hortonworks.com/blog/hortonworks-2016-year-review/

Spark Summit East is in Boston in just under a month. The Databricks blog highlights five talks from the schedule.

https://databricks.com/blog/2017/01/09/5-cant-miss-talks-at-spark-summit-east-2017.html

Apache Eagle, which is a Hadoop ecosystem project that's flown a bit under my radar, was promoted to be a top-level Apache project this week. Using metrics and logs, Eagle provides security and performance analysis of Hadoop, Spark, YARN, and more.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces1

Apache Beam also graduated from the Apache incubator this week. Beam was under incubation for just under a year, and has undergone a few releases in that time.

https://beam.apache.org/blog/2017/01/10/beam-graduates.html

ZDNet has an article that provides some more background on Apache Beam, including its origins at Google. It highlights that while Beam supports Spark Streaming as an execution engine, it's also a potential competitor to it.

http://www.zdnet.com/article/apache-beam-and-spark-new-coopetition-for-squashing-the-lambda-architecture/

This post provides a brief synopsis of 11 books on Apache Spark.

http://blog.matthewrathbone.com/2017/01/13/spark-books.html

Releases

Uber has open-sourced Cherami, their message queue system. It fills a similar role to Amazon SQS (having replaced Celery at Uber) for running tasks. The introductory post describes the architecture and design trade-offs in detail. Notably, it uses RocksDB for storage and Apache Cassandra for storing metadata.

https://eng.uber.com/cherami/
https://github.com/uber/cherami-server

In addition to graduating from the incubator, Apache Beam announced version 0.4.0 this week. The new release adds support for Apache Apex, which is currently supported in embedded (single JVM) mode, with YARN support planned as a fast follow-on. The release also contains quite a few bug fixes and improvements.

https://beam.apache.org/blog/2017/01/09/added-apex-runner.html
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Kafka Meetup with Walmart Labs and Confluent (Sunnyvale) - Wednesday, January 18
https://www.meetup.com/http-kafka-apache-org/events/236101053/

Apache Edgent: Streaming and Analytics for IoT Devices (San Francisco) - Wednesday, January 18
https://www.meetup.com/SF-Data-Science/events/236682354/

gRPC, Kubernetes, Mesos, Spark ML, Tensorflow, HDFS, Kafka (San Francisco) - Thursday, January 19
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/227622666/

Washington

Apache Spark Lightning Talks (Seattle) - Tuesday, January 17
https://www.meetup.com/Seattle-Data-Science/events/236720653/

Illinois

Fast Data: Selecting the Right Streaming Technologies for Never-Ending Data Sets (Chicago) - Tuesday, January 17
https://www.meetup.com/ChicagoRealTimeStreamingAnalytics/events/236402865/

Spark Discussion with Dr. Alex Liu, IBM's Chief Data Scientist (Chicago) - Tuesday, January 17
https://www.meetup.com/Chicago-Spark-Users/events/236219654/

Wisconsin

Apache Impala/IoT Data Analytics: Hype or Truly Transformative (Green Bay) - Tuesday, January 17
https://www.meetup.com/BAMDataScience/events/236659617/

Apache Spark Machine Learning Blueprints (Milwaukee) - Wednesday, January 18
https://www.meetup.com/MKE-Big-Data/events/236050498/

Georgia

Building Reactive Fast Data & The Data Lake with Akka, Kafka, Spark (Atlanta) - Tuesday, January 17
https://www.meetup.com/atlantajug/events/229088332/

Evolving Beyond the Data Lake (Atlanta) - Wednesday, January 18
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/236735574/

Virginia

A State of the Union Panel Discussion on Apache Spark (Tysons) - Thursday, January 19
https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/235953051/

New York

Fast Data/Event-Driven Architecture (New York) - Thursday, January 19
https://www.meetup.com/ThoughtWorks-Tech-Talks-NYC/events/236733021/

CANADA

Introduction to MapR (Toronto) - Thursday, January 19
https://www.meetup.com/Toronto-MapR-User-Group/events/231648976/

UNITED KINGDOM

Instrumenting Apache Kafka (London) - Wednesday, January 18
https://www.meetup.com/Apache-Kafka-London/events/235512402/

Staging Reactive Data Pipelines Using Kafka as the Backbone (Manchester) - Wednesday, January 18
https://www.meetup.com/manchester-geek-nights/events/236594496/

FRANCE

Meetup #2 (Villeneuve d’Ascq) - Thursday, January 19
https://www.meetup.com/Lille-Big-Data-and-Machine-Learning-Meetup/events/235857735/

BELGIUM

Running Kafka in Production and Streamsets (Brussels) - Wednesday, January 18
https://www.meetup.com/StreamProcessing-be/events/234482615/

NETHERLANDS

Sensor Data Ingestion and Processing with NiFi and Spark (Amsterdam) - Tuesday, January 17
https://www.meetup.com/futureofdata-amsterdam/events/236256466/

GERMANY

How to Apply Machine Learning to Real-Time Processing (Nuremberg) - Tuesday, January 17
https://www.meetup.com/Nuernberg-Big-Data/events/236395922/

CZECH REPUBLIC

Reintroduction to Hadoop Zoo (Prague) - Thursday, January 19
https://www.meetup.com/CS-HUG/events/236695751/

POLAND

Introduction to the Hadoop Ecosystem (Warsaw) - Thursday, January 19
https://www.meetup.com/warsaw-hug/events/236793078/

ISRAEL

Airflow & Luigi: The Flow of Your Data (Tel Aviv-Yafo) - Wednesday, January 18
https://www.meetup.com/Big-things-are-happening-here/events/236091679/