15 January 2017
This week's issue is #200 and also marks nearly four years of Hadoop Weekly. I certainly couldn't have predicted where the Hadoop ecosystem would be today, but it's clear that Hadoop has staying power. Who knows where things will be in another 200 issues... but chances are we'll still be talking about YARN, Kafka, and Spark (among others), all of which are covered in this week's content.
This article provides an in-depth tutorial walking through the steps of configuring an Apache Hadoop cluster using Apache Ambari. Using LXC and password-less ssh makes for an easy way to do a proof-of-concept.
The morning paper covered the 2013 publication on YARN from the ACM Symposium on Cloud Computing. If you're unfamiliar with the history/motivation of YARN and some of the core concepts, this is a great recap of those pieces.
The MapR blog has a two-part series on integrating complex event processing into a streaming architecture using the open-source Drools engine. As an example use case, the tutorial has a script for generating synthetic sensor data related to road traffic. Data is ingested using StreamSets and can be visualized using Kibana.
Shasta is the name of Google's system for OLAP and OLTP atop their RDBMS (called F1) and other data systems. A key part of Shasta is the Relational View Language, which is much more flexible than SQL (supporting variables, aliases and more). The morning paper has a summary of the Shasta paper, including the goals of Shasta and RVL as well as how the framework compares to other systems.
Stream processing system Apache Apex recently added support for SQL by integrating with Apache Calcite. This post describes the integration, explains how to use it, and provides an example code snippet that runs a basic SQL statement.
In a meta getting started guide for Apache Kafka, this post points to several different pieces of documentation for someone who wants to learn Kafka. There are collections of resources for software engineers, sys admins, data engineers, and more. The post includes some useful experiments to test how well you've understood the getting started content.
"Practical Data Science with Hadoop and Spark" was published towards the end of last year and is now available on Amazon, Barnes & Noble, and more.
Hortonworks has posted their 2016 year in review, which covers their vendor partners, product updates, cloud initiatives, and business numbers. Of note, they now have over 1,000 employees and over 1,800 companies that have joined Partnerworks.
Spark Summit East is in Boston in just under a month. The Databricks blog highlights five talks from the schedule.
Apache Eagle, which is a Hadoop ecosystem project that's flown a bit under my radar, was promoted to be a top-level Apache project this week. Using metrics and logs, Eagle provides security and performance analysis of Hadoop, Spark, YARN, and more.
Apache Beam also graduated from the Apache incubator this week. Beam was under incubation for just under a year, and has undergone a few releases in that time.
ZDNet has an article that provides some more background on Apache Beam, including its origins at Google. It highlights that while Beam supports Spark Streaming as an execution engine, it's also a potential competitor to it.
This post provides a brief synopsis of 11 books on Apache Spark.
Uber has open-sourced Cherami, their message queue system. It fills a role similar to Amazon SQS for running tasks, and it replaced Celery at Uber. The introductory post describes the architecture and design trade-offs in detail. Notably, it uses RocksDB for storage and Apache Cassandra for storing metadata.
In addition to graduating from the incubator, Apache Beam announced version 0.4.0 this week. The new release adds support for Apache Apex, which is currently supported in embedded (single JVM) mode, with YARN support planned as a fast follow-on. The release also contains quite a few bug fixes and improvements.
Curated by Datadog ( http://www.datadog.com )
Apache Kafka Meetup with Walmart Labs and Confluent (Sunnyvale) - Wednesday, January 18
Apache Edgent: Streaming and Analytics for IoT Devices (San Francisco) - Wednesday, January 18
gRPC, Kubernetes, Mesos, Spark ML, Tensorflow, HDFS, Kafka (San Francisco) - Thursday, January 19
Apache Spark Lightning Talks (Seattle) - Tuesday, January 17
Fast Data: Selecting the Right Streaming Technologies for Never-Ending Data Sets (Chicago) - Tuesday, January 17
Spark Discussion with Dr. Alex Liu, IBM's Chief Data Scientist (Chicago) - Tuesday, January 17
Apache Impala/IoT Data Analytics: Hype or Truly Transformative (Green Bay) - Tuesday, January 17
Apache Spark Machine Learning Blueprints (Milwaukee) - Wednesday, January 18
Building Reactive Fast Data & The Data Lake with Akka, Kafka, Spark (Atlanta) - Tuesday, January 17
Evolving Beyond the Data Lake (Atlanta) - Wednesday, January 18
A State of the Union Panel Discussion on Apache Spark (Tysons) - Thursday, January 19
Fast Data/Event-Driven Architecture (New York) - Thursday, January 19
Introduction to MapR (Toronto) - Thursday, January 19
Instrumenting Apache Kafka (London) - Wednesday, January 18
Staging Reactive Data Pipelines Using Kafka as the Backbone (Manchester) - Wednesday, January 18
Meetup #2 (Villeneuve d’Ascq) - Thursday, January 19
Running Kafka in Production and Streamsets (Brussels) - Wednesday, January 18
Sensor Data Ingestion and Processing with NiFi and Spark (Amsterdam) - Tuesday, January 17
How to Apply Machine Learning to Real-Time Processing (Nuremberg) - Tuesday, January 17
Reintroduction to Hadoop Zoo (Prague) - Thursday, January 19
Introduction to the Hadoop Ecosystem (Warsaw) - Thursday, January 19
Airflow & Luigi: The Flow of Your Data (Tel Aviv-Yafo) - Wednesday, January 18