Data Eng Weekly


Hadoop Weekly Issue #187

18 September 2016

There's a lot of variety in this week's issue—Kafka, NiFi, Spark, HDFS, Impala and more are all covered in technical articles. And there are a number of releases including a new distributed messaging system, Pulsar, which was recently open-sourced by Yahoo.

Technical

This post describes how to monitor Apache Kafka with Grafana and InfluxDB. jmxtrans is used to deliver metrics from Kafka to Influx.

https://softwaremill.com/monitoring-apache-kafka-with-influxdb-grafana/

The Hortonworks blog has an introduction to the new zero-master clustering that is part of Apache NiFi 1.0.0. Built on Apache ZooKeeper, the clustering mechanism creates a Cluster Coordinator from a pool of servers rather than explicitly designating a master.

http://hortonworks.com/blog/apache-nifi-1-0-0-zero-master-clustering/

The Cloudera blog has a post that demonstrates how to use Apache Spark for text analysis. It looks at the frequency of letters in an encoded message and at mechanisms for decoding it.

http://blog.cloudera.com/blog/2016/09/solving-real-life-mysteries-with-big-data-and-apache-spark/
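
As a rough illustration of the letter-frequency idea, here is a minimal sketch in plain Python rather than Spark. It assumes the message was encoded with a simple Caesar shift and that the most common letter stands for "e"; both are assumptions made for this example and may not match the cipher or approach in the post. All function names are hypothetical.

```python
import string
from collections import Counter


def letter_frequencies(text):
    """Return each ASCII letter's share of all ASCII letters in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}


def guess_caesar_shift(ciphertext):
    """Guess the shift by assuming the most frequent letter encodes 'e'."""
    freqs = letter_frequencies(ciphertext)
    most_common = max(freqs, key=freqs.get)
    return (ord(most_common) - ord("e")) % 26


def decode_caesar(ciphertext, shift):
    """Shift every ASCII letter back by `shift`; leave other characters alone."""
    out = []
    for ch in ciphertext:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)
```

Calling `decode_caesar(message, guess_caesar_shift(message))` attempts a decode in one step. The guess is only as good as the assumption about "e", so short or unusual messages can defeat it.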

The Altiscale blog describes the architecture behind the new and updated HDFS web UI. In addition to a modern look and feel, the browser-based application is powered by REST APIs rather than JSP.

https://www.altiscale.com/blog/improving-the-hadoop-dfs-web-ui/

This tutorial describes how to use MySQL triggers with Apache NiFi to capture inserts and updates to rows in a MySQL table. With NiFi, that information is sent to Apache Hive for analysis.

http://hortonworks.com/blog/change-data-capture-using-nifi/

The AWS big data blog has a post describing key concepts in Kinesis SQL analytics—particularly windowing and time semantics. The post also includes a tutorial for running a Kinesis analytics application that outputs results to Redshift.

http://blogs.aws.amazon.com/bigdata/post/Tx3QZUFCI71QTOZ/Writing-SQL-on-Streaming-Data-with-Amazon-Kinesis-Analytics-Part-2
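
To make the windowing idea concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python. It is not Kinesis Analytics SQL and does not use any AWS API; the function name, the `(timestamp, payload)` event shape, and the use of integer event-time seconds are all assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed window, keyed by the window's start time.

    `events` is an iterable of (timestamp_seconds, payload) pairs, where the
    timestamp is the event time carried by the record itself.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Each event falls into exactly one window: [start, start + window_seconds).
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))
```

Because the window is derived from each record's own timestamp, late or out-of-order events still land in the window they belong to. A streaming engine additionally has to decide when a window is complete, which this batch-style sketch ignores.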

Cloudera has posted an article analyzing the speed and cost of Apache Impala (incubating) queries over data stored on EBS volumes vs. S3. In short, EBS is more performant (2.4x in Cloudera's testing) and cheaper (although not by much), but S3 remains a good option for ad hoc analysis using transient clusters.

http://blog.cloudera.com/blog/2016/09/apache-impala-incubating-on-amazon-performance-and-cost-considerations-for-s3-vs-ebs/

Apache Phoenix, the SQL engine for Apache HBase, includes a number of new features in version 4.8.0. Released just over a month ago, the new version adds support for namespace mapping and a new integration with Apache Hive.

http://phoenix.apache.org/namspace_mapping.html
https://phoenix.apache.org/hive_storage_handler.html

Confluent's Log Compaction series has news on upcoming Apache Kafka releases (including details on a new time-based release plan). In addition, there are links to several recent presentations and articles on Kafka.

http://www.confluent.io/blog/log-compaction-highlights-apache-kafka-stream-processing-community-september-2016/

News

Hivemall, which is a machine learning library for Apache Hive, has entered the Apache incubator.

http://incubator.apache.org/projects/hivemall.html

"Fast Data Architectures for Stream Processing Applications" is a free (behind an email-wall) O'Reilly report on the state of stream processing.

https://www.lightbend.com/blog/fast-data-architectures-for-streaming-applications-free-oreilly-mini-book-by-dean-wampler

Releases

Pulsar is a recently open-sourced distributed messaging system from Yahoo. The post looks at its architecture and performance goals/characteristics. Pulsar is widely used at Yahoo: over 100 billion messages/day are published to its brokers across clusters in over 10 data centers.

https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale

Altiscale has announced support for Apache Spark 2.0.

https://www.altiscale.com/blog/apache-spark-2-0-now-available/

pig-eclipse, the Eclipse plugin for Apache Pig, has announced a 1.1.2 release that adds support for Apache Pig 0.15 and 0.16, information when hovering over Pig keywords, and a bug fix for auto-complete.

https://github.com/eyala/pig-eclipse#latest-release

Hortonworks DataFlow 2.0 is now generally available. The post highlights a number of improvements in the release, which is based on Apache NiFi 1.0.

http://hortonworks.com/blog/hortonworks-dataflow-2-0-ga/

Apache HBase 1.2.3 was released. The maintenance release resolves 49 issues.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201609.mbox/%3CCADcMMgF_2Qyb-QwGW%2BecZVXK9u_rCKVXKBYNwSpg-%2B1EHrgs7g%40mail.gmail.com%3E

Apache Storm 0.10.2 was released to address several issues in the 0.10.x line.

http://storm.apache.org/2016/09/14/storm0102-released.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

#OCBigData Meetup #19 (Irvine) - Wednesday, September 21
http://www.meetup.com/OCBigData/events/233770322/

Talk Night at Carbon Five LA (Santa Monica) - Wednesday, September 21
http://www.meetup.com/Hack-Night-at-Carbon-Five-LA/events/231134290/

Airflow Meetup (San Francisco) - Wednesday, September 21
http://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/233316814/

Data Science at Scale with HAWQ and MADlib and Hadoop (San Francisco) - Wednesday, September 21
http://www.meetup.com/futureofdata-sanfrancisco/events/232976655/

Uber Engineering Tech Talk Series (San Francisco) - Thursday, September 22
http://www.meetup.com/UberEvents/events/233955417/

Data Science at Scale with HAWQ and MADlib and Hadoop (Santa Clara) - Thursday, September 22
http://www.meetup.com/futureofdata-siliconvalley/events/232976650/

Washington

IoT with Spark & Scala Spark Datasets (Seattle) - Thursday, September 22
http://www.meetup.com/Seattle-Spark-Meetup/events/229483070/

Texas

Integrating Real-time Data Streams with Spark & Kafka (Plano) - Tuesday, September 20
http://www.meetup.com/Dallas-Big-Data-Science/events/233620352/

Wisconsin

Apache Kudu and Watson Analytics (Green Bay) - Tuesday, September 20
http://www.meetup.com/BAMDataScience/events/229518605/

Ohio

Apache Drill: Just Query It (Columbus) - Thursday, September 22
http://www.meetup.com/Columbus-HUG/events/233132881/

Georgia

Deep Dive with Spark Contributor Chris Fregly (Atlanta) - Thursday, September 22
http://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/233790875/

Virginia

Kick Off the Fall with DC Spark! (McLean) - Thursday, September 22
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/233557112/

New York

Open House: Big Data Processing with Apache Spark (New York) - Monday, September 19
http://www.meetup.com/Metis-New-York-Data-Science/events/233615253/

Twitter Heron in Practice (New York) - Thursday, September 22
http://www.meetup.com/New-York-City-Storm-User-Group/events/233508409/

CANADA

Streaming in the Time of Spark 2.0 (Montreal) - Wednesday, September 21
http://www.meetup.com/Montreal-Apache-Spark-Meetup/events/233714963/

UNITED KINGDOM

Big Data Architecture Patterns on Google Cloud Platform (Reading) - Wednesday, September 21
http://www.meetup.com/Big-Data-Thames-Valley-Meetup/events/233534739/

Fast Data with Apache Ignite & Apache Spark (London) - Thursday, September 22
http://www.meetup.com/Apache-Ignite-London/events/233394305/

GERMANY

How-to Spark (Hamburg) - Wednesday, September 21
http://www.meetup.com/digitaldojo/events/232049972/

September Meet-up (Karlsruhe) - Thursday, September 22
http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/233608992/

CZECH REPUBLIC

First Data: The Big Data Journey of the World’s Biggest Payment Processor (Prague) - Wednesday, September 21
http://www.meetup.com/CS-HUG/events/233936930/

POLAND

Apache Spark Streaming with Apache NiFi and Apache Kafka (Warsaw) - Wednesday, September 21
http://www.meetup.com/futureofdata-warsaw/events/233139200/

INDIA

Interactive Spark Scheduling with Azkaban (Bangalore) - Saturday, September 24
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/233870717/

CHINA

Shanghai Big Data Streaming (Shanghai) - Saturday, September 24
http://www.meetup.com/Shanghai-Big-Data-Streaming-Meetup/events/233935427/

AUSTRALIA

Spark Meetup September (Sydney) - Wednesday, September 21
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/233547400/