18 September 2016
There's a lot of variety in this week's issue—Kafka, NiFi, Spark, HDFS, Impala and more are all covered in technical articles. And there are a number of releases including a new distributed messaging system, Pulsar, which was recently open-sourced by Yahoo.
This post describes how to monitor Apache Kafka with Grafana and InfluxDB. jmxtrans is used to deliver metrics from Kafka to InfluxDB.
https://softwaremill.com/monitoring-apache-kafka-with-influxdb-grafana/
The Hortonworks blog has an introduction to the new zero-master clustering that is part of Apache NiFi 1.0.0. Built on Apache ZooKeeper, the clustering mechanism creates a Cluster Coordinator from a pool of servers rather than explicitly designating a master.
http://hortonworks.com/blog/apache-nifi-1-0-0-zero-master-clustering/
The Cloudera blog has a post that demonstrates how to use Apache Spark for text analysis. It looks at the frequency of letters in an encoded message and at mechanisms for decoding it.
http://blog.cloudera.com/blog/2016/09/solving-real-life-mysteries-with-big-data-and-apache-spark/
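To make the idea of letter-frequency analysis concrete, here is a minimal sketch in plain Python (not Spark, and not the code from the post). It assumes the message was encoded with a simple Caesar shift and that 'e' is the most common plaintext letter; both are assumptions for illustration, and the function names are hypothetical.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase


def letter_frequencies(text):
    """Return each ASCII letter's share of all letters in text (case-insensitive)."""
    letters = [ch for ch in text.lower() if ch in ALPHABET]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}


def shift_text(text, shift):
    """Shift every ASCII letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_letters:
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def guess_caesar_shift(ciphertext):
    """Guess the shift by assuming the most frequent letter stands for 'e'."""
    freqs = letter_frequencies(ciphertext)
    if not freqs:
        return 0
    most_common = max(freqs, key=freqs.get)
    return (ord(most_common) - ord("e")) % 26
```

A guessed shift can be undone with `shift_text(ciphertext, -guess_caesar_shift(ciphertext))`. Short messages often break the 'e' assumption, so the guess is only a starting point.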
The Altiscale blog describes the architecture behind the new and updated HDFS web UI. In addition to a modern look and feel, the browser-based application is powered by REST APIs rather than JSP.
https://www.altiscale.com/blog/improving-the-hadoop-dfs-web-ui/
This tutorial describes how to use MySQL triggers with Apache NiFi to capture inserts and updates to rows in a MySQL table. NiFi then sends that information to Apache Hive for analysis.
http://hortonworks.com/blog/change-data-capture-using-nifi/
The AWS big data blog has a post describing key concepts in Kinesis SQL analytics—particularly windowing and time semantics. The post also includes a tutorial for running a Kinesis analytics application that outputs results to Redshift.
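As a rough illustration of windowing, the sketch below groups events into fixed, non-overlapping (tumbling) windows by event timestamp and counts them per key. It is plain Python rather than Kinesis analytics SQL, and the function name and the event shape `(timestamp_seconds, key)` are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs, where the
    timestamp is the event time carried by the record itself.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Round the event time down to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "a"), (5, "a"), (10, "b"), (12, "a"), (19, "b")]
counts = tumbling_window_counts(events, 10)
```

Because the window is derived from each record's own timestamp, late or out-of-order records still land in the window they belong to. Assigning windows by arrival time instead would give different results, which is why time semantics matter.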
Cloudera has posted an article analyzing the speed and cost of Apache Impala (incubating) queries over data stored on EBS volumes vs. S3. In short, EBS is more performant (2.4x in Cloudera's testing) and cheaper (although not by much), but S3 remains a good option for ad hoc analysis using transient clusters.
Apache Phoenix, the SQL engine for Apache HBase, includes a number of new features in version 4.8.0. Released just over a month ago, the new version adds support for namespace mapping and a new integration with Apache Hive.
http://phoenix.apache.org/namspace_mapping.html
https://phoenix.apache.org/hive_storage_handler.html
Confluent's Log Compaction series has news on upcoming Apache Kafka releases (including details on a new time-based release plan). In addition, there are links to several recent presentations and articles on Kafka.
Hivemall, which is a machine learning library for Apache Hive, has entered the Apache incubator.
http://incubator.apache.org/projects/hivemall.html
"Fast Data Architectures for Stream Processing Applications" is a free (behind an email-wall) O'Reilly report on the state of stream processing.
Pulsar is a recently open-sourced distributed messaging system from Yahoo. The post looks at its architecture and performance goals/characteristics. Pulsar is widely used at Yahoo: over 100 billion messages/day are published to its brokers across clusters in over 10 data centers.
https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale
Altiscale has announced support for Apache Spark 2.0.
https://www.altiscale.com/blog/apache-spark-2-0-now-available/
The Eclipse plugin for Apache Pig, pig-eclipse, has announced a 1.1.2 release, which adds support for Apache Pig 0.15 and 0.16, information when hovering over Pig keywords, and a bug fix for auto-complete.
https://github.com/eyala/pig-eclipse#latest-release
Hortonworks DataFlow 2.0 is now generally available. The post highlights a number of improvements in the release, which is based on Apache NiFi 1.0.
http://hortonworks.com/blog/hortonworks-dataflow-2-0-ga/
Apache HBase 1.2.3 was released. The maintenance release resolves 49 issues.
Apache Storm 0.10.2 was released to address several issues in the 0.10.x line.
http://storm.apache.org/2016/09/14/storm0102-released.html
Curated by Datadog ( http://www.datadog.com )
#OCBigData Meetup #19 (Irvine) - Wednesday, September 21
http://www.meetup.com/OCBigData/events/233770322/
Talk Night at Carbon Five LA (Santa Monica) - Wednesday, September 21
http://www.meetup.com/Hack-Night-at-Carbon-Five-LA/events/231134290/
Airflow Meetup (San Francisco) - Wednesday, September 21
http://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/233316814/
Data Science at Scale with HAWQ and MADlib and Hadoop (San Francisco) - Wednesday, September 21
http://www.meetup.com/futureofdata-sanfrancisco/events/232976655/
Uber Engineering Tech Talk Series (San Francisco) - Thursday, September 22
http://www.meetup.com/UberEvents/events/233955417/
Data Science at Scale with HAWQ and MADlib and Hadoop (Santa Clara) - Thursday, September 22
http://www.meetup.com/futureofdata-siliconvalley/events/232976650/
IoT with Spark & Scala Spark Datasets (Seattle) - Thursday, September 22
http://www.meetup.com/Seattle-Spark-Meetup/events/229483070/
Integrating Real-time Data Streams with Spark & Kafka (Plano) - Tuesday, September 20
http://www.meetup.com/Dallas-Big-Data-Science/events/233620352/
Apache Kudu and Watson Analytics (Green Bay) - Tuesday, September 20
http://www.meetup.com/BAMDataScience/events/229518605/
Apache Drill: Just Query It (Columbus) - Thursday, September 22
http://www.meetup.com/Columbus-HUG/events/233132881/
Deep Dive with Spark Contributor Chris Fregly (Atlanta) - Thursday, September 22
http://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/233790875/
Kick Off the Fall with DC Spark! (McLean) - Thursday, September 22
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/233557112/
Open House: Big Data Processing with Apache Spark (New York) - Monday, September 19
http://www.meetup.com/Metis-New-York-Data-Science/events/233615253/
Twitter Heron in Practice (New York) - Thursday, September 22
http://www.meetup.com/New-York-City-Storm-User-Group/events/233508409/
Streaming in the Time of Spark 2.0 (Montreal) - Wednesday, September 21
http://www.meetup.com/Montreal-Apache-Spark-Meetup/events/233714963/
Big Data Architecture Patterns on Google Cloud Platform (Reading) - Wednesday, September 21
http://www.meetup.com/Big-Data-Thames-Valley-Meetup/events/233534739/
Fast Data with Apache Ignite & Apache Spark (London) - Thursday, September 22
http://www.meetup.com/Apache-Ignite-London/events/233394305/
How-to Spark (Hamburg) - Wednesday, September 21
http://www.meetup.com/digitaldojo/events/232049972/
September Meet-up (Karlsruhe) - Thursday, September 22
http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/233608992/
First Data: The Big Data Journey of the World’s Biggest Payment Processor (Prague) - Wednesday, September 21
http://www.meetup.com/CS-HUG/events/233936930/
Apache Spark Streaming with Apache NiFi and Apache Kafka (Warsaw) - Wednesday, September 21
http://www.meetup.com/futureofdata-warsaw/events/233139200/
Interactive Spark Scheduling with Azkaban (Bangalore) - Saturday, September 24
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/233870717/
Shanghai Big Data Streaming (Shanghai) - Saturday, September 24
http://www.meetup.com/Shanghai-Big-Data-Streaming-Meetup/events/233935427/
Spark Meetup September (Sydney) - Wednesday, September 21
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/233547400/