26 June 2016
Hadoop Summit is this week in San Jose, so expect to see lots of announcements and presentations (please send any relevant slides my way!) in next week's issue. For this week's newsletter, there are some great posts on Kafka Streams, streaming data to Google BigQuery from Amazon Kinesis, and Google's Dataset Search system.
The Cloudera blog has a post that describes data analysis on fantasy sports data with Apache Spark, Apache Impala (incubating), and Hue. The post mainly focuses on the analysis, but it has a bit of Spark code and demonstrates some of the functionality available in Hue.
KDnuggets has an article that goes through 13 of the main APIs/projects/terms related to Apache Spark. These include RDD, DataFrame, Dataset, structured streaming, GraphX, and Tungsten. There are a few paragraphs on each, which is enough to get a good overview of Spark's main features.
This post from the Confluent blog looks at a simple but non-trivial application of Kafka Streams. Specifically, Kafka Streams is used to write a program that joins user clickstream data with user location data. The latter is stored in a KTable, which provides an abstraction similar to a database table with a primary key (the latest value for each primary key is exposed via the APIs). The resulting program is quite simple: only a few lines of code.
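To make the "latest value per key" idea concrete, here is a minimal plain-Python sketch of a stream-table join. It does not use the Kafka Streams API; the class `LatestValueTable`, the function `join_clicks_with_locations`, and the event tuple layout are all hypothetical names chosen for illustration, and the join is modeled as a left join (clicks from users with no known location are still emitted).

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical event shape: (kind, user, payload), where kind is
# "location" (payload = region) or "click" (payload = page).
Event = Tuple[str, str, str]


class LatestValueTable:
    """Toy stand-in for a KTable: keeps only the latest value per key."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def upsert(self, key: str, value: str) -> None:
        # A newer record for the same key replaces the older one.
        self._rows[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._rows.get(key)


def join_clicks_with_locations(
    events: Iterable[Event],
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Join each click with the user's most recent location at that moment."""
    locations = LatestValueTable()
    for kind, user, payload in events:
        if kind == "location":
            locations.upsert(user, payload)
        elif kind == "click":
            # Left join: the region is None when the user has no location yet.
            yield user, payload, locations.get(user)


events = [
    ("location", "alice", "europe"),
    ("click", "alice", "/home"),
    ("location", "alice", "asia"),
    ("click", "alice", "/cart"),
    ("click", "bob", "/home"),
]
joined = list(join_clicks_with_locations(events))
```

Each click is enriched with whatever location was current when it arrived, so the second click from the same user picks up the updated region rather than the original one.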
The Cloudera blog has a post about meinstadt.de's anomaly detection system for HTTP requests that's built on Apache Flume, Apache Spark Streaming, and Apache Impala (incubating). The code to implement the framework is available on GitHub.
The AWS Big Data blog has a tutorial that shows how to process data from an Amazon Kinesis stream on an Amazon EMR cluster using Apache Spark and Apache Zeppelin. The post includes some example visualizations generated by executing SQL from the Zeppelin notebook.
Apache Kudu (incubating) is close to a 1.0 release that will fully support high availability. This post describes how the last piece of that puzzle, master replication, is implemented. The post also points folks to the JIRA issue that is tracking the work and gives a brief overview of the implementation and testing work that remains.
Google has over 26 billion datasets across all of their data platforms, and they're adding and removing 1.6 billion dataset paths every day. To track, search, and compare datasets, they've developed Google Dataset Search (GOODS). GOODS tracks metadata, which is exposed via an API, and can be used for search, monitoring, and more.
SiliconAngle has an interview with Hortonworks CEO Rob Bearden. Topics discussed include industry trends, Hortonworks financials, non-Hadoop tech at Hortonworks, and the Internet of Things.
Apache Sentry 1.7.0 was released this week with bug fixes, new features, and improvements. Among them, this release upgrades to v2 of the Hive authorization framework.
DataStax Enterprise 5.0, which is based on Apache Cassandra 3.0, adds support for graph data, tiered storage, and multi-instance for Cassandra. The release also includes additional security features like encryption and role-based access control.
Driven, the big data application performance monitoring system, has released version 2.2. The highlight of this release is general availability of support for Apache Spark in Driven.
BlueData has announced the release of their EPIC Enterprise Big Data as a Service product for Amazon Web Services. The software can be used for automatically provisioning Docker-based Hadoop clusters with a few clicks.
Apache Accumulo 1.7.2 was released. It includes bug fixes to write-ahead log handling, optimizations for RFiles, and minor performance improvements.
Versions 2.11.0 and 3.2.0 of Apache Curator, the high-level SDK for Apache ZooKeeper, have been released.
Apache Hive 2.1.0 was released. It includes a large number of bug fixes and improvements, including changes to Hive's LLAP (Live Long and Process) as well as JDBC support.
Curated by Datadog ( http://www.datadog.com )
Apache Metron Overview and Demo @ Hadoop Summit - Monday, June 27
Apache Accumulo Meetup at Hadoop Summit (San Jose) - Monday, June 27
Building Big Data Applications with Apache Beam and Apache Apex (San Jose) - Monday, June 27
Robust Stream Processing with Apache Flink and Flink-Htm (San Jose) - Monday, June 27
Apache Ambari Meetup at Hadoop Summit (San Jose) - Monday, June 27
War Stories of Making Software Work with Hadoop (San Jose) - Monday, June 27
NiFi Meetup at Hadoop Summit (San Jose) - Monday, June 27
Introduction to Spark In-Memory Computing (Durham) - Tuesday, June 28
How to Use Apache Ignite, In-Memory Data Fabric (New York) - Tuesday, June 28
Toronto Apache Spark #10 (Toronto) - Wednesday, June 29
Crowdmix: An Event-Based Social Music Platform & Kafka 0.10 New Features (London) - Tuesday, June 28
Introduction to Spark 2.0 (Bangalore) - Saturday, July 2
Shanghai BigData Streaming 3rd Meetup (Shanghai) - Saturday, July 2