Data Eng Weekly


Hadoop Weekly Issue #176

26 June 2016

Hadoop Summit is this week in San Jose, so expect to see lots of announcements and presentations (please send any relevant slides my way!) in next week's issue. For this week's newsletter, there are some great posts on Kafka Streams, streaming data to Google BigQuery from Amazon Kinesis, and Google's Dataset Search system.

Technical

Shine has written about how they use AWS Lambda and Amazon Kinesis, along with the Kinesis agent for the Apache web server, to move data from EC2 to Google BigQuery. The post has code snippets of Lambda functions (written in JavaScript), information on scale and cost, and a description of how to optimize transfer costs by gzip'ing data.

https://blog.shinetech.com/2016/06/21/kinesis-lambda-bigquery/
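
To make the gzip point concrete, here is a minimal sketch of compressing a batch of log records before shipping it across clouds. It is plain Python rather than the JavaScript Lambda code from the post, and the record fields and the `compress_batch` helper are invented for illustration; the actual savings depend entirely on your data.

```python
import gzip
import json


def compress_batch(records):
    """Serialize records as newline-delimited JSON and gzip the result.

    Returns (raw_bytes, gzipped_bytes) so the two sizes can be compared.
    """
    raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return raw, gzip.compress(raw)


# Hypothetical access-log records, made up for this example only.
records = [
    {"ip": "10.0.0.%d" % (i % 8), "path": "/index.html", "status": 200}
    for i in range(1000)
]

raw, packed = compress_batch(records)
print("raw bytes:", len(raw), "gzipped bytes:", len(packed))

# The receiving side restores the original payload with gzip.decompress.
restored = gzip.decompress(packed)
```

Repetitive text such as web server logs tends to compress well, which is why compressing before the transfer can reduce egress charges.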

The Cloudera blog has a post that describes data analysis on fantasy sports data with Apache Spark, Apache Impala (incubating), and Hue. The post mainly focuses on the analysis, but it has a bit of Spark code and demonstrates some of the functionality available in Hue.

http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-with-apache-spark-and-sql-part-2-data-exploration/

KDnuggets has an article that goes through 13 of the main APIs/projects/terms related to Apache Spark. These include RDD, DataFrame, Dataset, structured streaming, GraphX, and Tungsten. There are a few paragraphs on each, which is enough to get a good overview of Spark's main features.

http://www.kdnuggets.com/2016/06/spark-key-terms-explained.html

This post from the Confluent blog looks at a simple but non-trivial application of Kafka Streams. Specifically, Kafka Streams is used to write a program that joins user click stream data with user location data. The latter is stored in a KTable, which provides an abstraction similar to a database table with a primary key (the latest value for each primary key is exposed via the APIs). The resulting program is quite simple—only a few lines of code.

http://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-using-kafka-streams
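
For readers new to the table abstraction, the sketch below imitates the semantics of joining a click stream against a latest-value-per-key table. It is plain Python and does not use the Kafka Streams API; the event tuples, the `join_clicks_with_locations` name, and the "UNKNOWN" default for users without a known location are all assumptions made for illustration.

```python
def join_clicks_with_locations(events):
    """Replay interleaved events and yield (user, page, region) per click.

    `events` is an iterable of ("location", user, region) or
    ("click", user, page) tuples, in arrival order.
    """
    latest_location = {}  # the "table": user -> most recent region
    for kind, user, value in events:
        if kind == "location":
            # A newer value for the same key replaces the older one.
            latest_location[user] = value
        elif kind == "click":
            # Each click is enriched with the region known at that moment.
            yield user, value, latest_location.get(user, "UNKNOWN")


# Hypothetical events, made up for this example only.
events = [
    ("location", "alice", "europe"),
    ("click", "alice", "/home"),
    ("location", "alice", "asia"),
    ("click", "alice", "/cart"),
    ("click", "bob", "/home"),
]

joined = list(join_clicks_with_locations(events))
for row in joined:
    print(row)
```

The second click from alice picks up the updated region, which is the "latest value for each primary key" behavior described above.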

The Cloudera blog has a post about meinstadt.de's anomaly detection system for HTTP requests that's built on Apache Flume, Apache Spark Streaming, and Apache Impala (incubating). The code to implement the framework is available on GitHub.

http://blog.cloudera.com/blog/2016/06/how-to-detect-and-report-web-traffic-anomalies-in-near-real-time/

The AWS Big Data blog has a tutorial that shows how to process data from an Amazon Kinesis stream on an Amazon EMR cluster using Apache Spark and Apache Zeppelin. The post includes some example visualizations generated by executing SQL from the Zeppelin notebook.

http://blogs.aws.amazon.com/bigdata/post/Tx3K805CZ8WFBRP/Analyze-Realtime-Data-from-Amazon-Kinesis-Streams-Using-Zeppelin-and-Spark-Strea

Apache Kudu (incubating) is close to a 1.0 release that will fully support high availability. This post describes how the last piece of that puzzle, master replication, is implemented. The post also points folks to the JIRA issue that is tracking the work and gives a brief overview of the implementation and testing work that remains.

http://kudu.apache.org/2016/06/24/multi-master-1-0-0.html

Google has over 26 billion datasets across all of their data platforms, and they're adding and removing 1.6 billion dataset paths every day. To track, search, and compare datasets, they've developed the Google Dataset Search (GOODS). GOODS tracks metadata, which is exposed via an API, and can be used for search, monitoring, and more.

http://dl.acm.org/citation.cfm?id=2903730

News

SiliconAngle has an interview with Hortonworks CEO Rob Bearden. Topics discussed include industry trends, Hortonworks financials, non-Hadoop tech at Hortonworks, and the Internet of Things.

http://siliconangle.com/blog/2016/06/24/hadoop-and-beyond-a-conversation-with-hortonworks-ceo-rob-bearden/

Releases

Apache Sentry 1.7.0 was released this week with bug fixes, new features, and improvements. Among them, this release upgrades to v2 of the Hive authorization framework.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201606.mbox/%3CCAPOmu3sDqdzu9ntDSvkMaDRQnVfHrkGV5qhyh-ZRiMmwgMMvBA@mail.gmail.com%3E

DataStax Enterprise 5.0, which is based on Apache Cassandra 3.0, adds support for graph data, tiered storage, and multi-instance for Cassandra. The release also includes additional security features like encryption and role-based access control.

https://www.datastax.com/2016/06/introducing-datastax-enterprise-5-0

Driven, the big data application performance monitoring system, has released version 2.2. The highlight of this release is general availability of support for Apache Spark in Driven.

http://www.driven.io/2016/06/driven-inc-delivering-hadoop-spark-performance-monitoring-announces-driven-2-2/

BlueData has announced the release of their EPIC Enterprise Big Data as a Service product for Amazon Web Services. The software can be used for automatically provisioning Docker-based Hadoop clusters with a few clicks.

http://www.bluedata.com/blog/2016/06/big-data-as-a-service-on-prem-or-cloud-bdaas/

Apache Accumulo 1.7.2 was released. It includes bug fixes to write-ahead log handling, optimizations for RFiles, and minor performance improvements.

https://accumulo.apache.org/release_notes/1.7.2.html

Versions 2.11.0 and 3.2.0 of Apache Curator, the high-level SDK for Apache ZooKeeper, have been released.

https://cwiki.apache.org/confluence/display/CURATOR/Releases#Releases-June23,2016,Releases2.11.0and3.2.0available

Apache Hive 2.1.0 was released. It includes a large number of bug fixes and improvements, including changes to Hive's LLAP (Live Long and Process) as well as JDBC support.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201606.mbox/%3C7194557D-CB5E-45B7-B905-82F27B7CB33F@apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Metron Overview and Demo @ Hadoop Summit - Monday, June 27
http://www.meetup.com/futureofdata-sanfrancisco/events/230019237/

Apache Accumulo Meetup at Hadoop Summit (San Jose) - Monday, June 27
http://www.meetup.com/Accumulo-Users-DC/events/231397927/

Building Big Data Applications with Apache Beam and Apache Apex (San Jose) - Monday, June 27
http://www.meetup.com/Apex-Bay-Area-Chapter/events/231422630/

Robust Stream Processing with Apache Flink and Flink-Htm (San Jose) - Monday, June 27
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/231347668/

Apache Ambari Meetup at Hadoop Summit (San Jose) - Monday, June 27
http://www.meetup.com/Apache-Ambari-User-Group/events/231576067/

War Stories of Making Software Work with Hadoop (San Jose) - Monday, June 27
http://www.meetup.com/Open-Data-Platform-Group/events/231569465/

NiFi Meetup at Hadoop Summit (San Jose) - Monday, June 27
http://www.meetup.com/ApacheNiFi/events/231191180/

North Carolina

Introduction to Spark In-Memory Computing (Durham) - Tuesday, June 28
http://www.meetup.com/Big-Data-Developers-in-Raleigh/events/231744087/

New York

How to Use Apache Ignite, In-Memory Data Fabric (New York) - Tuesday, June 28
http://www.meetup.com/mysqlnyc/events/231578906/

CANADA

Toronto Apache Spark #10 (Toronto) - Wednesday, June 29
http://www.meetup.com/Toronto-Apache-Spark/events/231023863/

UNITED KINGDOM

Crowdmix: An Event-Based Social Music Platform & Kafka 0.10 New Features (London) - Tuesday, June 28
http://www.meetup.com/Apache-Kafka-London/events/231792705/

INDIA

Introduction to Spark 2.0 (Bangalore) - Saturday, July 2
http://www.meetup.com/Bangalore-Spark-Enthusiasts/events/231684993/

CHINA

Shanghai BigData Streaming 3rd Meetup (Shanghai) - Saturday, July 2
http://www.meetup.com/Shanghai-Big-Data-Streaming-Meetup/events/231831396/