Data Eng Weekly


Hadoop Weekly Issue #186

11 September 2016

Apache NiFi hit 1.0.0 this week and Apache Hadoop announced version 3.0.0-alpha1. In addition to those big releases, there were a number of minor releases, and there are several interesting technical and news articles.

Technical

This post looks at the Apache Beam and Apache Spark programming models. The author argues that Beam's programming model and optimization capabilities make it a strong choice for stream processing, and that the sweet spot for Spark (and something Beam doesn't support well) is iterative algorithms. The post includes code implementing the same non-trivial use-case in both frameworks.

http://www.unofficialgoogledatascience.com/2016/08/next-generation-tools-for-data-science.html

The Amazon Big Data blog has a post that describes how to use a number of their services (VPC Flow Logs, Lambda, and Elastic MapReduce) to analyze traffic inside a VPC. The actual analysis is implemented as two streaming MapReduce jobs written in Python and awk.

http://blogs.aws.amazon.com/bigdata/post/Tx2Y2H2DYGSEURN/Processing-VPC-Flow-Logs-with-Amazon-EMR
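
To give a feel for the streaming MapReduce style that post describes, here is a minimal Python sketch. It is not the code from the AWS post: the function names (map_record, reduce_bytes, bytes_by_source) are hypothetical, and it assumes records in the default VPC Flow Logs format (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status). It totals the bytes sent per source address.

```python
from collections import defaultdict

# Field positions in the default VPC Flow Logs record format (assumed):
# version account-id interface-id srcaddr dstaddr srcport dstport
# protocol packets bytes start end action log-status
SRCADDR, BYTES, LOG_STATUS = 3, 9, 13


def map_record(line):
    """Map step: return (srcaddr, bytes) for one record, or None to skip it."""
    fields = line.split()
    # Skip short/malformed lines and NODATA/SKIPDATA records, whose
    # traffic fields are "-" rather than numbers.
    if len(fields) < 14 or fields[LOG_STATUS] != "OK":
        return None
    return fields[SRCADDR], int(fields[BYTES])


def reduce_bytes(pairs):
    """Reduce step: sum the byte counts for each source address."""
    totals = defaultdict(int)
    for addr, nbytes in pairs:
        totals[addr] += nbytes
    return dict(totals)


def bytes_by_source(lines):
    """Run the map and reduce steps in-process over an iterable of lines."""
    mapped = (map_record(line) for line in lines)
    return reduce_bytes(pair for pair in mapped if pair is not None)
```

In a real streaming job the map and reduce steps would be separate scripts reading stdin and writing tab-separated key/value lines to stdout; they are combined in one process here only so the sketch can be run locally on a sample file.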

The Hortonworks blog has the first part in a two-part series on Apache Ranger's dynamic column masking and row-level filtering. Part one motivates the features with example use-cases.

http://hortonworks.com/blog/eyes-dynamic-column-masking-row-level-filtering-hdp2-5/

JVM startup time can be a major limitation when building a system for interactive queries. The Qubole blog describes how they solved this problem by using a Ruby client for invoking Presto queries.

http://www.qubole.com/blog/product/presto-ruby-client-in-qds/

Apache Ambari Views Server is a standalone server that doesn't manage a Hadoop cluster but provides access to Views (e.g. to browse the file system). This tutorial describes how to configure an Ambari Views Server in version 2.4.

http://hortonworks.com/blog/introduction-ambari-views-2-4-new-feature-remote-cluster-configuration/

Skool is a new open-source tool from BT for loading data into Hadoop. It automates the process of setting up a new database-to-HDFS import job by validating connectivity, doing a test import, and generating an Oozie workflow to schedule incremental imports. The introductory post has more details on the tool and plans for future features.

http://blog.cloudera.com/blog/2016/09/skool-an-open-source-data-integration-tool-for-apache-hadoop-from-british-telecom/

Using an example data set of mobile network data, this tutorial describes how to build a streaming pipeline using Spark Streaming and MapR Streams. The examples use the Kafka API, so they should be mostly applicable to a Kafka cluster as well.

https://www.mapr.com/blog/how-get-started-spark-streaming-and-mapr-streams-using-kafka-api

News

Hadoop Summit Melbourne was this week. This post contains an overview of the event and summaries of talks from Hortonworks, Telstra, Yahoo! Japan, and the Barcelona Supercomputing Centre.

http://blog.dbvisit.com/hadoop-summit-melbourne-hs16melb-day-one-reflections/?

insideBIGDATA has an interview with Justin Kestelyn of Cloudera in which they discuss Cloudera's focus and changes over the past year, Apache Kudu, the future of Cloudera, and more.

http://insidebigdata.com/2016/09/07/interview-justin-kestelyn-technical-evangelism-developer-relations-at-cloudera/

The speaker lineup for Crunch Conference, the data engineering/analytics conference, has been posted. The conference takes place in Budapest from October 5th through October 7th.

http://www.crunchconf.com/

Releases

Apache NiFi released version 1.0.0 with a refreshed UI, automatic cluster coordination, improved authorization, and better support for version control of data flow templates.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.0.0

Apache Flink 1.1.2 is a bugfix release that resolves 19 tickets.

http://flink.apache.org/news/2016/09/05/release-1.1.2.html

IBM announced a technical preview of IBM Open Platform with Apache Hadoop 4.3. The release includes Apache Spark 2.0.

https://developer.ibm.com/hadoop/2016/09/06/ibm-open-platform-with-apache-hadoop-and-biginsights-4-3-technical-preview/

Apache Accumulo 1.8.0, which resolves over 200 bugs and contains 75 improvements/new features, was released. Major changes include speedups (WAL rollover improvements, rate-limiting of major compactions), improved API support for RFiles, table sampling, upgrading to Apache Thrift 0.9.3, and the ability to run multiple tablet servers on a single node.

https://accumulo.apache.org/release_notes/1.8.0

Version 3.5.0 of the Malhar library for Apache Apex was released. The release includes a new windowing operator to fit with Apache Beam semantics, spillable data structures, a deduper task, and more.

https://blogs.apache.org/apex/entry/apache_apex_malhar_3_5

The Apache NiFi project announced a new release artifact, MiNiFi C++, which is a C++ library for collecting sensor data where the JVM isn't practical. Version 0.0.1 is now available.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201609.mbox/%3C754ED8E4-36FD-4930-ADB6-6A24CD4DE78F@apache.org%3E

Apache Storm has announced a maintenance release, version 0.9.7, which resolves issues with multi-lang support.

http://storm.apache.org/2016/09/07/storm097-released.html

Confluent announced version 3.0.1 of the Confluent Platform based on Apache Kafka 0.10.0.1.

http://www.confluent.io/blog/announcing-confluent-platform-3-0-1-apache-kafka-0-10-0-1/

Apache Hadoop has announced the first alpha release of Hadoop 3.0. As noted in the announcement, the release contains thousands of changes since the 2.7.x release. Major highlights (from the release overview) include a move to Java 8, support for erasure coding in HDFS, a new YARN Timeline Service, a shell script rewrite, support for more than two NameNodes, intra-datanode balancing, and more.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201609.mbox/%3CCAGB5D2ZSPgDZ1zn0D4b6dyCAkA90mT%2B6xAbe%2BCxS4e2D0QGZqw%40mail.gmail.com%3E

DataMountaineer, Confluent, and DataStax have announced a Certified DataStax connector for Kafka Connect.

http://www.confluent.io/blog/announcing-the-certified-datastax-connector

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

September 2016 Meetup (San Francisco) - Tuesday, September 13
http://www.meetup.com/hadoopsf/events/233741637/

Introduction to Succinct by UC Berkeley AmpLab (San Francisco) - Tuesday, September 13
http://www.meetup.com/SF-Big-Analytics/events/232782222/

Demo and Open House: Big Data Processing with Apache Spark (San Francisco) - Tuesday, September 13
http://www.meetup.com/Metis-San-Francisco-Data-Science/events/233765349/

Lambda-In-A-Box: Combining Spark and HBase (San Francisco) - Wednesday, September 14
http://www.meetup.com/sfjava/events/233806080/

Alluxio: Unifying APIs, Accelerating ML, & Enabling Cloud Architectures (San Francisco) - Wednesday, September 14
http://www.meetup.com/Alluxio/events/233453125/

September Apache Kafka Meetup (Mountain View) - Thursday, September 15
http://www.meetup.com/http-kafka-apache-org/events/232757214/

How to Simplify Your Streaming Data Architecture with Kafka and Voltdb (Menlo Park) - Thursday, September 15
http://www.meetup.com/Silicon-Valley-ODSC/events/233568054/

Explore Big Data at Speed of Thought with Spark 2.0 and Snappydata (Culver City) - Thursday, September 15
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/233387686/

Washington

Top 5 Mistakes When Writing Spark Applications (Seattle) - Wednesday, September 14
http://www.meetup.com/seattle-data-geeks/events/233316416/

Texas

Microsoft Azure HDInsight: Hadoop in the Cloud (Addison) - Monday, September 12
http://www.meetup.com/DFW-Data-Science/events/232501203/

Ohio

Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, September 12
http://www.meetup.com/Cleveland-Hadoop/events/228560668/

Virginia

Machine Learning and Graph Processing on Accumulo w/ Spark (Arlington) - Wednesday, September 14
http://www.meetup.com/Data-Wranglers-DC/events/225934861/

Georgia

Integrating Real-time Data Streams with Spark & Kafka (Atlanta) - Thursday, September 15
http://www.meetup.com/BigData-Atlanta/events/233722045/

New Jersey

Data Pipelines with Kafka Connect (Princeton) - Wednesday, September 14
http://www.meetup.com/futureofdata-princeton/events/232204892/

New York

Past, Present, and Future of Apache Ambari (New York) - Tuesday, September 13
http://www.meetup.com/futureofdata-newyork/events/231868226/

BRAZIL

Cloudera Meetup (Sao Paulo) - Friday, September 16
http://www.meetup.com/Cloudera-Brasil-Meetup/events/233602601/

SWEDEN

Jenkins Meets Spark: Building a Continuous Integration Intelligence (Stockholm) - Thursday, September 15
http://www.meetup.com/Stockholm-Spark/events/233937082/

POLAND

Kafka, Parallel Streams, Consul (Gdansk) - Wednesday, September 14
http://www.meetup.com/Nordea-Tech-Day/events/233464398/

Apache Airflow: Best Practices and Roadmap (Warsaw) - Wednesday, September 14
http://www.meetup.com/warsaw-hug/events/233912332/

SERBIA

Building Scalable Machine Learning Applications at Groupon (Belgrade) - Wednesday, September 14
http://www.meetup.com/Belgrade-Spark-Meetup/events/233883695/

Enter Kafka Streams (Novi Sad) - Thursday, September 15
http://www.meetup.com/Big-Data-Novi-Sad/events/233709585/

ROMANIA

September Meetup: Spark in Adobe & Intro in Kafka (Bucharest) - Wednesday, September 14
http://www.meetup.com/Bucharest-Big-Data-Meetup/events/233466282/

AUSTRALIA

Apache Spark Day @ IDUG Australia (Sydney) - Tuesday, September 13
http://www.meetup.com/Big-Data-Developers-in-Sydney/events/233436346/