11 September 2016
Apache NiFi hit 1.0.0 this week and Apache Hadoop announced version 3.0.0-alpha1. In addition to those big releases, there were a number of minor releases, and there are several interesting technical and news articles.
This post looks at the Apache Beam and Apache Spark programming models. The author argues that Beam's programming model and optimization capabilities make it a strong choice for stream processing, and that the sweet spot for Spark (and something Beam doesn't support well) is iterative algorithms. The post includes code implementing the same non-trivial use-case in both frameworks.
http://www.unofficialgoogledatascience.com/2016/08/next-generation-tools-for-data-science.html
The Amazon Big Data blog has a post that describes how to use a number of their services (VPC Flow Logs, Lambda, and Elastic MapReduce) to analyze traffic inside a VPC. The actual analysis is implemented as two streaming MapReduce jobs written in Python and awk.
http://blogs.aws.amazon.com/bigdata/post/Tx2Y2H2DYGSEURN/Processing-VPC-Flow-Logs-with-Amazon-EMR
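As a rough illustration of what a streaming MapReduce job over flow logs can look like, here is a minimal Python sketch. It is not the code from the AWS post. It assumes the default space-separated VPC Flow Logs record layout, and the choice of key (source/destination address pair), the function names, and the sample records are all hypothetical.

```python
from itertools import groupby

# Default VPC Flow Logs record layout (space-separated fields).
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]


def map_lines(lines):
    """Mapper: emit 'srcaddr,dstaddr<TAB>bytes' for each accepted flow."""
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip malformed records
        rec = dict(zip(FIELDS, parts))
        # Rejected flows are ignored; NODATA records carry '-' placeholders.
        if rec["action"] != "ACCEPT" or not rec["bytes"].isdigit():
            continue
        yield "%s,%s\t%s" % (rec["srcaddr"], rec["dstaddr"], rec["bytes"])


def reduce_lines(lines):
    """Reducer: sum bytes per key. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


# Hypothetical sample records, used to simulate map -> sort -> reduce locally.
SAMPLE = [
    "2 123456789010 eni-abc123de 10.0.0.5 10.0.1.9 49152 443 6 10 840 1473000000 1473000060 ACCEPT OK",
    "2 123456789010 eni-abc123de 10.0.0.5 10.0.1.9 49153 443 6 4 160 1473000060 1473000120 ACCEPT OK",
    "2 123456789010 eni-abc123de 10.0.0.7 10.0.1.9 49154 22 6 2 80 1473000060 1473000120 REJECT OK",
    "2 123456789010 eni-abc123de - - - - - - - 1473000120 1473000180 - NODATA",
]

for row in reduce_lines(sorted(map_lines(SAMPLE))):
    print(row)
```

In a real streaming job, the mapper and reducer would each read from standard input and write to standard output, with the framework handling the sort between the two steps; the `sorted()` call above only stands in for that shuffle.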
The Hortonworks blog has the first part in a two-part series on Apache Ranger's dynamic column masking and row-level filtering. Part one motivates the features with example use-cases.
http://hortonworks.com/blog/eyes-dynamic-column-masking-row-level-filtering-hdp2-5/
JVM startup time can be a major limitation when building a system for interactive queries. The Qubole blog describes how they solved this problem by using a Ruby client to invoke Presto queries.
http://www.qubole.com/blog/product/presto-ruby-client-in-qds/
Apache Ambari Views Server is a standalone server that doesn't manage a Hadoop cluster but provides access to Views (e.g. to browse the file system). This tutorial describes how to configure an Ambari Views Server in version 2.4.
http://hortonworks.com/blog/introduction-ambari-views-2-4-new-feature-remote-cluster-configuration/
Skool is a new open-source tool from BT for loading data into Hadoop. It automates the process of setting up a new database-to-HDFS job by validating connectivity, doing a test import, and generating an Oozie workflow to schedule incremental imports. The introductory post has more details on the tool and plans for future features.
Using an example data set of mobile network data, this tutorial describes how to build a streaming pipeline using Spark Streaming and MapR Streams. The examples use the Kafka API, so they should be mostly applicable to a Kafka cluster as well.
https://www.mapr.com/blog/how-get-started-spark-streaming-and-mapr-streams-using-kafka-api
Hadoop Summit Melbourne was this week. This post contains an overview of the event and summaries of talks from Hortonworks, Telstra, Yahoo! Japan, and the Barcelona Supercomputing Centre.
http://blog.dbvisit.com/hadoop-summit-melbourne-hs16melb-day-one-reflections/?
insideBIGDATA has an interview with Justin Kestelyn of Cloudera in which they discuss Cloudera's focus and changes over the past year, Apache Kudu, the future of Cloudera, and more.
The speaker lineup for Crunch Conference, the data engineering/analytics conference, has been posted. The conference takes place in Budapest from October 5th through October 7th.
Apache NiFi released version 1.0.0 with a refreshed UI, automatic cluster coordination, improved authorization, and better support for version control of data flow templates.
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.0.0
Apache Flink 1.1.2 is a bugfix release that resolves 19 tickets.
http://flink.apache.org/news/2016/09/05/release-1.1.2.html
IBM announced a technical preview of IBM Open Platform with Apache Hadoop 4.3. The release includes Apache Spark 2.0.
Apache Accumulo 1.8.0, which resolves over 200 bugs and contains 75 improvements/new features, was released. Major changes include speedups (WAL rollover improvements, rate-limiting of major compactions), improved API support for RFiles, table sampling, upgrading to Apache Thrift 0.9.3, and the ability to run multiple tablet servers on a single node.
https://accumulo.apache.org/release_notes/1.8.0
Version 3.5.0 of the Malhar library for Apache Apex was released. The release includes a new windowing operator to fit with Apache Beam semantics, spillable data structures, a deduper task, and more.
https://blogs.apache.org/apex/entry/apache_apex_malhar_3_5
The Apache NiFi project announced a new release artifact, MiNiFi C++, a C++ library for collecting sensor data where the JVM isn't practical. Version 0.0.1 is now available.
Apache Storm has announced a maintenance release, version 0.9.7, which resolves issues with multi-lang support.
http://storm.apache.org/2016/09/07/storm097-released.html
Confluent announced version 3.0.1 of the Confluent Platform based on Apache Kafka 0.10.0.1.
http://www.confluent.io/blog/announcing-confluent-platform-3-0-1-apache-kafka-0-10-0-1/
Apache Hadoop has announced the first alpha release of Hadoop 3.0. As noted in the announcement, the release contains thousands of changes since the 2.7.x release line. Major highlights (from the release overview) include a move to Java 8, support for erasure coding in HDFS, a new YARN Timeline Service, a shell script rewrite, support for more than two NameNodes, intra-datanode balancing, and more.
DataMountaineer, Confluent, and DataStax have announced a Certified DataStax connector for Kafka Connect.
http://www.confluent.io/blog/announcing-the-certified-datastax-connector
Curated by Datadog ( http://www.datadog.com )
September 2016 Meetup (San Francisco) - Tuesday, September 13
http://www.meetup.com/hadoopsf/events/233741637/
Introduction to Succinct by UC Berkeley AmpLab (San Francisco) - Tuesday, September 13
http://www.meetup.com/SF-Big-Analytics/events/232782222/
Demo and Open House: Big Data Processing with Apache Spark (San Francisco) - Tuesday, September 13
http://www.meetup.com/Metis-San-Francisco-Data-Science/events/233765349/
Lambda-In-A-Box: Combining Spark and HBase (San Francisco) - Wednesday, September 14
http://www.meetup.com/sfjava/events/233806080/
Alluxio: Unifying APIs, Accelerating ML, & Enabling Cloud Architectures (San Francisco) - Wednesday, September 14
http://www.meetup.com/Alluxio/events/233453125/
September Apache Kafka Meetup (Mountain View) - Thursday, September 15
http://www.meetup.com/http-kafka-apache-org/events/232757214/
How to Simplify Your Streaming Data Architecture with Kafka and Voltdb (Menlo Park) - Thursday, September 15
http://www.meetup.com/Silicon-Valley-ODSC/events/233568054/
Explore Big Data at Speed of Thought with Spark 2.0 and Snappydata (Culver City) - Thursday, September 15
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/233387686/
Top 5 Mistakes When Writing Spark Applications (Seattle) - Wednesday, September 14
http://www.meetup.com/seattle-data-geeks/events/233316416/
Microsoft Azure HDInsight: Hadoop in the Cloud (Addison) - Monday, September 12
http://www.meetup.com/DFW-Data-Science/events/232501203/
Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, September 12
http://www.meetup.com/Cleveland-Hadoop/events/228560668/
Machine Learning and Graph Processing on Accumulo w/ Spark (Arlington) - Wednesday, September 14
http://www.meetup.com/Data-Wranglers-DC/events/225934861/
Integrating Real-time Data Streams with Spark & Kafka (Atlanta) - Thursday, September 15
http://www.meetup.com/BigData-Atlanta/events/233722045/
Data Pipelines with Kafka Connect (Princeton) - Wednesday, September 14
http://www.meetup.com/futureofdata-princeton/events/232204892/
Past, Present, and Future of Apache Ambari (New York) - Tuesday, September 13
http://www.meetup.com/futureofdata-newyork/events/231868226/
Cloudera Meetup (Sao Paulo) - Friday, September 16
http://www.meetup.com/Cloudera-Brasil-Meetup/events/233602601/
Jenkins Meets Spark: Building a Continuous Integration Intelligence (Stockholm) - Thursday, September 15
http://www.meetup.com/Stockholm-Spark/events/233937082/
Kafka, Parallel Streams, Consul (Gdansk) - Wednesday, September 14
http://www.meetup.com/Nordea-Tech-Day/events/233464398/
Apache Airflow: Best Practices and Roadmap (Warsaw) - Wednesday, September 14
http://www.meetup.com/warsaw-hug/events/233912332/
Building Scalable Machine Learning Applications at Groupon (Belgrade) - Wednesday, September 14
http://www.meetup.com/Belgrade-Spark-Meetup/events/233883695/
Enter Kafka Streams (Novi Sad) - Thursday, September 15
http://www.meetup.com/Big-Data-Novi-Sad/events/233709585/
September Meetup: Spark in Adobe & Intro in Kafka (Bucharest) - Wednesday, September 14
http://www.meetup.com/Bucharest-Big-Data-Meetup/events/233466282/
Apache Spark Day @ IDUG Australia (Sydney) - Tuesday, September 13
http://www.meetup.com/Big-Data-Developers-in-Sydney/events/233436346/