21 August 2016
This week's issue is short and sweet, featuring articles on Hadoop, Spark, Kafka, and HAWQ. In releases, Apache Phoenix and Apache Gearpump, which is a relatively new incubator project for real-time streaming implemented with Akka Actors, both had releases.
SparkSession, exposed as spark in the spark-shell, is a new API in Spark 2.0. The SparkSession aims to be a unified entry point by providing the same functionality as the SparkContext, SQLContext, and more (including creation of Datasets and DataFrames). The Databricks blog has an overview of the main functionality of SparkSession.
The Hortonworks blog has a post about Hadoop in the cloud. It discusses some of the challenges (e.g. different semantics in blob stores, different security integration), and the improvements planned to address them (e.g. caching, improved connectors).
Pivotal has created a Docker-based sandbox (both single and multi-node) for Apache HAWQ (incubating). HAWQ is an MPP database that's integrated with Hadoop. Their introductory blog post has background and information on getting started.
Apache Hadoop has a tool called create-release for building releases, creating release notes, signing artifacts, and more. This post describes how to use the tool to build your own release, and how to use some of the non-default settings such as building native libraries and building via Docker.
The LINE Engineers' Blog has a post on how LINE is deploying Kafka with Kafka Streams. The article describes how they decided on Kafka Streams (vs. Samza), some of its compelling features, two applications built on the platform, and some of the improvements to Kafka Streams the LINE team has made.
Hortonworks and Microsoft have a strong partnership—the Hortonworks Data Platform powers Azure HDInsight, and Hortonworks worked on Windows support for Hadoop. This interview has more details about the relationship and the state of Hadoop on Microsoft Azure.
The Hadoop project recently captured some criteria and example paths to becoming a committer for the project. If you're a contributor or are thinking of contributing, this adds some useful context to what it takes to become a committer.
Cloudera has reaffirmed its commitment to Apache Spark. The post highlights streaming and machine learning applications, and it notes that expectations for Spark 2.0 are high.
Apache Gearpump (incubating), the real-time stream processing system built on Akka, has released version 0.8.1-incubating. The release includes a number of changes (e.g. link updates, package renames) related to the project's entry into the Apache incubator.
Apache Phoenix 4.8 was released this week. The release includes a number of bug fixes, OFFSET support for pagination, Apache Hive integration, and more.
Curated by Datadog ( http://www.datadog.com )
Stream Processing Meetup @ LinkedIn (Mountain View) - Tuesday, August 23
#SDBigData Meetup #17 (San Diego) - Wednesday, August 24
Lambda, Kinesis, Spacepods & More... (Santa Monica) - Wednesday, August 24
Apache Cassandra + Spark Makeover w/ Apache Zeppelin & Scaling Cassandra at Uber (San Francisco) - Wednesday, August 24
Focusing on Ingest into Hive/Impala and Streaming with Kafka (Palo Alto) - Thursday, August 25
Meetup at Salesforce (San Francisco) - Thursday, August 25
Beyond ETL: Real-Time, Streaming Architectures (Plano) - Tuesday, August 23
Processing & Serving 60 Terabytes of Data… Per Day! (Kansas City) - Tuesday, August 23
Talend: Integrating Real-time Data Streams with Spark and Kafka (Kansas City) - Thursday, August 25
Flink Cluster on Cloud in Minutes to Analyze Streaming Time Series Data (Chicago) - Wednesday, August 24
2016 BigDataWisconsin Conference (Madison) - Monday, August 22
Integrating Hadoop and SQL Server and Comparison of All SQL-on-Hadoop Options (Durham) - Thursday, August 25
Fast-Data Meetup Event (McLean) - Wednesday, August 24
Lighting Fires and Predicting User Behaviour with Spark (London) - Wednesday, August 24
Harnessing Kafka for Payment Processing at Massive Scale w/ Joe Nash, Improbable (London) - Thursday, August 25
Join Us for Our First Meetup in Stockholm, Hosted by Spotify (Stockholm) - Tuesday, August 23
Spark, Topic Models, and Content Recommendation (Amsterdam) - Friday, August 26
Ankara Tech Talks #3 (Ankara) - Friday, August 26
Apache Storm and Real-Time Data Ingestion (Noida) - Wednesday, August 24
MapReduce and the Art of Thinking Parallel: Dr. Shailesh Kumar (Hyderabad) - Saturday, August 27
Anatomy of Spark SQL Catalyst, Part 2 (Bangalore) - Saturday, August 27
MLDM Monday | Spark (Taipei) - Monday, August 22