Data Eng Weekly

Hadoop Weekly Issue #183

21 August 2016

This week's issue is short and sweet, featuring articles on Hadoop, Spark, Kafka, and HAWQ. In releases, Apache Phoenix and Apache Gearpump (a relatively new incubator project for real-time streaming implemented with Akka actors) both had releases.


SparkSession, exposed as spark in the spark-shell, is a new API in Spark 2.0. The SparkSession aims to be a unified entry point by providing the same functionality as the SparkContext, SQLContext, and more (including creation of Datasets and DataFrames). The Databricks blog has an overview of the main functionality of SparkSession.

The Hortonworks blog has a post about Hadoop in the cloud. It discusses some of the challenges (e.g. different semantics in blob stores, different security integration), and the improvements planned to address them (e.g. caching, improved connectors).

Pivotal has created a Docker-based sandbox (both single and multi-node) for Apache HAWQ (incubating). HAWQ is an MPP database that's integrated with Hadoop. Their introductory blog post has background and information on getting started.

Apache Hadoop has a tool called create-release for building releases, creating release notes, signing artifacts, and more. This post describes how to use the tool to build your own release, and how to use some of the non-default settings such as building native libraries and building via Docker.

The LINE Engineers' Blog has a post on how LINE is deploying Kafka with Kafka Streams. The article describes how they decided on Kafka Streams (vs. Samza), some of its compelling features, two applications built on the platform, and some of the improvements to Kafka Streams the LINE team has made.


Hortonworks and Microsoft have a strong partnership—the Hortonworks Data Platform powers Azure HDInsight, and Hortonworks worked on Windows support for Hadoop. This interview has more details about the relationship and the state of Hadoop on Microsoft Azure.

The Hadoop project recently captured some criteria and example paths to becoming a committer for the project. If you're a contributor or are thinking of contributing, this adds some useful context to what it takes to become a committer.

Cloudera has reaffirmed their commitment to Apache Spark. The post highlights streaming and machine learning applications, and it notes that expectations for Spark 2.0 are high.


Apache Gearpump (incubating), the real-time stream processing system built on Akka, has released version 0.8.1-incubating. The release includes a number of changes (e.g. link updates, package renames) related to the project's entry into the Apache incubator.

Apache Phoenix 4.8 was released this week. The release includes a number of bug fixes, OFFSET support for pagination, Apache Hive integration, and more.
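
For a rough idea of what OFFSET-based pagination looks like, here is a minimal sketch that builds a LIMIT/OFFSET query for a given page. The helper name page_query and the table and column names (web_events, event_id) are hypothetical and not taken from the release notes; the sketch only assembles the SQL text and does not connect to Phoenix.

```python
def page_query(table: str, order_by: str, page: int, page_size: int) -> str:
    """Build a LIMIT/OFFSET query for a zero-based page number.

    Illustration only: identifiers are interpolated directly into the SQL
    text, so they must come from trusted code, never from user input.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    # Skip all rows that belong to earlier pages.
    offset = page * page_size
    # ORDER BY gives a stable row order, so consecutive pages do not overlap.
    return (
        f"SELECT * FROM {table} "
        f"ORDER BY {order_by} "
        f"LIMIT {page_size} OFFSET {offset}"
    )


# Hypothetical table and column names, used only for illustration.
print(page_query("web_events", "event_id", page=2, page_size=10))
```

Large offsets generally still require scanning past the skipped rows, so this style tends to suit shallow paging better than deep scrolls through a big table.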


Curated by Datadog



Stream Processing Meetup @ LinkedIn (Mountain View) - Tuesday, August 23

#SDBigData Meetup #17 (San Diego) - Wednesday, August 24

Lambda, Kinesis, Spacepods & More... (Santa Monica) - Wednesday, August 24

Apache Cassandra + Spark Makeover w/ Apache Zeppelin & Scaling Cassandra at Uber (San Francisco) - Wednesday, August 24

Focusing on Ingest into Hive/Impala and Streaming with Kafka (Palo Alto) - Thursday, August 25

Meetup at Salesforce (San Francisco) - Thursday, August 25


Beyond ETL: Real-Time, Streaming Architectures (Plano) - Tuesday, August 23


Processing & Serving 60 Terabytes of Data… Per Day! (Kansas City) - Tuesday, August 23

Talend: Integrating Real-time Data Streams with Spark and Kafka (Kansas City) - Thursday, August 25


Flink Cluster on Cloud in Minutes to Analyze Streaming Time Series Data (Chicago) - Wednesday, August 24


2016 BigDataWisconsin Conference (Madison) - Monday, August 22

North Carolina

Integrating Hadoop and SQL Server and Comparison of All SQL-on-Hadoop Options (Durham) - Thursday, August 25


Fast-Data Meetup Event (McLean) - Wednesday, August 24


Lighting Fires and Predicting User Behaviour with Spark (London) - Wednesday, August 24

Harnessing Kafka for Payment Processing at Massive Scale w/ Joe Nash, Improbable (London) - Thursday, August 25


Join Us for Our First Meetup in Stockholm, Hosted by Spotify (Stockholm) - Tuesday, August 23


Spark, Topic Models, and Content Recommendation (Amsterdam) - Friday, August 26


Ankara Tech Talks #3 (Ankara) - Friday, August 26


Apache Storm and Real-Time Data Ingestion (Noida) - Wednesday, August 24

MapReduce and the Art of Thinking Parallel: Dr. Shailesh Kumar (Hyderabad) - Saturday, August 27

Anatomy of Spark SQL Catalyst, Part 2 (Bangalore) - Saturday, August 27


MLDM Monday | Spark (Taipei) - Monday, August 22