Data Eng Weekly

Hadoop Weekly Issue #160

06 March 2016

This week, Hortonworks made several announcements, including a new partnership and changes to the way they're shipping HDP. They also announced support for Spark 1.6, and Spark is a big theme this week (with articles on logging, memory settings, and GraphFrames). On the release front, Apache Hive, MRQL (incubating), and Kudu (incubating) all shipped new versions.


As the Hadoop ecosystem has grown to include a number of stream processing and low-latency querying frameworks, the term "real-time" gets thrown around a lot. This post aims to disambiguate the phrase by explaining the differences between sub-second response, human-comfortable response time, event-driven processing, streaming data processing, and near real-time.

This post takes a look at configuring and generating logs in Spark. Because of the way in which Spark serializes and distributes closures, there is a gotcha with the way in which you can use loggers. The post describes a couple of solutions to this issue.
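
The gotcha is easier to see in miniature. The sketch below is a Python analogue that uses pickle rather than Spark itself: an object holding a logger-like resource cannot be serialized, so the logger is left out of the serialized state and rebuilt lazily on the receiving side (similar in spirit to a transient, lazily initialized logger in Scala). FakeLogger and Doubler are hypothetical names invented for the illustration, and this is one common workaround, not necessarily one of the solutions the post describes.

```python
import pickle
import threading


class FakeLogger:
    """Hypothetical stand-in for a logger that holds an unpicklable resource."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # locks cannot be pickled
        self.lines = []

    def info(self, msg):
        with self._lock:
            self.lines.append(f"{self.name}: {msg}")


class Doubler:
    """Hypothetical task object that is serialized and shipped to a worker."""

    def __init__(self):
        self._log = None  # created on first use, never serialized

    @property
    def log(self):
        if self._log is None:
            self._log = FakeLogger("Doubler")
        return self._log

    def __getstate__(self):
        # Leave the logger out of the serialized state.
        state = self.__dict__.copy()
        state["_log"] = None
        return state

    def __call__(self, x):
        self.log.info(f"doubling {x}")
        return 2 * x


task = Doubler()
task(1)  # the logger now exists on the "driver" side
shipped = pickle.loads(pickle.dumps(task))  # works because the logger is dropped
result = shipped(21)  # the "worker" copy builds its own logger on first use
```

Without the __getstate__ override, pickle.dumps(task) would fail once the logger had been created, which is the same class of failure as shipping a closure that captures a non-serializable logger.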

Altiscale has "part 1.1" in a series on Hadoop NodeGroups, in which it discusses enabling the four-layer network topology for a Docker-based deployment and the discovery of a related performance degradation.

The DataTorrent blog has a post describing how Apache Apex (incubating) implements exactly-once processing even when interacting with external systems. It describes how the semantics are maintained with Kafka as input and JDBC as output. There is also an overview of a new Kafka 0.9-based connector, which is much simpler due to the new consumer API in that release.
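
One common way to get exactly-once results from a replayable input and a transactional database output is to commit a batch marker in the same transaction as the data, so a batch replayed after a failure can be detected and skipped. The sketch below illustrates that general idea with SQLite from the Python standard library. The table names, the batch_id scheme, and the write_batch helper are assumptions made for the example, and it is not taken from the Apex implementation.

```python
import sqlite3


def write_batch(conn, batch_id, rows):
    """Write one replayable batch; return False if it was already committed."""
    with conn:  # one transaction: commits on success, rolls back on error
        done = conn.execute(
            "SELECT 1 FROM committed_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if done:
            return False  # a replay after recovery, so skip it
        conn.executemany(
            "INSERT INTO events (payload) VALUES (?)", [(r,) for r in rows]
        )
        # The marker is committed atomically with the rows above.
        conn.execute(
            "INSERT INTO committed_batches (batch_id) VALUES (?)", (batch_id,)
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
conn.execute("CREATE TABLE committed_batches (batch_id INTEGER PRIMARY KEY)")
conn.commit()

first = write_batch(conn, 1, ["a", "b"])
replayed = write_batch(conn, 1, ["a", "b"])  # same batch delivered again
second = write_batch(conn, 2, ["c"])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Because the rows and the marker succeed or fail together, a crash between reading a batch and committing it leaves no partial output, and redelivery of a committed batch becomes a no-op.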

In another series on the Altiscale blog, part 4 of their Spark on Hadoop series covers Spark memory settings. It looks at how command-line arguments for drivers and executors correspond to the actual memory allocated for JVMs when running Spark-on-YARN. To explain these numbers, it dives into logs and source code.
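
As a rough illustration of why the memory YARN allocates is larger than the value passed on the command line, the sketch below adds an off-heap overhead to the requested executor memory and rounds the total up to YARN's allocation increment. The numbers are assumptions, not figures from the post: a 10% overhead factor with a 384 MB floor (what I understand to be the Spark 1.6 default for spark.yarn.executor.memoryOverhead) and a 1024 MB increment (the usual default for yarn.scheduler.minimum-allocation-mb). The function name is hypothetical, and the real values depend on your Spark version and cluster configuration.

```python
import math


def yarn_container_mb(executor_memory_mb,
                      overhead_factor=0.10,    # assumed Spark 1.6 default
                      min_overhead_mb=384,     # assumed Spark 1.6 default
                      yarn_increment_mb=1024): # assumed YARN default
    """Estimate the YARN container size for one Spark executor, in MB."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    requested = executor_memory_mb + overhead
    # YARN rounds each request up to a multiple of its allocation increment.
    return math.ceil(requested / yarn_increment_mb) * yarn_increment_mb


# 4096 MB heap + 409 MB overhead = 4505 MB requested, rounded up to 5120 MB
four_gb = yarn_container_mb(4096)
# 1024 MB heap + 384 MB overhead = 1408 MB requested, rounded up to 2048 MB
one_gb = yarn_container_mb(1024)
```

Under these assumptions, asking for a 1 GB executor can occupy a 2 GB container, which is the kind of gap the post traces through logs and source code.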

The AWS Big Data Blog has a tutorial about using DynamoDB from Spark. The article describes how to create an RDD using the Hadoop DynamoDBInputFormat, an approach that works with any Hadoop InputFormat.

This post describes GraphFrames, a graph library built using Spark DataFrames. The library is compatible with Python, Java, and Scala APIs. There's an example of some basic computations as well as PageRank, and a discussion of the relationship between GraphFrames and Spark's GraphX.


Hortonworks held an event called "The Future of Data" this week. ZDNet and CMSWire have coverage of the announcements, which include a new partnership with Hewlett Packard Enterprise on Apache Spark, changes to the support model for Hortonworks DataFlow, and a new release schedule for Hortonworks Data Platform (more details below).

On the heels of Spark Summit East, the Altiscale blog has an article with several news clips about Spark.


Hortonworks has released HDP 2.4 and announced a new release strategy. In the new cadence, core services (such as Hadoop) will be updated yearly and extended services, such as Spark, will be updated more frequently. The post has a lot more information about the new strategy, and there's another post about the first of the extended releases—Apache Spark 1.6.

Apache Hive 2.0.0 was recently released. The release resolves over 1,000 issues, which have helpfully been distilled into several highlights on the Cloudera blog. These include an alpha version of an HBase metastore, several Hive-on-Spark improvements, performance optimizations (such as Parquet predicate pushdown), and a new HiveServer2 web UI.

Apache Kudu 0.7.0-incubating was released this week. The release notes summarize key changes and improvements, which include a new Python client, an improved Spark integration, new server-level metrics, and bug fixes (file descriptor leak, hang in Java client).

Version 0.9.6-incubating of Apache MRQL, the query processing framework, was released this week. MRQL supports MapReduce, Apache Hama, Apache Spark, and Apache Flink as backends, and the supported versions of Flink and Hama have been updated as part of the release. The release notes have more details about the contents of the release.

Cloudera Enterprise 5.6.0 was released this week. The new release adds support for EMC's DSSD D5.


Curated by Datadog



Building Apps with Distributed In-Memory Computing Using Apache Geode (Palo Alto) - Monday, March 7

Building Real-Time Data Pipelines with Spark, Kafka, and Python (San Francisco) - Wednesday, March 9

Cloud Control: Efficient Hadoop ETL Processing with 85% Spot Utilization (San Francisco) - Thursday, March 10


A Primer Into Jupyter, Spark on HDInsight, and Office 365 Analytics with Spark (Bellevue) - Wednesday, March 9


What's New in Hadoop? Hive on Tez and Spark, Compression, Encryption, and More (Houston) - Tuesday, March 8


Neo4j for Process Mining and Hadoop on AWS! (Wyoming) - Wednesday, March 9


Virtualizing Big Data: Effective Approaches Derived from Real Deployment (Atlanta) - Wednesday, March 9

New Jersey

Scala + Spark SQL Workshop (Hamilton Township) - Thursday, March 10

New York

Scaling Your R Analytics Using Hadoop & Spark w/ IBM & Galvanize! (New York) - Tuesday, March 8


Eat, Drink, and Talk about HDInsight (Cedar Rapids) - Tuesday, March 8

IRELAND

Big Data, AWS & The Data Pipeline + Distributed MPP & Analytics with HPCC (Dublin) - Monday, March 7


Tech Nottingham: Joe Nash on Kafka (Nottingham) - Monday, March 7

SMACK & Data Modelling (London) - Tuesday, March 8

Big Data Bootcamp (London) - Saturday, March 12


Big Data, No Fluff: Let’s Get Started with Hadoop #6 (Oslo) - Thursday, March 10


Speed-Up Distributed Deep Learning with Spark on AWS (Barcelona) - Thursday, March 10


SMACK & Achilles (Paris) - Monday, March 7

NightClazz Spark + Machine Learning (Paris) - Thursday, March 10


Stream Processing with Apache Flink and Mining Github (Amsterdam) - Thursday, March 10


Apache Spark Workshops (Torun) - Saturday, March 12


Hadoop Meetup #5 (Vilnius) - Monday, March 7


Spark Streaming and GraphX (Vienna) - Tuesday, March 8


Unsupervised Learning with Apache Spark (Zagreb) - Wednesday, March 9


Know Your Distributed Tools, Apache Tez and Spark (Tel Aviv-Yafo) - Wednesday, March 9


Big Data Processing with Apache Spark (Hyderabad) - Saturday, March 12

Comparison of Various Streaming Technologies (Bangalore) - Saturday, March 12


ML Pipeline Demo, Spark Code Generation with Talend + More (Auckland) - Tuesday, March 8