Data Eng Weekly

Hadoop Weekly Issue #147

29 November 2015

With the holiday in the US, it was a relatively light news week, but there are good technical articles covering stream processing, Spark, and more. Also, there were a number of releases—Apache Drill, Apache Flink, and Apache Kafka.


This post describes Apache Flink's DataStream API by way of an example program that processes Tweets from the Twitter API. It covers setting up a local development environment, how to write a custom StreamGenerator (of Tweets), and how to run the program via the Flink command-line utils.

This presentation gives a practical overview of two popular stream processing frameworks—Apache Storm and Apache Spark Streaming. There's some advice about both (including pros and cons of each) as well as some rules for when to use one or the other.

This post describes four ways to integrate R with Hadoop. There's also an example of using the RHadoop library for interacting with data in HDFS and running a MapReduce job.

The Morning Paper covered "Asynchronous Complex Analytics in a Distributed Dataflow Architecture," which looks at mechanisms to increase the performance of machine learning calculations in distributed systems like Hadoop and Spark. The authors have built a prototype on top of Spark using Asynchronous Sideways Information Passing (ASIP), which has different characteristics from the Bulk Synchronous Parallel model typically used. The paper describes some of the challenges of the implementation and reports on its performance.

The MapR blog has a brief introduction to pyspark, the Python bindings for Apache Spark.

The upcoming Apache Spark 1.6 has support for directly querying the contents of a file without first creating a table. This doc has some examples of using the feature.
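As a rough illustration of the feature, the sketch below builds the kind of statement Spark SQL accepts for querying a file in place: the data source name, a dot, and the backtick-quoted path. The helper name `direct_file_query` and the path `/tmp/events.parquet` are hypothetical and used only for illustration; running the statement requires a Spark installation, so that part is left as a comment.

```python
def direct_file_query(fmt: str, path: str, columns: str = "*") -> str:
    """Build a Spark SQL statement that reads `path` directly via data source `fmt`.

    Hypothetical helper for illustration; it only assembles the SQL text.
    """
    # The path is wrapped in backticks, so a backtick inside it would break quoting.
    if "`" in path:
        raise ValueError("path must not contain backticks")
    return f"SELECT {columns} FROM {fmt}.`{path}`"


# Hypothetical file path, assumed to hold Parquet data.
query = direct_file_query("parquet", "/tmp/events.parquet")
print(query)

# With Spark available (an assumption; not needed for the lines above),
# the statement could be passed to an existing SQLContext:
#   df = sqlContext.sql(query)
#   df.show()
```

No `CREATE TABLE` or temporary table registration appears anywhere in the sketch, which is the point of the feature: the file itself takes the place of the table name.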


The SystemML project, which is a large-scale machine learning framework with support for Hadoop and Spark execution models, has been accepted into the Apache Incubator. SystemML was open-sourced by IBM earlier this year.

Apache: Big Data North America is May 9-12, 2016 in Vancouver, Canada. The Call for Proposals is open now through February 12th.


Version 0.3.4 of Schedoscope, the scheduling framework for Hadoop data warehouses, was recently released. The new version adds support for Hive 1.1.0, is based on Scala 2.11, and includes major performance improvements.

Apache Kafka 0.9.0 was released this week. The Confluent blog has a summary of the major work in the release (there were over 500 Jira issues resolved), which includes security, Kafka Connect (for copying data in and out of Kafka), a new consumer API, and user-defined quotas (on a per-client basis). The new version also drops support for Java 6 and Scala 2.9.

The 1.3 version of Apache Drill was released this week with several new features. Highlights include enhanced S3 support, heterogeneous type support, header parsing for text files, and support for sequence files.

On the heels of the recent 0.10.0 release, Apache Flink announced the 0.10.1 bugfix release. It's a recommended upgrade for all users, and it resolves over 20 issues.

Cloudera released the second beta of Kudu, the new storage engine for Hadoop. Version 0.6.0 contains changes to the Java client, new commands in the kudu-admin tool, support for single-node development on OS X, and more.


Curated by Datadog



Streaming Data Analytics: Next-Generation Big Data Techniques (San Francisco) - Tuesday, December 1

Big Data Application Meetup (Palo Alto) - Wednesday, December 2

Apache Eagle: Secure Your Hadoop Data (San Jose) - Thursday, December 3

Baidu and Spark (Sunnyvale) - Thursday, December 3


SnappyData: Real Time Operational Analytics with Apache Spark! (Portland) - Tuesday, December 1


Uniting Spark and Hadoop: The One Platform Initiative (Scottsdale) - Wednesday, December 2


American Airlines, Datameer and Cloudera (Fort Worth) - Thursday, December 3


Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Chicago) - Tuesday, December 1

District of Columbia

IBM Lights the Spark in DC (Washington) - Thursday, December 3


Continuous Data Management for Hadoop and Spark: On-Premise or in the Cloud (Boston) - Thursday, December 3


Spark Technology Discussion & Demo (Kitchener) - Tuesday, December 1


Big Trouble: Getting into the Flow of Hadoop Testing (London) - Monday, November 30


Google Cloud Dataproc & the Network Behind the Elephant (Stockholm) - Wednesday, December 2


Lauri Niskanen: A Recommendation System Illustrated with Spark (Tampere) - Tuesday, December 1


MUG #1 - Mesos Fundamentals (Warsaw) - Friday, December 4


Streaming Data with Apache Kafka (Zagreb) - Wednesday, December 2


Apache Spark in the Cloud, Fighting World Hunger (Tel Aviv-Yafo) - Tuesday, December 1


Shanghai Big Data Streaming 2nd Meetup (Shanghai) - Sunday, December 6


Spark Meetup during Strata! (Singapore) - Tuesday, December 1

Meetup @ Strata with Doug Cutting, Ted Dunning + more (Singapore) - Wednesday, December 2

Strata Community Event: Productionizing Data Science at Scale (Singapore) - Thursday, December 3


Apache Flink and NiFi (Melbourne) - Tuesday, December 1