Data Eng Weekly

Hadoop Weekly Issue #146

22 November 2015

This week is a short, holiday week in the US—and a lot of projects shipped new releases before going on break. Also, there are technical articles about Flume, Spark, YARN, and Sentry. Finally, Cloudera unveiled proposals for Impala and Kudu to join the Apache Incubator.


This post contains some tips for working with Flume. Specifically, it looks at how channels hook up to sources and sinks, how to achieve fanout, and how to modify the Flume JVM arguments.

The Altiscale blog has an update on progress to improve the support for launching Docker containers from YARN. The LinuxContainerExecutor in trunk now supports running containers via Docker based on a run-time option. The feature is targeted for the 2.8 release of Apache Hadoop.

This post describes some of the pros and cons of Apache Sentry, the authorization system for Hadoop clusters. It notes a number of limitations / wish list items (some of which are already in progress or implemented, but not yet released), particularly for Solr integration, HDFS/Hive synchronization, and additional systems.

Oftentimes, Spark is described as a panacea for everyone's big data problems. While it's quite nice in many situations, this post describes five issues that crop up in production. They include memory issues, small files problems, strange low-level errors, and more.

This presentation, from QCon SF, describes the highlights of several seminal papers in distributed systems related to eventual consistency and system verification. For each of the papers, there's a list of key takeaways, which provides great context for picking up a new paper.


This week, Cloudera submitted proposals for Impala and Kudu to join the Apache Software Foundation Incubator. The Cloudera blog describes the evolution of Impala, both in terms of features and external contributions. Cloudera also affirms its belief that Kudu, Impala, and Spark will be important core components of Hadoop in the long term.


Splice Machine has launched version 2.0 of its RDBMS built on Hadoop. The new version adds support for Spark, the ability to mix simultaneous OLAP and OLTP workloads, a priority level mechanism to best utilize resources, and a management console to monitor queries.

Hivemall, the library of machine learning functions for Apache Hive, recently released version 0.4.0. The new release includes support for Factorization Machine and Random Forest classification/regression.

Apache Flink 0.10.0 was released this week. The new version contains a number of new features, and the DataStream API is now considered production-ready. Among the new features, Flink 0.10.0 adds support for event-time stream processing (i.e. considering event time rather than ingestion time), stateful stream processing, high availability, a new web dashboard, off-heap managed memory, outer joins, and much more.
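To illustrate the event-time distinction mentioned above, here is a minimal plain-Python sketch. It does not use Flink's API; the function name, field names, and sample events are all hypothetical. It counts events in 10-second tumbling windows twice, once keyed by when each event happened and once by when it arrived, to show how a delayed event lands in a different window.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, time_field):
    """Count events per tumbling window, keyed by each window's start time."""
    counts = defaultdict(int)
    for event in events:
        ts = event[time_field]
        # Round the timestamp down to the start of its window.
        window_start = ts - (ts % window_size)
        counts[window_start] += 1
    return dict(counts)


# Hypothetical events (times in seconds). event_time is when the event
# happened; ingest_time is when the system received it.
events = [
    {"event_time": 1, "ingest_time": 2},
    {"event_time": 8, "ingest_time": 9},
    {"event_time": 9, "ingest_time": 13},  # delayed in transit
    {"event_time": 12, "ingest_time": 14},
]

by_event_time = tumbling_window_counts(events, 10, "event_time")
by_ingest_time = tumbling_window_counts(events, 10, "ingest_time")

print(by_event_time)   # {0: 3, 10: 1}
print(by_ingest_time)  # {0: 2, 10: 2}
```

The delayed event is counted in the first window under event time but slips into the second window under ingestion time. A real stream processor also has to decide how long to wait for late events, which this sketch ignores.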

Amazon EMR 4.2.0 adds new versions of Apache Spark, Presto, Apache Zeppelin, and Apache Oozie. The AWS blog has more details on the features of the new versions.

Qubole has added the ability to share Spark RDDs across multiple jobs as part of their Spark Job Server API. To power this feature, Qubole is using Apache Zeppelin.

Version 0.5.5-incubating of Apache Zeppelin was released. Zeppelin is a web-based notebook service with support for data analysis using various backends, such as Apache Spark.

Cloudera Enterprise 5.5 was released this week, with the new Cloudera Navigator Optimizer (beta), security improvements (such as column-level security in Impala and Hive), improved performance/scale/operations, and more. The new release updates versions of Apache Spark, Apache Flume, Apache Sqoop, Apache Sentry, HUE, and Impala. It also adds support for RHEL 7 and MariaDB.

As part of Cloudera Enterprise 5.5, Cloudera released Impala 2.3. The biggest feature of the new release is support for querying complex/nested data types stored in Apache Parquet tables. The Cloudera blog has a post describing the SQL extensions that they've implemented to this end, including several examples.


Curated by Datadog



Flink 0.10: Graduating the Streaming API (Chicago) - Tuesday, November 24


Beyond Shuffling: Tips & Tricks for Scaling Apache Spark (Vancouver) - Monday, November 23

Toronto Apache Spark #3 (Toronto) - Wednesday, November 25

A Sneak Peek Into Spark 1.6: From RDD to DataFrames to Datasets (Vancouver) - Wednesday, November 25

Architecture Review Session (Toronto) - Friday, November 27

IRELAND Greenplum, Apache HAWQ, MPP vs Hadoop, Modern Data Architecture (Dublin) - Saturday, November 28


Security Considerations in Hadoop and Big Data + More (London) - Tuesday, November 24

Autoscaling Spark + Spark Execution Model (London) - Thursday, November 26


Spark After Dark 1.5 with Chris Fregly (Stockholm) - Monday, November 23

Flink Streaming at Ericsson Research (Kista) - Thursday, November 26


SparkR + H2O (Barcelona) - Wednesday, November 25


Behind the Scenes of Google BigQuery (Toulouse) - Wednesday, November 25


Real-Time Analytics with Spark, Kafka, Cassandra (Copenhagen) - Wednesday, November 25


Pain-Free Agile Hadoop Datahub Development and Operation (Hamburg) - Thursday, November 26


Spark After Dark 1.5 (Budapest) - Thursday, November 26


Real-Time Insights for Advertising Tech Using Apex (Pune) - Wednesday, November 25

Building Distributed Systems (Bangalore) - Saturday, November 28

Time Series Database on Spark (Bangalore) - Saturday, November 28