Data Eng Weekly


Hadoop Weekly Issue #146

22 November 2015

This is a short holiday week in the US, and a lot of projects shipped new releases before going on break. There are also technical articles about Flume, Spark, YARN, and Sentry. Finally, Cloudera unveiled proposals for Impala and Kudu to join the Apache Incubator.

Technical

This post contains some tips for working with Flume. Specifically, it looks at how channels hook up to sources and sinks, how to achieve fan-out, and how to modify the Flume JVM arguments.

https://developer.ibm.com/hadoop/blog/2015/11/16/flume-tips/
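
To make the wiring concrete, here is a minimal sketch (not taken from the linked post) that renders the Flume agent properties for a simple fan-out: one source replicating each event to two channels, with one sink draining each channel. The agent and component names (a1, src1, ch1, ch2, sink1, sink2) and the helper function are hypothetical, and the source and sink types are left out, so the output is a fragment rather than a complete agent configuration.

```python
# Sketch only: component names are hypothetical; source/sink types are omitted.

def fanout_properties(agent, source, channel_to_sink):
    """Render Flume properties for one source fanning out to several channels."""
    channels = list(channel_to_sink)
    sinks = list(channel_to_sink.values())
    lines = [
        f"{agent}.sources = {source}",
        f"{agent}.channels = {' '.join(channels)}",
        f"{agent}.sinks = {' '.join(sinks)}",
        # A source lists every channel it writes to (plural "channels").
        f"{agent}.sources.{source}.channels = {' '.join(channels)}",
        # The replicating selector copies each event to all listed channels.
        f"{agent}.sources.{source}.selector.type = replicating",
    ]
    for channel, sink in channel_to_sink.items():
        lines.append(f"{agent}.channels.{channel}.type = memory")
        # A sink reads from exactly one channel (singular "channel").
        lines.append(f"{agent}.sinks.{sink}.channel = {channel}")
    return "\n".join(lines)


config = fanout_properties("a1", "src1", {"ch1": "sink1", "ch2": "sink2"})
print(config)
```

The asymmetry is the part that tends to trip people up: a source takes a list under `channels`, while each sink takes a single `channel`.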

The Altiscale blog has an update on progress to improve the support for launching Docker containers from YARN. The LinuxContainerExecutor in trunk now supports running containers via Docker based on a run-time option. The feature is targeted for the 2.8 release of Apache Hadoop.

https://www.altiscale.com/blog/launching-docker-containers-with-the-linuxcontainerexecutor/

This post describes some of the pros and cons of Apache Sentry, the authorization system for Hadoop clusters. It notes a number of limitations / wish list items (some of which are already in progress or implemented, but not yet released), particularly for Solr integration, HDFS/Hive synchronization, and additional systems.

http://getindata.com/blog/post/what-is-missing-in-apache-sentry-incubating/

Oftentimes, Spark is described as a panacea for everyone's big data problems. While it's quite nice in many situations, this post describes five issues that crop up in production. They include memory issues, small files problems, strange low-level errors, and more.

http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html

This presentation, from QCon SF, describes the highlights of several seminal papers in distributed systems related to eventual consistency and system verification. For each of the papers, there's a list of key takeaways, which provides great context for picking up a new paper.

https://speakerdeck.com/randommood/we-hear-you-like-papers

News

This week, Cloudera submitted proposals for Impala and Kudu to join the Apache Software Foundation Incubator. The Cloudera blog describes the evolution of Impala, both in terms of features and external contributions. Cloudera also affirms its belief that Kudu, Impala, and Spark will be important core components of Hadoop in the long term.

http://blog.cloudera.com/blog/2015/11/impalas-next-step-proposal-to-join-the-apache-software-foundation/

Releases

Splice Machine has launched version 2.0 of its RDBMS built on Hadoop. The new version adds support for Spark, the ability to mix simultaneous OLAP and OLTP workloads, a priority level mechanism to best utilize resources, and a management console to monitor queries.

http://wwpi.com/new-version-2-0-of-splice-machine-rdbms-offers-hybrid-in-memory-architecture-powered-by-hadoop-spark/

Hivemall, the library of machine learning functions for Apache Hive, recently released version 0.4.0. The new release includes support for Factorization Machine and Random Forest classification/regression.

https://github.com/myui/hivemall/releases/tag/v0.4.0-2

Apache Flink 0.10.0 was released this week. The new version contains a number of new features, and the DataStream API is now considered production-ready. Among the new features, Flink 0.10.0 adds support for event-time stream processing (i.e., considering event time rather than ingestion time), stateful stream processing, high availability, a new web dashboard, off-heap managed memory, outer joins, and much more.

http://flink.apache.org/news/2015/11/16/release-0.10.0.html
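
For readers new to the event-time versus ingestion-time distinction in the Flink item, here is a small plain-Python illustration. It does not use Flink's API; the field names, the 60-second tumbling window, and the sample events are all made up for the example, and it ignores watermarks and late data entirely.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling window size for this sketch


def window_counts(events, time_field):
    """Count events per tumbling window, keyed by the window's start time."""
    counts = defaultdict(int)
    for event in events:
        window_start = event[time_field] // WINDOW_SECONDS * WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)


# Timestamps are seconds since an arbitrary origin (made-up data).
events = [
    {"event_time": 10, "ingestion_time": 15},
    {"event_time": 50, "ingestion_time": 70},  # created in window 0, arrived in window 60
    {"event_time": 65, "ingestion_time": 68},
]

by_event_time = window_counts(events, "event_time")
by_ingestion_time = window_counts(events, "ingestion_time")

print(by_event_time)      # {0: 2, 60: 1}
print(by_ingestion_time)  # {0: 1, 60: 2}
```

The second event is delayed in transit, so it lands in a different window depending on which timestamp drives the windowing. Event-time processing assigns it to the window in which it actually occurred.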

Amazon EMR 4.2.0 adds new versions of Apache Spark, Presto, Apache Zeppelin, and Apache Oozie. The AWS blog has more details on the features of the new versions.

https://aws.amazon.com/blogs/aws/amazon-emr-update-apache-spark-1-5-2-ganglia-presto-zeppelin-and-oozie/

Qubole has added the ability to share Spark RDDs across multiple jobs as part of their Spark Job Server API. To power this feature, Qubole is using Apache Zeppelin.

https://www.qubole.com/blog/product/share-rdds-across-jobs-with-quboles-spark-job-server/

Version 0.5.5-incubating of Apache Zeppelin was released. Zeppelin is a web-based notebook service with support for data analysis using various backends, such as Apache Spark.

http://zeppelin.incubator.apache.org/releases/zeppelin-release-0.5.5-incubating.html

Cloudera Enterprise 5.5 was released this week, with the new Cloudera Navigator Optimizer (beta), security improvements (such as column-level security in Impala and Hive), improved performance/scale/operations, and more. The new release updates versions of Apache Spark, Apache Flume, Apache Sqoop, Apache Sentry, HUE, and Impala. It also adds support for RHEL 7 and MariaDB.

http://blog.cloudera.com/blog/2015/11/cloudera-enterprise-5-5-is-now-generally-available/

As part of Cloudera Enterprise 5.5, Cloudera released Impala 2.3. The biggest feature of the new release is support for querying of complex/nested data types when stored in Apache Parquet tables. The Cloudera blog has a post describing the SQL extensions that they've implemented to this end, including several examples.

http://blog.cloudera.com/blog/2015/11/new-in-cloudera-enterprise-5-5-support-for-complex-types-in-impala/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Illinois

Flink 0.10: Graduating the Streaming API (Chicago) - Tuesday, November 24
http://www.meetup.com/Chicago-Apache-Flink-Meetup/events/226602856/

CANADA

Beyond Shuffling: Tips & Tricks for Scaling Apache Spark (Vancouver) - Monday, November 23
http://www.meetup.com/Vancouver-Spark/events/226762108/

Toronto Apache Spark #3 (Toronto) - Wednesday, November 25
http://www.meetup.com/Toronto-Apache-Spark/events/225772157/

A Sneak Peek Into Spark 1.6: From RDD to DataFrames to Datasets (Vancouver) - Wednesday, November 25
http://www.meetup.com/Vancouver-Spark/events/226606414/

Architecture Review Session (Toronto) - Friday, November 27
http://www.meetup.com/TorontoHUG/events/226673073/

IRELAND

Greenplum, Apache HAWQ, MPP vs Hadoop, Modern Data Architecture (Dublin) - Saturday, November 28
http://www.meetup.com/hadoop-user-group-ireland/events/226674070/

UNITED KINGDOM

Security Considerations in Hadoop and Big Data + More (London) - Tuesday, November 24
http://www.meetup.com/hadoop-users-group-uk/events/226589899/

Autoscaling Spark + Spark Execution Model (London) - Thursday, November 26
http://www.meetup.com/Spark-London/events/226374209/

SWEDEN

Spark After Dark 1.5 with Chris Fregly (Stockholm) - Monday, November 23
http://www.meetup.com/Stockholm-Spark/events/226278686/

Flink Streaming at Ericsson Research (Kista) - Thursday, November 26
http://www.meetup.com/Apache-Flink-Stockholm/events/226507642/

SPAIN

SparkR + H2O (Barcelona) - Wednesday, November 25
http://www.meetup.com/Spark-Barcelona/events/226722575/

FRANCE

Behind the Scenes of Google BigQuery (Toulouse) - Wednesday, November 25
http://www.meetup.com/Tlse-Data-Science/events/225773094/

DENMARK

Real-Time Analytics with Spark, Kafka, Cassandra (Copenhagen) - Wednesday, November 25
http://www.meetup.com/Big-Data-Denmark/events/226443867/

GERMANY

Pain-Free Agile Hadoop Datahub Development and Operation (Hamburg) - Thursday, November 26
http://www.meetup.com/BDNSHH/events/226343615/

HUNGARY

Spark After Dark 1.5 (Budapest) - Thursday, November 26
http://www.meetup.com/Big-Data-Meetup-Budapest/events/226365384/

INDIA

Real-Time Insights for Advertising Tech Using Apex (Pune) - Wednesday, November 25
http://www.meetup.com/Apache-Apex-incubating-Meetup-Pune/events/226506211/

Building Distributed Systems (Bangalore) - Saturday, November 28
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/226671584/

Time Series Database on Spark (Bangalore) - Saturday, November 28
http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/events/226855069/