Data Eng Weekly


Hadoop Weekly Issue #139

27 September 2015

Strata + Hadoop World is this week in NYC, and a large number of related meetups are taking place there. In anticipation of the conference, there are a few new releases, and we can expect to see many more announcements this week. I won't be attending, so please send along any interesting news and presentations that you see.

Technical

If you like reading about distributed systems or are interested in learning more about the CAP theorem, then Martin Kleppmann's "A Critique of the CAP Theorem" is for you. It discusses the theorem and many of the common confusions in terminology. It then proposes an alternative to the CAP theorem, which is aimed at helping practitioners reason about common trade-offs.

http://arxiv.org/abs/1509.05393

The Apache blog describes building an Apache NiFi flow that ingests tweets from the Twitter API, does some lightweight processing, and stores the resulting tweets in Solr. It demonstrates some of NiFi's built-in tools, such as JSON evaluation and batching.

https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and

The Databricks blog has a post that gives an overview of Spark's implementation of Latent Dirichlet Allocation (LDA). Spark implements an online variant of the algorithm, which improves performance and scalability. The post links to example code on GitHub and provides a number of tips for using LDA.

https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-spark.html

Spark Testing Base is a library for testing Spark code in Scala and Java. This post gives an overview of the functionality, which includes the ability to test non-trivial jobs (such as Spark Streaming).

http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/

This post articulates several reasons why it's a good idea to invest in operating a centralized schema registry for a data platform. Reasons include enforcing safe schema evolution, storage efficiency, data discovery, and data policy enforcement. The post also describes why it's critical for stream processing.

http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
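
To make "safe schema evolution" concrete, here is a minimal sketch of the kind of check a registry might run before accepting a new schema. It is a simplification written for this digest, not the Confluent registry's API: the function name `can_read` and the dict-based field layout are hypothetical, and the check only covers added or removed fields (it ignores type changes).

```python
def can_read(reader_fields, writer_fields):
    """Return True if a reader using reader_fields can decode records
    written with writer_fields.

    Simplified rule (assumption): every field the reader expects must
    either exist in the writer's schema or carry a default value.
    """
    writer_names = {field["name"] for field in writer_fields}
    return all(
        field["name"] in writer_names or "default" in field
        for field in reader_fields
    )


# Hypothetical schema versions, described as lists of field dicts.
v1 = [{"name": "user_id", "type": "long"}]
v2_safe = v1 + [{"name": "country", "type": "string", "default": "unknown"}]
v2_unsafe = v1 + [{"name": "country", "type": "string"}]

print(can_read(v2_safe, v1))    # True: the default fills in the missing field
print(can_read(v2_unsafe, v1))  # False: old records have no value for "country"
```

A registry that rejects `v2_unsafe` at registration time keeps consumers on the new schema from failing on older records, which matters most for long-running stream processors.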

Erasure coding is a well-known data-protection mechanism that can incur less storage overhead than Hadoop's three-way replication. Adding it to HDFS was proposed over five years ago, and engineers from Cloudera and Intel are working on it for the upcoming Hadoop 3.0 release. This blog post has an in-depth overview of the strategy and implementation, which takes advantage of hardware acceleration for encoding and decoding parity data.

http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
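
As a rough illustration of the overhead argument, the sketch below compares three-way replication with a Reed-Solomon layout of 6 data units and 3 parity units (that layout is assumed here for illustration), and then shows recovery with single XOR parity, the simplest erasure code. The helper names are hypothetical, and none of this reflects the actual HDFS implementation.

```python
from functools import reduce


def overhead(data_units, redundant_units):
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three-way replication: one original plus two extra copies.
replication_overhead = overhead(data_units=1, redundant_units=2)
# Assumed Reed-Solomon layout: 6 data units protected by 3 parity units.
rs_overhead = overhead(data_units=6, redundant_units=3)
print(f"3x replication: {replication_overhead:.0%} extra storage")
print(f"RS(6,3):        {rs_overhead:.0%} extra storage")

# XOR parity: one parity block lets us rebuild any single lost data block.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)
recovered = xor_blocks([data[0], data[2], parity])  # pretend data[1] was lost
print(recovered == data[1])
```

Reed-Solomon codes generalize the XOR idea to tolerate several simultaneous losses, at the cost of more expensive encoding and decoding.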

Hue includes Livy, a REST interface for interacting with Spark. This post describes how to start Livy to run Spark jobs, and it gives examples of starting a Spark shell and entering commands via the REST API.

http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/

Unlike Java or Scala libraries, Python libraries often aren't portable across machines. This can cause problems for a distributed computation with PySpark, but there are a few strategies for distributing the necessary libraries. This post describes them (e.g. shipping a .py file or an egg, or setting up a virtualenv on each node) and when each is most appropriate.

http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/

This post describes Coursera's data infrastructure, which ties together Cassandra, Scalding, Amazon Redshift, and more. To manage workflows, they use Dataduct, a Python framework for AWS Data Pipeline.

http://blogs.aws.amazon.com/bigdata/post/Tx2Q3JGH427TL8Z/How-Coursera-Manages-Large-Scale-ETL-using-AWS-Data-Pipeline-and-Dataduct

News

The O'Reilly Radar blog has a post about how the Apache Drill project grew a community and how the community helped shape the project. For example, an early design meeting was streamed for remote participants outside of the Bay Area.

http://radar.oreilly.com/2015/09/apache-drill-tracking-its-history-as-an-open-source-community.html

VentureBeat reports that Cloudera is working on a new storage engine called Kudu, which aims to offer features that sit between those of HBase and HDFS.

http://venturebeat.com/2015/09/24/cloudera-kudu/

MapR has a post introducing Apache Flink. The article describes the origins of the project, the meaning of the name "Flink," and Flink's event-based stream processing. On the topic of stream processing, it discusses when true streaming makes more sense than micro-batching.

https://www.mapr.com/blog/apache-flink-new-way-handle-streaming-data

Releases

Version 0.3.0 of Apache NiFi, the data processing and distribution system, was released this week. This release includes performance improvements, integration with Ambari, support for processing images, support for Kerberos-enabled Hadoop clusters, and new Avro capabilities.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201509.mbox/%3CCAFddr26dEGVxSDwRtG-3Efj0wX7MF+EBoLRKwFoUAFrhu_+3Pg@mail.gmail.com%3E

Spark-Timeseries is a new library for working with time series data in Spark. It provides an abstraction for time series datasets and includes support for various manipulation functions (e.g. aligning, missing value imputation) and stats/models (such as exponentially weighted moving average).

http://cloudera.github.io/spark-timeseries/
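
For readers unfamiliar with the exponentially weighted moving average mentioned above, here is a plain-Python sketch of the standard recurrence s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_(t-1). The function name `ewma` and the choice to seed with the first observation are assumptions for illustration; this is not the Spark-Timeseries API.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average of a sequence.

    alpha in (0, 1] is the smoothing factor: higher values weight
    recent observations more heavily. The first smoothed value is
    seeded with the first observation (an assumption of this sketch).
    """
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(ewma([1, 2, 3], alpha=0.5))  # [1.0, 1.5, 2.25]
```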

Apache Sentry 1.6.0-incubating was released this week. Sentry is a system for fine-grained access control in Hadoop, and the new release adds a Sqoop2 integration, a new dump/load tool, and more. The release also contains a number of bug fixes and improvements.

https://blogs.apache.org/sdp/entry/apache_sentry_1_6_0

Apache Accumulo, the distributed key-value store, released version 1.5.4. The bug-fix release includes a fix for a data-loss bug.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201509.mbox/%3C5600BE0B.1020104@apache.org%3E

Version 2.0 of BlueData EPIC was announced. The release switched to a Docker-based deployment system, which provides the flexibility of managing a cluster of virtualized machines in addition to physical machines. Other highlights include support for Apache Zeppelin and an app store for installing partner applications.

http://www.bluedata.com/blog/2015/09/introducing-bluedata-epic-2-0/

Google Cloud Dataproc is a new offering from the Google Cloud Platform for deploying Hadoop and Spark clusters. The system is integrated with Google's other cloud services and is priced at 1 cent per virtual CPU per hour (on top of the normal instance cost).

http://googlecloudplatform.blogspot.com/2015/09/Google-Cloud-Dataproc-Making-Spark-and-Hadoop-Easier-Faster-and-Cheaper.html
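
As a back-of-the-envelope sketch of what that pricing means, the snippet below computes only the Dataproc premium for a cluster. The cluster shape (10 nodes with 4 virtual CPUs each, running for 24 hours) is a made-up example, the function name is hypothetical, and the underlying instance cost is deliberately left out.

```python
PREMIUM_CENTS_PER_VCPU_HOUR = 1  # from the announcement: 1 cent per vCPU per hour


def dataproc_premium_cents(nodes, vcpus_per_node, hours):
    """Dataproc surcharge in cents, excluding the normal instance cost."""
    return nodes * vcpus_per_node * hours * PREMIUM_CENTS_PER_VCPU_HOUR


# Hypothetical cluster: 10 nodes x 4 vCPUs, running for one day.
cents = dataproc_premium_cents(nodes=10, vcpus_per_node=4, hours=24)
print(f"${cents / 100:.2f}")  # $9.60
```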

Cask has released version 3.2 of the Cask Data Application Platform. The new release includes Cask Hydrator (a framework and UI for batch/real-time data ingestion and ETL), new auditing and lineage support, views, and more.

http://blog.cask.co/2015/09/announcing-cdap-3-2-hydrator-and-much-more/

Cascading-Flink is a new project to use Apache Flink as the execution engine for Cascading flows. Key features include sophisticated memory management (reducing the risk of OutOfMemoryErrors) and performance improvements for flows with type information. The project doesn't yet support hash-based outer joins, and it relies on a development version of Apache Flink.

http://www.cascading.org/cascading-flink/

Apache Hadoop 2.6.1 was released with critical fixes that were back-ported from the 2.7 and 2.8 development trees.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201509.mbox/%3CCAMyYaRKbVddxP9-X%3DxLBVsQS5PvL-isLnYKrkw0BzABKfkcNxQ%40mail.gmail.com%3E

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Scalable Machine Learning at Yahoo (San Jose) - Monday, September 28
http://www.meetup.com/SF-Bay-ACM/events/223655195/

Introduction to BigQuery (Clovis) - Thursday, October 1
http://www.meetup.com/googledevelopers/events/220039294/

Arizona

Enterprise Dataflow with Apache NiFi (Tempe) - Thursday, October 1
http://www.meetup.com/Phoenix-Hadoop-User-Group/events/225278294/

Illinois

Spark DataFrames (Chicago) - Tuesday, September 29
http://www.meetup.com/Chicago-Spark-Users/events/225509902/

Wisconsin

Learn about Improvements in Apache Spark (Madison) - Tuesday, September 29
http://www.meetup.com/BigDataMadison/events/223205350/

Georgia

Apache Ranger for Securing Hadoop (Atlanta) - Wednesday, September 30
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/224789065/

North Carolina

September CHUG Event: SnapLogic (Charlotte) - Wednesday, September 30
http://www.meetup.com/CharlotteHUG/events/219153242/

New York

Rethinking SQL for Big Data with Apache Drill (New York) - Monday, September 28
http://www.meetup.com/Hadoop-NYC/events/224931207/

One Hadoop, Multiple Clouds (New York) - Monday, September 28
http://www.meetup.com/big-data/events/225017863/

Meetup at Strata + Hadoop World NYC 2015 (New York) - Monday, September 28
http://www.meetup.com/Sentry-User-Meetup/events/225292617/

Best Practices for PySpark, with Juliet Hougland of Cloudera (New York) - Tuesday, September 29
http://www.meetup.com/NYC-Data-Science/events/224075052/

Using Python at Scale for Data Science, with Wes McKinney (New York) - Tuesday, September 29
http://www.meetup.com/NYC-Open-Data/events/225011954/

Hadoop World NYC 2015 (New York) - Tuesday, September 29
http://www.meetup.com/Apache-Kafka-NYC/events/223419893/

Resolving Transactional Access/Analytic Performance Trade-Offs in Hadoop (New York) - Tuesday, September 29
http://www.meetup.com/Hadoop-NYC/events/224102527/

Committer Night: Spark 1.5 and Beyond (New York) - Tuesday, September 29
http://www.meetup.com/New-York-Spark-Meetup/events/225041216/

HBase Meetup (New York) - Tuesday, September 29
http://www.meetup.com/HBase-NYC/events/223636134/

Oryx 2: Lambda Architecture on Spark, Kafka for Real-Time Large Scale ML (New York City) - Tuesday, September 29
http://www.meetup.com/NYC-Machine-Learning/events/225586834/

Impala Lightning Talks in NYC (New York) - Tuesday, September 29
http://www.meetup.com/Bay-Area-Impala-Users-Group/events/223917097/

Hadoop World 2015 (New York) - Wednesday, September 30
http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/224248923/

MADlib + HAWQ for Advanced SQL Machine Learning on Hadoop (New York) - Thursday, October 1
http://www.meetup.com/Pivotal-NY/events/225074025/

Twitter Heron: Stream Processing at Scale (New York) - Thursday, October 1
http://www.meetup.com/New-York-City-Storm-User-Group/events/225001338/

FRANCE

1st Meetup of Hadoop User Group Rennes (Rennes) - Wednesday, September 30
http://www.meetup.com/Hadoop-User-Group-Grand-Ouest/events/225226952/

Hadoop Meetup Sur La Seine (Paris) - Thursday, October 1
http://www.meetup.com/Hadoop-User-Group-France/events/225606051/

GERMANY

Cascading on Flink & Tracking the Trackers with Flink (Berlin) - Wednesday, September 30
http://www.meetup.com/Apache-Flink-Meetup/events/225282775/

HUNGARY

Big Data Meetup: September 2015 (Budapest) - Monday, September 28
http://www.meetup.com/Big-Data-Meetup-Budapest/events/224725378/

ROMANIA

5th BigData/DataScience Cluj-Napoca Meetup (Cluj-Napoca) - Wednesday, September 30
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/225393112/