Data Eng Weekly


Hadoop Weekly Issue #204

12 February 2017

This week's issue is notable because it includes a number of projects at the periphery of the Hadoop ecosystem (Pachyderm, MongoDB, RethinkDB, and Hazelcast). If you're just here for the Hadoop ecosystem news, don't worry: there is great coverage of Spark on YARN, PySpark and Apache Arrow, and Kafka Streams.

Technical

Pachyderm, a data lake that supports version control for data, includes a data processing API. This article and its referenced sample code demonstrate how to use the Python API to join two datasets, and describe how Pachyderm takes care of sharding and distributing the data as necessary.

https://medium.com/pachyderm-data/easy-distributed-joins-with-pachyderm-8307bab8a761
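As a rough illustration of the kind of join described above, here is a minimal sketch of a sharded hash join in plain Python. It does not use Pachyderm's API, and the function names, record layout, and sharding scheme are all hypothetical, chosen only to show how partitioning by key lets each shard be joined independently.

```python
from collections import defaultdict
from zlib import crc32


def shard_of(key, num_shards):
    """Assign a key to a shard with a stable hash (hypothetical scheme)."""
    return crc32(str(key).encode("utf-8")) % num_shards


def sharded_join(left, right, key, num_shards=4):
    """Inner-join two lists of dicts on `key`, one shard at a time."""
    # Partition both inputs so that matching keys land in the same shard.
    left_shards = defaultdict(list)
    right_shards = defaultdict(list)
    for row in left:
        left_shards[shard_of(row[key], num_shards)].append(row)
    for row in right:
        right_shards[shard_of(row[key], num_shards)].append(row)

    joined = []
    for shard in range(num_shards):
        # Build a hash index over the left rows of this shard.
        index = defaultdict(list)
        for row in left_shards[shard]:
            index[row[key]].append(row)
        # Probe the index with the right rows of the same shard.
        for row in right_shards[shard]:
            for match in index.get(row[key], []):
                joined.append({**match, **row})
    return joined
```

Because each shard only needs the rows that hash to it, the per-shard loop could in principle run on separate workers; in this sketch it simply runs sequentially.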

While configuring and monitoring a highly-available YARN cluster isn't the easiest thing, it does offer some clear advantages as a platform for running Spark. These include support for multiple versions of Spark, resource management, and resilience. The Altiscale blog elaborates on these and other benefits.

https://www.altiscale.com/blog/why-spark-on-yarn-and-not-standalone/

This post shares an analysis of MongoDB's replication and durability guarantees in the face of Jepsen testing (which introduces network partitions and other failure scenarios). MongoDB's previous replication system has inherent flaws, but a new replication system (based on Raft) fixes the fatal ones. In fact, no major bugs related to lost updates, dirty reads, or stale reads were found in version 3.4.0 of MongoDB.

https://jepsen.io/analyses/mongodb-3-4-0-rc3

This presentation describes how Apache Arrow speeds up the bridge between Python and the JVM for PySpark. The talk starts by describing how PySpark interacts with the Spark execution engine, and it then describes some of the improvements that already exist and some speedups that are coming down the pike.

http://www.slideshare.net/wesm/improving-python-and-spark-pyspark-performance-and-interoperability

This post demonstrates using Pivotal Cloud Foundry to launch a PySpark application that trains a linear model, and to launch a Python Flask application that serves predictions based on the trained model coefficients.

https://content.pivotal.io/blog/operationalizing-pyspark-data-science-models-on-pivotal-cloud-foundry
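To make the "serve predictions from trained coefficients" idea concrete, here is a minimal sketch in plain Python. It is not the code from the post: the model values, the JSON request shape, and the function names are hypothetical, and a real deployment would wrap something like `handle_request` in a web framework route.

```python
import json


def predict(coefficients, intercept, features):
    """Evaluate a linear model: intercept + sum(coefficient * feature)."""
    if len(coefficients) != len(features):
        raise ValueError("expected %d features, got %d"
                         % (len(coefficients), len(features)))
    return intercept + sum(c * x for c, x in zip(coefficients, features))


def handle_request(body, model):
    """Turn a JSON request body into a JSON prediction response.

    `model` is a dict with "coefficients" and "intercept" keys
    (a hypothetical format for the exported training result).
    """
    features = json.loads(body)["features"]
    value = predict(model["coefficients"], model["intercept"], features)
    return json.dumps({"prediction": value})


# Made-up coefficients standing in for the output of a training job.
EXAMPLE_MODEL = {"coefficients": [2.0, 0.5], "intercept": 1.0}
```

Calling `handle_request('{"features": [3.0, 4.0]}', EXAMPLE_MODEL)` evaluates 1.0 + 2.0 * 3.0 + 0.5 * 4.0 and returns the result as JSON. Serving only needs the coefficients, so the prediction service has no dependency on Spark.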

Cloudera has a post demonstrating analysis of flight data with sparklyr, the Spark-based backend for dplyr.

http://blog.cloudera.com/blog/2017/02/analyzing-us-flight-data-on-amazon-s3-with-sparklyr-and-apache-spark-2-0/

AWS has published scripts for importing Hive table definitions into Amazon Athena (the Presto-based, hosted big data query engine), and a blog post that describes how to use them.

https://aws.amazon.com/blogs/big-data/migrate-external-table-definitions-from-a-hive-metastore-to-amazon-athena/

The Databricks blog has an example of using the Intel BigDL project for deep learning with Apache Spark. The post describes how to get started, including training the model and evaluating the predictions it makes.

https://databricks.com/blog/2017/02/09/intels-bigdl-databricks.html

This tutorial builds on a Kafka Streams application that consumes demographic data about countries to create a streaming calculation of the top-3 countries by population within each continent.

https://technology.amis.nl/2017/02/12/apache-kafka-streams-running-top-n-grouped-by-dimension-from-and-to-kafka-topic/
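The running top-N computation described above can be sketched without Kafka at all. The following plain-Python example is only an illustration of the idea, not the tutorial's Kafka Streams code; the class name and the population figures are made up.

```python
import heapq
from collections import defaultdict


class TopNByGroup:
    """Maintain a running top-N ranking per group as records arrive."""

    def __init__(self, n=3):
        self.n = n
        # group -> {name: latest value}; the latest record per name wins.
        self._state = defaultdict(dict)

    def update(self, group, name, value):
        """Upsert one record and return the group's current top N."""
        self._state[group][name] = value
        return heapq.nlargest(
            self.n, self._state[group].items(), key=lambda item: item[1]
        )


top = TopNByGroup(n=3)
# Illustrative figures (millions), not real data.
for country, population in [("Germany", 80), ("France", 66),
                            ("Italy", 60), ("Spain", 46)]:
    europe = top.update("Europe", country, population)
```

After the loop, `europe` holds Germany, France, and Italy in that order. This sketch keeps every country per continent rather than only the current top three, which costs more memory but stays correct if a later record lowers a country's value.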

News

The first two chapters of "Data Science on the Google Cloud Platform" are available as part of the O'Reilly early release.

http://shop.oreilly.com/product/0636920057628.do

The Cloud Native Computing Foundation (CNCF) has purchased the rights to RethinkDB and has relicensed it under the Apache License. This adds a strong distributed database system to the CNCF portfolio of hosted projects, which includes Fluentd and Kubernetes.

https://www.joyent.com/blog/the-liberation-of-rethinkdb

Apache Ranger, the security system for the Hadoop ecosystem, has graduated to be a top-level project. The announcement has a good overview of the features Ranger provides as well as some companies that are using it.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces3

Hadoop-as-a-Service vendor Qubole has announced that they're now SOC 2 Type II compliant.

http://www.qubole.com/blog/qubole-successfully-completes-soc-2-type-ii-examination/

Spark Summit East was last week in Boston. This post has summaries of and links to videos/slides from a number of sessions at the Summit.

https://databricks.com/blog/2017/02/09/spark-summit-east-2017-another-record-setting-spark-summit.html

Releases

The Hortonworks blog has a recap of the features in the recently released Apache Zeppelin 0.7.0. Key features include improvements to multi-user support, new pluggable visualization support, and Spark improvements (adding support for Spark 2.1).

http://hortonworks.com/blog/welcome-apache-zeppelin-0-7-0

Apache Flink 1.2.0 was released. It's a giant release that resolves over 650 issues but maintains backwards compatibility with all public APIs. This post has an overview of the key features, including dynamic scaling of streaming jobs (by restoring from a savepoint), support for running Flink with Apache Mesos, experimental support for encryption in transit, and major improvements to the Table API.

http://flink.apache.org/news/2017/02/06/release-1.2.0.html

In addition to their compliance news, Qubole has announced the general availability of the Qubole Data Service on the Oracle Bare Metal Cloud Service.

http://www.qubole.com/blog/product/now-generally-available-qds-on-oracle-bare-metal-cloud-service/

MapR has announced the MapR Converged Data Platform for Docker, which provides a mechanism for running Docker containers atop MapR. Using the MapR file system and MapR Streams, microservices can be relocated to another server in the cluster without losing any state.

https://www.mapr.com/blog/persistence-age-microservices-introducing-mapr-converged-data-platform-docker

Apache Beam 0.5.0 was released this week. This release adds new APIs for State and Timers. According to the JIRA release notes, over 25 bugs were resolved, and 25+ improvements and new features are part of the new version.

http://www.mail-archive.com/announce@apache.org/msg03687.html

Hazelcast, makers of the in-memory data grid of the same name, have announced a new open-source distributed processing system called Hazelcast Jet. Jet uses cooperative multithreading to take advantage of multi-core CPUs, implements distributed support for java.util.stream, and more. The code is available on GitHub.

https://dzone.com/articles/introducing-hazelcast-jet

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

#SDBigData Meetup #20 (San Diego) - Wednesday, February 15
https://www.meetup.com/sdbigdata/events/236400797/

55th Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, February 15
https://www.meetup.com/hadoop/events/237178448/

Big-Data-As-A-Service: Big Data Analytics on AWS (Santa Clara) - Wednesday, February 15
https://www.meetup.com/Big-Data-as-a-Service/events/237209923/

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Thursday, February 16
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/237171557/

Mining Member Feedback to Improve the Customer Experience: Nishant Hegde from Netflix (Culver City) - Thursday, February 16
https://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/236711651/

Washington

Data in Motion with Open Source Apache NiFi (Bellevue) - Wednesday, February 15
https://www.meetup.com/Big-Data-Bellevue-BDB/events/237387396/

Illinois

Keeping Spark on Track: Best Practices Using Apache Spark in Production (Chicago) - Monday, February 13
https://www.meetup.com/acm-chicago/events/237315991/

Virginia

Hybrid Transactional/Analytical Processing Using Spark & In-Memory Data Fabrics (Tysons) - Thursday, February 16
https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/237076470/

New York

Introduction to Sendence Wallaroo: An Industrial-Grade Streaming Data Platform (New York) - Thursday, February 16
https://www.meetup.com/New-York-City-Storm-User-Group/events/237318240/

CANADA

Distributed Redundant Queueing with Apache Kafka (Kitchener) - Wednesday, February 15
https://www.meetup.com/Intersections-KW/events/236375299/

GERMANY

Apache Flink Meetup (Berlin) - Thursday, February 16
https://www.meetup.com/Apache-Flink-Meetup/events/236896351/

AUSTRIA

Hadoop User Group Meetup (Vienna) - Tuesday, February 14
https://www.meetup.com/Hadoop-User-Group-Vienna/events/236873308/

CZECH REPUBLIC

How It Works at Hortonworks (Prague) - Thursday, February 16
https://www.meetup.com/CS-HUG/events/237217644/

ISRAEL

Resilient Events Handling & Kappa Architecture (Herzeliyya) - Wednesday, February 15
https://www.meetup.com/Big-things-are-happening-here/events/237348726/

SOUTH AFRICA

Apache Kafka at Takealot.com: Use Cases and Production Considerations (Cape Town) - Wednesday, February 15
https://www.meetup.com/meetup-group-cxuwulGL/events/237157052/