Hadoop Weekly Issue #178

10 July 2016

This week's issue covers quite the medley of big data topics: Apache Kafka, tips for migration from Apache Pig to Apache Spark, SparkR, a tour of ten Apache stream processing projects, Apache Kudu (incubating), and more. There are also several releases this week, including new open source tools from Yahoo (distributed optimization problem solver for Spark) and LinkedIn (tools for Apache Kafka). And if you missed the recent Hadoop Summit, videos of presentations have been published to YouTube.

Technical

Confluent has an overview of their Control Center project, which provides web-based configuration for Kafka Connect and a Kafka monitoring tool that verifies that data from producers reaches all consumers. The Control Center is a paid add-on for Apache Kafka, and it is included as part of Confluent Platform Enterprise.

http://www.confluent.io/blog/introducing-confluent-control-center

This post introduces Apache Spark to Apache Pig developers. It describes some of the trade-offs between the two, and it has line-by-line examples written in both Pig and PySpark. The examples demonstrate some advanced topics like full outer joins, SparkSQL (imposing a schema on an RDD), integration with the Hive metastore, user-defined functions, map-side joins, and window functions. Even if you're not a Pig developer, this is a great introduction to PySpark.

https://philippedecuzey.wordpress.com/2016/06/05/fromapachepigtospark/

The Confluent Log Compaction post has highlights of in-progress improvements to Apache Kafka and links to several recent Kafka-related presentations/blog posts. New features targeted for the next release include a new time-based index and improved timeout handling. Work on adding new security features and improving Kafka Streams is also underway.

http://www.confluent.io/blog/log-compaction-highlights-in-the-apache-kafka-and-stream-processing-community-july-2016

Amazon S3 popularized object storage. By offering only a subset of the features of a normal file system, it achieves enormous scale and provides very high availability and durability. This post looks at why object storage is blossoming outside of AWS, and highlights three popular open-source systems.

https://opensource.com/life/16/7/object-storage

WePay has written about their analytics pipeline built on Google BigQuery and Apache Airflow (incubating). Airflow powers ETL from MySQL to BigQuery as well as production analytics.

https://wecode.wepay.com/posts/wepays-data-warehouse-bigquery-airflow

The Databricks blog has a recap of a recent tutorial on SparkR. The tutorial covered data exploration and advanced analytics. Both parts of the tutorial are available as notebooks with inline descriptions of the various calculations performed.

https://databricks.com/blog/2016/07/07/sparkr-tutorial-at-user-2016.html

Amazon has written about how they implement personalized recommendations using Spark and the Deep Scalable Sparse Tensor Neural Engine (DSSTNE). Spark is the driver, which runs within Amazon EMR and synchronizes data to S3. From S3, the data is processed by DSSTNE on GPU nodes managed by Amazon Elastic Container Service and Auto Scaling Group. DSSTNE is open-source, and there's a walkthrough of using a similar setup with the MovieLens dataset via an Apache Zeppelin notebook.

http://blogs.aws.amazon.com/bigdata/post/TxGEL8IJ0CAXTK/Generating-Recommendations-at-Amazon-Scale-with-Apache-Spark-and-Amazon-DSSTNE

Salesforce recently open sourced Runway, which is a tool for modeling and simulating distributed systems. This repository introduces a Runway model for the popular Apache BookKeeper project.

https://github.com/salesforce/runway-model-bookkeeper

Whether you're relatively new to the big data ecosystem or not, it can be really difficult to keep track of all the relevant Apache projects. When it comes to scalable stream processing, there are ten different ASF projects. The New Stack has an overview and sample use case for each (Flume, Flink, Beam, Apex, Ignite, Kafka Streams, Nifi, Samza, Spark, and Storm).

http://thenewstack.io/apache-streaming-projects-exploratory-guide/

AgilData has a post that enumerates several of Apache Kudu's (incubating) differentiators, from being based on the Raft distributed consensus algorithm to baked-in optimizations for SSDs to first-class support for SQL. The post also argues that although Kudu is in beta, it's ready now for production use cases.

http://www.agildata.com/10-reasons-we-like-kudu-as-part-of-your-big-data-strategy/

News

Videos of presentations from the recent Hadoop Summit have been posted on YouTube. There are over 150 presentations and keynotes.

https://www.youtube.com/playlist?list=PLKnYDs_-dq16K1NH83Bke2dGGUO3YKZ5b

The Syncsort blog has a two-part interview with Holden Karau, Spark committer and author of O'Reilly books. The discussion covers data hubs, data formats, mainframes, Spark 2.0, machine learning, and more.

http://blog.syncsort.com/2016/06/big-data/ibms-holden-karau-on-hadoop-etl-machine-learning-and-the-future-of-spark/
http://blog.syncsort.com/2016/06/big-data/expert-interview-series-ibms-holden-karau-hadoop-etl-machine-learning-future-spark-part-2/

Talend, the big data integration company, has filed for an IPO to raise $86.25 million.

http://siliconangle.com/blog/2016/07/04/big-data-integration-platform-provider-talend-files-for-an-ipo/

A few years ago, it was generally accepted that Hadoop ran best on bare metal in the data center. Advances in virtualization, improvements to big data software for cloud storage, and cost reductions have recently made the cloud much more practical. Qubole has an overview of some of the reasons why you might want to build your big data system in the cloud.

https://www.qubole.com/blog/big-data/cloud-infrastructure/

Releases

The MongoDB Connector for Apache Spark is now generally available. The connector has many advanced features, and the announcement includes a tutorial illustrating several of these.

https://www.mongodb.com/blog/post/the-new-mongodb-connector-for-apache-spark-in-action-building-a-movie-recommendation-engine

Version 2.4.0 of Spring for Apache Hadoop was released. This release adds support for Hortonworks HDP 2.4, for YARN resource labels, and more.

https://dzone.com/articles/spring-for-apache-hadoop-240-ga-released

Yahoo has open sourced SparkADMM, a system for solving optimization problems using the Alternating Direction Method of Multipliers (ADMM).

https://yahooresearch.tumblr.com/post/147013834176/open-sourcing-sparkadmm-a-massively-parallel
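
For readers unfamiliar with the method, here is a minimal single-process sketch of consensus ADMM in plain Python. It is a textbook toy problem, not SparkADMM's code or API: each "worker" i holds one number a_i and minimizes 0.5 * (x_i - a_i)^2, subject to all workers agreeing on a shared value z, so the consensus value should approach the mean of the inputs. The function name and parameters are hypothetical, chosen only for illustration.

```python
def admm_consensus_mean(values, rho=1.0, iterations=100):
    """Toy consensus ADMM (scaled form); hypothetical helper, not SparkADMM.

    Minimizes sum_i 0.5 * (x_i - a_i)^2 subject to x_i == z for all i.
    """
    n = len(values)
    u = [0.0] * n   # scaled dual variables, one per worker
    z = 0.0         # shared consensus variable

    for _ in range(iterations):
        # Local step: each worker solves its own small problem in closed form.
        # In a distributed setting this step could run in parallel.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(values, u)]
        # Global step: average the local solutions plus duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each worker's disagreement with the consensus.
        u = [ui + xi - z for ui, xi in zip(u, x)]

    return z


if __name__ == "__main__":
    print(admm_consensus_mean([1.0, 2.0, 6.0]))
```

The local step is the part that parallelizes naturally, which is presumably why the method is attractive on a cluster; only the averaging step needs communication between workers.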

Cloudera released version 2.1 of Cloudera Director, their tool for running Hadoop clusters in cloud environments. This new version adds support for Microsoft Azure, cross-region and cross-cloud deployments, and usage-based billing.

http://blog.cloudera.com/blog/2016/07/whats-new-in-cloudera-director-2-1/

Oracle released version 12.2.0.1.1 of Oracle GoldenGate for Big Data. GoldenGate is a tool for mirroring an Oracle database to Apache Kafka.

https://java.net/projects/oracledi/downloads/directory/GoldenGate/Oracle%20GoldenGate%20Adapter%20for%20Kafka%20Connect

LinkedIn has open sourced their kafka-assigner tool for managing cluster partitions (removing brokers, rebalancing partitions).

https://github.com/linkedin/kafka-tools

Version 2.2.0 of Luigi, the open-source workflow tool for Hadoop and other data systems, has been released. This release has many bug fixes, improvements to integration with AWS, Salesforce, and FTP, support for MSSQL, the ability to print the dependency tree as ASCII art, and much more.

https://github.com/spotify/luigi/releases/tag/2.2.0

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Dr. Elephant: Pills for Your Problematic Hadoop Jobs (San Francisco) - Tuesday, July 12
http://www.meetup.com/hadoopsf/events/231281073/

Building/Running Netflix's Data Pipeline Using Apache Kafka (Redwood City) - Wednesday, July 13
http://www.meetup.com/SF-Big-Analytics/events/231072452/

Building a Machine Learning Pipeline Using Aerosolve (San Francisco) - Wednesday, July 13
http://www.meetup.com/SF-Spark-and-Friends/events/232021862/

Expert Panel on Streaming Analytics Technologies (San Francisco) - Thursday, July 14
http://www.meetup.com/Data-Engineers-Guild/events/231878713/

Recent Developments in SparkR for Advanced Analytics (Sunnyvale) - Friday, July 15
http://www.meetup.com/Silicon-Valley-Machine-Learning/events/232460747/

New York

Streaming and Akka Persistence + Cassandra Availability Management (New York) - Tuesday, July 12
http://www.meetup.com/Reactive-New-York/events/232063482/

Robust Stream Processing with Apache Flink (New York) - Wednesday, July 13
http://www.meetup.com/NYCFlink/events/232306611/

CANADA

Introduction to Spark (Montreal) - Wednesday, July 13
http://www.meetup.com/Scala-Montreal/events/232347742/

IRELAND

Apache Flink... Don’t Cross the Streams! Modern Data Science Workflows (Dublin) - Monday, July 11
http://www.meetup.com/hadoop-user-group-ireland/events/232023668/

UNITED KINGDOM

Join Us to Learn about Apache Ignite (London) - Wednesday, July 13
http://www.meetup.com/Apache-Ignite-London/events/231963106/

ITALY

Data-Intensive Recommenders and Machine Learning Applications in Spark & Flink (Milan) - Wednesday, July 13
http://www.meetup.com/Data-Science-Milan/events/232218802/

INDIA

Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, July 16
http://www.meetup.com/hyderabad-scalability/events/229886391/

NEW ZEALAND

Apache Spark Meetup (Wellington) - Wednesday, July 13
http://www.meetup.com/Wellington-Spark-Meetup/events/232379389/
