Data Eng Weekly


Hadoop Weekly Issue #134

16 August 2015

This week's newsletter is quite short (folks in the northern hemisphere must be enjoying their summer!), but there are a couple of great articles. Specifically, two technical posts give practical advice based on real-world experience. Also, there are a few releases, including a new Gradle plugin for Hadoop that was open-sourced this week by LinkedIn.

Technical

The Cloudera blog has a guest post from Barclays about how they moved from SQL to Spark and Scala to improve the computational speed and development workflow for their Insights Engine. The post describes the problem and the solution, and provides a number of tips for working with Scala and Spark: an introduction to functional programming, understanding the resource constraints in Spark, suggestions for efficient memory representations, and more.

http://blog.cloudera.com/blog/2015/08/how-apache-spark-scala-and-functional-programming-made-hard-problems-easy-at-barclays/

This tutorial describes how to migrate data from MySQL to Cassandra using PySpark and the Spark Cassandra connector. In addition to the code required for the migration, the post discusses schema design in Cassandra and explains how to denormalize one of the tables.

http://rustyrazorblade.com/2015/08/migrating-from-mysql-to-cassandra-using-spark/
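
For readers new to the idea, here is a minimal sketch of what denormalizing a table means. It is plain Python rather than PySpark, it is not code from the post, and the table and column names (users, orders, user_name, and so on) are hypothetical. The point is only that the join happens once, at write time, so each row in the target table already carries the data a query needs.

```python
# Hypothetical normalized rows, as they might be read from two MySQL tables.
users = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 40.0},
    {"order_id": 12, "user_id": 2, "total": 15.5},
]


def denormalize(users, orders):
    """Copy each user's name onto that user's order rows."""
    # Build a lookup table once so every order can find its user's name.
    names = {u["user_id"]: u["name"] for u in users}
    return [
        {
            "user_id": o["user_id"],    # assumed partition key in the target table
            "order_id": o["order_id"],  # assumed clustering column
            "user_name": names[o["user_id"]],
            "total": o["total"],
        }
        for o in orders
    ]


orders_by_user = denormalize(users, orders)
for row in orders_by_user:
    print(row)
```

In a real migration the same join would run as a distributed job and the result would be written through the connector. The sketch only shows the shape of the output rows.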

The SparkOnHBase code, previously part of Cloudera Labs, has been integrated into (an unreleased version of) Apache HBase. This post describes the implementation and API of the new module and discusses some areas of future work.

http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

The Qubole blog has a guest post that describes a recent evaluation of several SQL engines for Hadoop. Unlike many other benchmarks, this one focuses on Hadoop in the cloud. Specifically, they looked at Spark SQL and Presto on four different file formats. As with all benchmarks, it's usually best to try things out yourself, but in this case they found Spark SQL was the best fit. The post describes the evaluation criteria (which include a few notes specific to Amazon S3) and also why Pearson is using Qubole.

http://www.qubole.com/blog/product/sql-on-hadoop-evaluation-by-pearson/

This post gives an overview of Apache Spark DataFrames with example translations from Pandas DataFrames. Regardless of your familiarity with Pandas, the post is a good overview of column projection, adding columns, filtering, aggregation, and windowing operations.

https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html

News

Databricks has introduced the Databricks Academic Partners program, which provides free access to the Databricks platform for teaching and research.

https://databricks.com/blog/2015/08/11/announcing-the-databricks-academic-partners-program.html

InfoWorld has an article describing several common projects for which companies are using Hadoop and Spark. These include specialized analysis, Hadoop as a service, streaming analytics, complex event processing, and streaming ETL.

http://www.infoworld.com/article/2969911/application-development/the-7-most-common-hadoop-and-spark-projects.html

A good complement to the previous article, this post describes several concrete examples of real-time applications powered by Spark. These include fraud detection, network security, ad processing, and medical applications.

https://www.mapr.com/blog/game-changing-real-time-use-cases-apache-spark-on-hadoop

The HadoopSphere Virtual Conclave, a virtual conference covering Hadoop, Spark, and Tajo, takes place on August 27, 2015.

http://conclave.hadoopsphere.com/

This post looks at the big data stack at WebTrends, which has adopted a number of the key technologies that have gained momentum over the past year. Specifically, they're running Spark on YARN in the cloud, which has helped them keep costs down and improve performance. The post also talks about some of the security-related features of Spark.

http://hortonworks.com/blog/how-spark-and-open-enterprise-hadoop-drive-business-value-at-webtrends/

Releases

A new release of HP Vertica and the Haven Big Data Platform includes enhanced support for Apache Hadoop and an integration with Apache Kafka. Specifically, the system can run SQL queries directly against data stored in ORCFiles in HDFS and supports ingestion from Kafka for real-time analysis.

http://siliconangle.com/blog/2015/08/11/hp-targets-speedy-new-vertica-release-at-real-time-hadoop-clusters/

Cloudera Director, the system for running Hadoop in the cloud, released version 1.5 this week. The new release adds support for the Google Cloud Platform (and a plugin interface to support additional providers), improved security and customization, and more.

http://blog.cloudera.com/blog/2015/08/whats-new-in-cloudera-director-1-5/

The Google Cloud Dataflow and Cloud Pub/Sub systems are now out of beta and are generally available. Dataflow is a fully managed system for streaming and batch analysis, and Cloud Pub/Sub provides a mechanism to link various services and APIs (including Dataflow).

http://googlecloudplatform.blogspot.com/2015/08/Announcing-General-Availability-of-Google-Cloud-Dataflow-and-Cloud-Pub-Sub.html

LinkedIn has open-sourced their Gradle plugin for Hadoop. The plugin and accompanying DSL are useful for developing Hadoop workflows with jobs in various frameworks.

http://engineering.linkedin.com/hadoop/open-sourcing-linkedin-gradle-plugin-and-dsl-apache-hadoop

WANdisco Fusion 2.6 was released this week. The new version includes support for network shaping and prioritization for replication across data centers.

http://www.digitaljournal.com/pr/2643163

Promo

O'Reilly is offering readers of Hadoop Weekly a 20% discount on any pass to the upcoming Strata + Hadoop World with discount code HADOOPW. The conference takes place September 29 through October 1 in New York. See the link below for the agenda and speaker lineup.

http://strataconf.com/big-data-conference-ny-2015/public/schedule/presentations?cmp=mp-data-confreg-info-stny15_hadoopweekly_schedule_pc

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Large Scale Distributed ML on Spark (Santa Clara) - Thursday, August 20
http://www.meetup.com/spark-users/events/223361529/

Spark Streaming & Kafka: The Future of Stream Processing (Santa Monica) - Thursday, August 20
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/223927337/

Colorado

Self-Service Data Exploration and Nested Data Analytics: Introduction to Drill (Denver) - Wednesday, August 19
http://www.meetup.com/Boulder-Denver-Big-Data/events/224539459/

Missouri

SOLR and Cloudera Search (St. Louis) - Tuesday, August 18
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/223440380/

Michigan

Practical Tips on Running Spark on Hadoop & Machine Learning in the Wild (Ann Arbor) - Thursday, August 20
http://www.meetup.com/Michigan-Spark-Users-Group/events/224251488/

Ohio

Experiences with Spark 1.4 and R (Mason) - Wednesday, August 19
http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/events/224144550/

Georgia

Document Classification on Apache Spark (Atlanta) - Wednesday, August 19
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/224454272/

Virginia

Spark Jeopardy at Zoomdata! (Reston) - Tuesday, August 18
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/223939521/

UNITED KINGDOM

Using Numerical Libraries on Spark (London) - Tuesday, August 18
http://www.meetup.com/Spark-London/events/224509495/

NETHERLANDS

Introduction Into Apache Spark (Leidschendam) - Tuesday, August 18
http://www.meetup.com/dev-070/events/223649096/

INDIA

A Deep Dive Into Apache Spark Internals (Hyderabad) - Saturday, August 22
http://www.meetup.com/Big-Data-Hyderabad/events/223277944/

CHINA

Shanghai Big Data Streaming 1st Meetup (Shanghai) - Saturday, August 22
http://www.meetup.com/Shanghai-Big-Data-Streaming-Meetup/events/224418388/

Apache Spark Startup (Xian) - Saturday, August 22
http://www.meetup.com/Xian-Apache-Spark-Meetup/events/224326895/

AUSTRALIA

Data-Intensive Applications with Hadoop and Spark (Sydney) - Thursday, August 20
http://www.meetup.com/Women-Who-Code-Sydney/events/224420806/