Hadoop Weekly Issue #163

27 March 2016

There are several big data conferences taking place over the next few weeks, kicking off with Strata+Hadoop World in San Jose this week. I'm anticipating a lot of interesting talks. Please let me know if you come across something that I should include in a coming issue.

Technical

The Qubole blog has a guest post about how MediaMath moved their data infrastructure from a traditional MPP-based data warehouse to a cloud-based solution that decoupled storage and compute. This "data liberation" enabled several teams—data science, client analytics, engineering, and product—to have better access to data.

https://www.qubole.com/blog/big-data/moving-past-infrastructure-limitations/

This post describes best practices and several alternatives for using HBase to store recommendation information. It covers namespace design, rowkey design, column family/column design, and more.

http://datacentric.pl/big-data/design-hbase-data-model-recommendations/

MapR has a great introduction to using PySpark with Pandas. With example data from BigML, the post shows how to build a decision tree with both Spark's MLlib and the newer Spark ML package. The tutorial switches between Spark and Pandas API calls fairly frequently, as each has strengths that complement the other pretty well.

https://www.mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages

The AWS Big Data Blog has a quick intro to a new feature of Apache Zeppelin 0.5.6-incubating—the ability to import and export JSON descriptors of notebooks.

http://blogs.aws.amazon.com/bigdata/post/Tx1Y66KB4QZTVJL/Import-Zeppelin-notes-from-GitHub-or-JSON-in-Zeppelin-0-5-6-on-Amazon-EMR

Also on the AWS blog, this tutorial describes how to use PySpark, Hue, and Hive to implement anomaly detection over sensor data. Steps include k-means clustering, inspecting k-means output to determine the appropriate number of clusters, identifying anomalies by calculating a distance measure, and manually inspecting the anomalies.

http://blogs.aws.amazon.com/bigdata/post/Tx2642DKK75JBP8/Anomaly-Detection-Using-PySpark-Hive-and-Hue-on-Amazon-EMR
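
The steps in that tutorial (cluster, compare costs across cluster counts, flag points far from their nearest center) can be sketched in a few lines. This is a minimal pure-Python stand-in, not the tutorial's PySpark code: the function names, the synthetic readings, and the distance threshold of 20.0 are all assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Basic Lloyd's-algorithm k-means with a deterministic, evenly spaced init."""
    step = max(1, len(points) // k)
    centers = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members; keep the old
        # center if a cluster ended up empty.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members
            else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(distance(p, c) for c in centers) ** 2 for p in points)


def find_anomalies(points, centers, threshold):
    """Points whose distance to the nearest center exceeds the threshold."""
    return [
        p for p in points
        if min(distance(p, c) for c in centers) > threshold
    ]


# Hypothetical sensor readings: two tight groups plus one stray point.
readings = (
    [(20.0 + 0.1 * i, 50.0 + 0.1 * i) for i in range(10)]
    + [(30.0 + 0.1 * i, 70.0 - 0.1 * i) for i in range(10)]
    + [(60.0, 10.0)]
)

# Compare the clustering cost for a few values of k before picking one.
for k in (1, 2, 3):
    print(k, round(cost(readings, kmeans(readings, k)), 1))

centers = kmeans(readings, 2)
anomalies = find_anomalies(readings, centers, threshold=20.0)
print(anomalies)
```

On a real cluster the clustering step would presumably go through Spark's own k-means implementation, and the threshold would be chosen by inspecting the distance distribution rather than hard-coded.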

This post shows how to make use of the Hadoop CredentialProvider API for storing the password to the Hive metastore database.

https://www.mapr.com/blog/5-steps-to-remove-hive-metastore-password

The second post in a series on the upcoming Kafka Streams library shows how to use the Kafka Streams DSL. The DSL provides several familiar functional programming methods like mapValues, flatMap, and filter as well as join and aggregation capabilities. The code turns out to be quite concise and readable, in part thanks to the use of new Java 8 language features like method references.

http://codingjunkie.net//kafka-streams-part2/

News

Using the analogy "will diesel locomotives replace train tracks?", Pepperdata CEO Sean Suchter explains that "Will Spark replace Hadoop?" is both the wrong question to ask and one that doesn't make sense. He offers several alternative questions, such as "what jobs can I now run more effectively?", along with guidance for an organization exploring the adoption of Spark.

http://www.information-management.com/news/big-data-analytics/are-you-asking-all-the-wrong-questions-about-apache-spark-10028483-1.html

Cloudera had their Analyst Day this past week, and this post has coverage of the event. Hot topics included open source, business value, the Cloudera-Intel partnership, the cloud, and data science.

https://cloudpul.se/posts/cloudera-analyst-day-2016-new-strategies-mainstream-hadoop-world

Sense, makers of a data science and analytics software platform, announced that they've been acquired by Cloudera.

http://blog.sense.io/sense-joins-cloudera/

Airflow, the workflow automation tool built at Airbnb, has been submitted to the Apache incubator.

http://mail-archives.apache.org/mod_mbox/incubator-general/201603.mbox/%3CCA+_b2+tJW_ci-vHx=7=hK6V6yNFuL2gH_SHWSwdYRqYN-9HD8A@mail.gmail.com%3E

"Making Sense of Stream Processing" is a new report from O'Reilly author Martin Kleppmann. Confluent is sponsoring a free download of the eBook, behind an email-wall.

http://www.confluent.io/making-sense-of-stream-processing-ebook

DataEngConf is taking place in San Francisco April 7-8. The conference aims to bridge the gap between data engineers and data scientists, and it has tracks for those two areas.

http://www.dataengconf.com/

dotScale takes place in Paris on April 25th. There are a number of talks from folks in the Hadoop and big data space.

http://www.dotscale.io/

Releases

BlueData has released a new version of their EPIC software platform for managing Hadoop infrastructure. The release focusses on QoS, security & data governance, fine-grained storage controls, and an app workbench.

http://www.bluedata.com/blog/2016/03/announcing-the-bluedata-epic-spring-release/

Scio is a new Scala API for Google Cloud Dataflow from Spotify. Its API is inspired by Spark and Scalding, and is integrated with several Google Cloud projects as well as Algebird and Breeze.

https://github.com/spotify/scio/releases/tag/v0.1.3

The Google Cloud Platform made a number of announcements this week. Among them, Google Cloud Bigtable added support for hard disk drives (in addition to SSDs), Cloud Dataflow announced Python support, and BigQuery cut storage pricing for historical data (data older than 90 days) in half and improved query performance. BigQuery also introduced a new storage engine, which can do some interesting things like evaluate filters without decompressing data.

https://cloud.google.com/blog/big-data/2016/03/cloud-bigtable-now-supports-hdd-storage-for-big-analytics-workloads-at-lower-cost
https://cloud.google.com/blog/big-data/2016/03/google-announces-cloud-dataflow-with-python-support
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x

Version 0.3 of KeystoneML, the framework for end-to-end machine learning pipelines built on Apache Spark, was released this week. The new version includes several new optimizations, a new linear system solver, and several new operators.

http://keystone-ml.org/release.html#version-03---2016-03-24

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Low-Latency Ingestion and Analytics with Kafka and Apex (San Jose) - Monday, March 28
http://www.meetup.com/Apex-Bay-Area-Chapter/events/228716763/

Kafka and Data Science (San Francisco) - Monday, March 28
http://www.meetup.com/SF-Data-Science/events/228185306/

March Kafka Meetup (San Jose) - Tuesday, March 29
http://www.meetup.com/http-kafka-apache-org/events/229424437/

Apache Hadoop: The Next 10 Years, with Doug Cutting (San Jose) - Tuesday, March 29
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/228439749/

Elasticsearch and Hadoop (Mountain View) - Tuesday, March 29
http://www.meetup.com/Silicon-Valley-Elastic-Fantastics/events/229228842/

Taking Hadoop and Spark to the Cloud + Beer Tasting! (San Jose) - Tuesday, March 29
http://www.meetup.com/BigDataDevelopers/events/229424567/

Spark Meetup at Strata (San Jose) - Tuesday, March 29
http://www.meetup.com/spark-users/events/229353452/

Building an ETL Pipeline from Scratch in 30 Mins (San Francisco) - Wednesday, March 30
http://www.meetup.com/SF-Data-Science/events/229557678/

Rapid Data Analytics @ Netflix (Los Angeles) - Wednesday, March 30
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/229299585/

Stream Processing Using Heron + an Introduction to SparkR (San Francisco) - Thursday, March 31
http://www.meetup.com/San-Francisco-AWS-Big-Data-Meetup/events/229637068/

Oregon

Data Engineering Architecture at Simple with Rob Story (Portland) - Tuesday, March 29
http://www.meetup.com/Portland-Data-User-Group/events/229249381/

Texas

Mesos and Big Data: Where Does the Rubber Meet the Road (Addison) - Monday, March 28
http://www.meetup.com/Metroplex-Mesos-Group/events/228349615/

Wisconsin

Discuss Spark Core, SparkSQL, and Interactive Notebooks with Spark (Madison) - Tuesday, March 29
http://www.meetup.com/BigDataMadison/events/223452447/

North Carolina

MemSQL on "Building Real-Time Data Pipelines" (Charlotte) - Wednesday, March 30
http://www.meetup.com/CharlotteHUG/events/225229241/

Virginia

Is Spark Replacing MapReduce? (Arlington) - Tuesday, March 29
http://www.meetup.com/VA-DC-MD-NoSQL/events/229441725/

District of Columbia

Moving from Microsoft SQL to Hive (Washington) - Wednesday, March 30
http://www.meetup.com/Washington-DC-Apache-Hive-Users-Group/events/226015860/

New Jersey

SnappyData + Spark = Real Time Analytics, Machine Learning, Streaming, OLTP (Hamilton) - Monday, March 28
http://www.meetup.com/nj-hadoop/events/229479907/

SWEDEN

Decomposing the SMACK Stack Part One: Spark and Mesos (Stockholm) - Thursday, March 31
http://www.meetup.com/Stockholm-Spark/events/229742018/

SPAIN

Spark and SPSS (Madrid) - Thursday, March 31
http://www.meetup.com/Big-Data-Developers-in-Madrid/events/229230134/

FRANCE

Spark Meetup (Paris) - Tuesday, March 29
http://www.meetup.com/Paris-Spark-Meetup/events/229847857/

NETHERLANDS

Analytics with Cassandra and PySpark (Amsterdam) - Thursday, March 31
http://www.meetup.com/Amsterdam-Spark/events/229280010/

HUNGARY

Basics of Spark Coding (Budapest) - Wednesday, March 30
http://www.meetup.com/Budapest-Spark-Meetup/events/229462749/
