27 March 2016
There are several big data conferences taking place over the next few weeks, kicking off with Strata+Hadoop World in San Jose this week. I'm anticipating a lot of interesting talks, so please let me know if you come across something that I should include in a coming issue.
The Qubole blog has a guest post about how MediaMath moved their data infrastructure from a traditional MPP-based data warehouse to a cloud-based solution that decoupled storage and compute. This "data liberation" enabled several teams—data science, client analytics, engineering, and product—to have better access to data.
https://www.qubole.com/blog/big-data/moving-past-infrastructure-limitations/
This post describes best practices and several alternatives for using HBase to store recommendation information. It covers namespace design, rowkey design, column family/column design, and more.
http://datacentric.pl/big-data/design-hbase-data-model-recommendations/
MapR has a great introduction to using PySpark with Pandas. With example data from BigML, the post shows how to build a decision tree with both Spark's MLlib and the newer Spark ML package. The tutorial switches between Spark and Pandas API calls fairly frequently, as each has strengths that complement the other pretty well.
https://www.mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages
The AWS Big Data Blog has a quick intro to a new feature of Apache Zeppelin 0.5.6-incubating—the ability to import and export JSON descriptors of notebooks.
Also on the AWS blog, this tutorial describes how to use PySpark, Hue, and Hive to implement anomaly detection over sensor data. Steps include k-means clustering, inspecting k-means output to determine the appropriate number of clusters, identifying anomalies by calculating a distance measure, and manually inspecting the anomalies.
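The steps listed above can be sketched in a few lines of plain Python. This is only an illustration of the general technique (cluster, then flag points far from their nearest centroid), not the tutorial's PySpark code; the sample readings, the threshold of 5.0, and all function names are assumptions made up for this example.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k points for repeatability."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its members;
        # keep the old centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def cost(points, centroids):
    """Sum of squared distances to the nearest centroid (useful for comparing k)."""
    return sum(nearest(p, centroids)[1] ** 2 for p in points)


def find_anomalies(points, centroids, threshold):
    """Points whose distance to their nearest centroid exceeds threshold."""
    return [p for p in points if nearest(p, centroids)[1] > threshold]


# Hypothetical two-dimensional sensor readings (illustrative data only).
readings = [
    (1.0, 1.0), (10.0, 10.0), (1.2, 0.8), (9.8, 10.2),
    (0.9, 1.1), (10.1, 9.9), (5.0, 20.0),
]

# Inspect the cost for a few values of k before settling on one.
for k in (1, 2, 3):
    print(k, round(cost(readings, kmeans(readings, k)), 2))

centroids = kmeans(readings, 2)
anomalies = find_anomalies(readings, centroids, threshold=5.0)
print(anomalies)  # [(5.0, 20.0)]
```

In practice the threshold would come from inspecting the distribution of distances rather than being fixed up front, which corresponds to the manual inspection step.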
This post shows how to make use of the Hadoop CredentialProvider API for storing the password to the Hive metastore database.
https://www.mapr.com/blog/5-steps-to-remove-hive-metastore-password
The second post in a series on the upcoming Kafka Streams library shows how to use the Kafka Streams DSL. The DSL provides several familiar functional programming methods like mapValues, flatMap, and filter as well as join and aggregation capabilities. The code turns out to be quite concise and readable, in part thanks to the use of new Java 8 language features like method references.
http://codingjunkie.net//kafka-streams-part2/
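For readers unfamiliar with these operations, here is a rough analogy in plain Python over an in-memory list of (key, value) pairs. It is not the Kafka Streams API; the function names map_values, flat_map, and filter_records and the sample data are hypothetical and exist only to show what each operation does to a stream of records.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[str, str]


def map_values(records: Iterable[Record],
               fn: Callable[[str], str]) -> Iterator[Record]:
    """Transform each value and leave the key unchanged."""
    for key, value in records:
        yield key, fn(value)


def flat_map(records: Iterable[Record],
             fn: Callable[[str, str], Iterable[Record]]) -> Iterator[Record]:
    """Turn each record into zero or more new records."""
    for key, value in records:
        yield from fn(key, value)


def filter_records(records: Iterable[Record],
                   predicate: Callable[[str, str], bool]) -> Iterator[Record]:
    """Keep only the records for which the predicate holds."""
    for key, value in records:
        if predicate(key, value):
            yield key, value


# Illustrative input: customer name -> space-separated items purchased.
purchases = [("alice", "coffee tea"), ("bob", ""), ("carol", "Milk")]

non_empty = filter_records(purchases, lambda k, v: v != "")
lowered = map_values(non_empty, str.lower)  # pass an existing function by name
per_item = flat_map(lowered, lambda k, v: [(k, item) for item in v.split()])

result = list(per_item)
print(result)  # [('alice', 'coffee'), ('alice', 'tea'), ('carol', 'milk')]
```

Each step is lazy and composes with the next, which is part of why pipelines written in this style tend to stay short.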
Using the analogy "will diesel locomotives replace train tracks?", Pepperdata CEO Sean Suchter explains that "Will Spark replace Hadoop?" is the wrong question to ask and doesn't even make sense. He offers several alternative questions, such as "what jobs can I now run more effectively?", along with guidance for an organization exploring the adoption of Spark.
Cloudera had their Analyst Day this past week, and this post has coverage of the event. Hot topics included open source, business value, the Cloudera-Intel partnership, the cloud, and data science.
https://cloudpul.se/posts/cloudera-analyst-day-2016-new-strategies-mainstream-hadoop-world
Sense, makers of a data science and analytics software platform, announced that they've been acquired by Cloudera.
http://blog.sense.io/sense-joins-cloudera/
Airflow, the workflow automation tool built at Airbnb, has been submitted to the Apache Incubator.
"Making Sense of Stream Processing" is a new report from O'Reilly author Martin Kleppmann. Confluent is sponsoring a free download of the eBook, behind an email-wall.
http://www.confluent.io/making-sense-of-stream-processing-ebook
DataEngConf is taking place in San Francisco April 7-8. The conference aims to bridge the gap between data engineers and data scientists, and it has tracks for those two areas.
dotScale takes place in Paris on April 25th. There are a number of talks from folks in the Hadoop and big data space.
BlueData has released a new version of their EPIC software platform for managing Hadoop infrastructure. The release focusses on QoS, security & data governance, fine-grained storage controls, and an app workbench.
http://www.bluedata.com/blog/2016/03/announcing-the-bluedata-epic-spring-release/
Scio is a new Scala API for Google Cloud Dataflow from Spotify. Its API is inspired by Spark and Scalding, and is integrated with several Google Cloud projects as well as Algebird and Breeze.
https://github.com/spotify/scio/releases/tag/v0.1.3
The Google Cloud Platform made a number of announcements this week. Among them: Google Cloud Bigtable added support for hard disk drives (in addition to SSDs), Cloud Dataflow announced Python support, and BigQuery cut the storage price for historical data (data older than 90 days) in half and improved query performance. The BigQuery improvements come with a new storage engine, which can do some interesting things like evaluate filters without decompressing data.
https://cloud.google.com/blog/big-data/2016/03/cloud-bigtable-now-supports-hdd-storage-for-big-analytics-workloads-at-lower-cost
https://cloud.google.com/blog/big-data/2016/03/google-announces-cloud-dataflow-with-python-support
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x
Version 0.3 of KeystoneML, the framework for end-to-end machine learning pipelines built on Apache Spark, was released this week. The new version includes several new optimizations, a new linear system solver, and several new operators.
http://keystone-ml.org/release.html#version-03---2016-03-24
Curated by Datadog ( http://www.datadog.com )
Low-Latency Ingestion and Analytics with Kafka and Apex (San Jose) - Monday, March 28
http://www.meetup.com/Apex-Bay-Area-Chapter/events/228716763/
Kafka and Data Science (San Francisco) - Monday, March 28
http://www.meetup.com/SF-Data-Science/events/228185306/
March Kafka Meetup (San Jose) - Tuesday, March 29
http://www.meetup.com/http-kafka-apache-org/events/229424437/
Apache Hadoop: The Next 10 Years, with Doug Cutting (San Jose) - Tuesday, March 29
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/228439749/
Elasticsearch and Hadoop (Mountain View) - Tuesday, March 29
http://www.meetup.com/Silicon-Valley-Elastic-Fantastics/events/229228842/
Taking Hadoop and Spark to the Cloud + Beer Tasting! (San Jose) - Tuesday, March 29
http://www.meetup.com/BigDataDevelopers/events/229424567/
Spark Meetup at Strata (San Jose) - Tuesday, March 29
http://www.meetup.com/spark-users/events/229353452/
Building an ETL Pipeline from Scratch in 30 Mins (San Francisco) - Wednesday, March 30
http://www.meetup.com/SF-Data-Science/events/229557678/
Rapid Data Analytics @ Netflix (Los Angeles) - Wednesday, March 30
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/229299585/
Stream Processing Using Heron + an Introduction to SparkR (San Francisco) - Thursday, March 31
http://www.meetup.com/San-Francisco-AWS-Big-Data-Meetup/events/229637068/
Data Engineering Architecture at Simple with Rob Story (Portland) - Tuesday, March 29
http://www.meetup.com/Portland-Data-User-Group/events/229249381/
Mesos and Big Data: Where Does the Rubber Meet the Road (Addison) - Monday, March 28
http://www.meetup.com/Metroplex-Mesos-Group/events/228349615/
Discuss Spark Core, SparkSQL, and Interactive Notebooks with Spark (Madison) - Tuesday, March 29
http://www.meetup.com/BigDataMadison/events/223452447/
MemSQL on "Building Real-Time Data Pipelines" (Charlotte) - Wednesday, March 30
http://www.meetup.com/CharlotteHUG/events/225229241/
Is Spark Replacing MapReduce? (Arlington) - Tuesday, March 29
http://www.meetup.com/VA-DC-MD-NoSQL/events/229441725/
Moving from Microsoft SQL to Hive (Washington) - Wednesday, March 30
http://www.meetup.com/Washington-DC-Apache-Hive-Users-Group/events/226015860/
SnappyData + Spark = Real Time Analytics, Machine Learning, Streaming, OLTP (Hamilton) - Monday, March 28
http://www.meetup.com/nj-hadoop/events/229479907/
March Presentation Night (Boston) - Tuesday, March 29
http://www.meetup.com/Boston-Apache-Spark-User-Group/events/229578393/
Toronto Apache Spark #7 (Toronto) - Wednesday, March 30
http://www.meetup.com/Toronto-Apache-Spark/events/229099552/
A Spark Tutorial (London) - Thursday, March 31
http://www.meetup.com/Spark-London/events/229636441/
Decomposing the SMACK Stack Part One: Spark and Mesos (Stockholm) - Thursday, March 31
http://www.meetup.com/Stockholm-Spark/events/229742018/
Spark and SPSS (Madrid) - Thursday, March 31
http://www.meetup.com/Big-Data-Developers-in-Madrid/events/229230134/
Spark Meetup (Paris) - Tuesday, March 29
http://www.meetup.com/Paris-Spark-Meetup/events/229847857/
Analytics with Cassandra and PySpark (Amsterdam) - Thursday, March 31
http://www.meetup.com/Amsterdam-Spark/events/229280010/
Basics of Spark Coding (Budapest) - Wednesday, March 30
http://www.meetup.com/Budapest-Spark-Meetup/events/229462749/