Data Eng Weekly

Hadoop Weekly Issue #163

27 March 2016

There are several big data conferences taking place over the next few weeks, kicking off with Strata+Hadoop World in San Jose this week. I'm anticipating a lot of interesting talks, so please let me know if you come across something that I should include in a coming issue.


The Qubole blog has a guest post about how MediaMath moved their data infrastructure from a traditional MPP-based data warehouse to a cloud-based solution that decoupled storage and compute. This "data liberation" enabled several teams—data science, client analytics, engineering, and product—to have better access to data.

This post describes best practices and several alternatives for using HBase to store recommendation information. It covers namespace design, rowkey design, column family/column design, and more.

MapR has a great introduction to using PySpark with Pandas. With example data from BigML, the post shows how to build a decision tree with both Spark's MLlib and the newer Spark ML package. The tutorial switches between Spark and Pandas API calls fairly frequently, as each has strengths that complement the other's pretty well.

The AWS Big Data Blog has a quick intro to a new feature of Apache Zeppelin 0.5.6-incubating—the ability to import and export JSON descriptors of notebooks.

Also on the AWS blog, this tutorial describes how to use PySpark, Hue, and Hive to implement anomaly detection over sensor data. Steps include k-means clustering, inspecting k-means output to determine the appropriate number of clusters, identifying anomalies by calculating a distance measure, and manually inspecting the anomalies.
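The steps in that tutorial can be tried out without a cluster. What follows is a minimal plain-Python sketch rather than the tutorial's PySpark code: the sample readings, the starting centroids, and the distance threshold of 3.0 are all made up for illustration, and the helper names are hypothetical.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=dists.__getitem__)
    return idx, dists[idx]


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm, starting from the given centroids."""
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            idx, _ = nearest(p, centroids)
            groups[idx].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def find_anomalies(points, centroids, threshold):
    """Flag points whose distance to the nearest centroid exceeds threshold."""
    return [p for p in points if nearest(p, centroids)[1] > threshold]


# Hypothetical two-dimensional sensor readings: two tight groups plus one stray point.
readings = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (8.0, 8.0), (8.1, 7.9), (7.9, 8.2),
    (4.5, 15.0),
]

centroids = kmeans(readings, centroids=[(0.0, 0.0), (10.0, 10.0)])
anomalies = find_anomalies(readings, centroids, threshold=3.0)
print(anomalies)
```

In practice the threshold would come from inspecting the distribution of distances, which is roughly what the manual inspection step in the tutorial is for.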

This post shows how to make use of the Hadoop CredentialProvider API for storing the password to the Hive metastore database.

The second post in a series on the upcoming Kafka Streams library shows how to use the Kafka Streams DSL. The DSL provides several familiar functional programming methods like mapValues, flatMap, and filter, as well as join and aggregation capabilities. The code turns out to be quite concise and readable, in part thanks to the use of new Java 8 language features like method references.
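For readers who haven't used these operators, here is a rough plain-Python analogy of mapValues, flatMap, and filter over key-value records, followed by a simple count aggregation. It is not the Kafka Streams API (which is Java); the helper names and the sample records are hypothetical.

```python
def map_values(records, fn):
    """Transform each value, leaving the key untouched."""
    return ((k, fn(v)) for k, v in records)


def flat_map(records, fn):
    """Emit zero or more (key, value) records per input record."""
    return (out for k, v in records for out in fn(k, v))


def filter_records(records, predicate):
    """Keep only the records for which predicate(key, value) is true."""
    return ((k, v) for k, v in records if predicate(k, v))


# Hypothetical input: (document id, line of text) records.
lines = [("doc1", "Kafka Streams"), ("doc2", "streams DSL"), ("doc3", "")]

lowered = map_values(lines, str.lower)
# Re-key each record by word, one output record per word in the line.
words = flat_map(lowered, lambda k, v: [(w, 1) for w in v.split()])
kept = filter_records(words, lambda k, v: k != "dsl")

# Aggregate: count occurrences per word.
counts = {}
for word, n in kept:
    counts[word] = counts.get(word, 0) + n

print(counts)
```

The main difference in a real stream processor is that the input never ends, so the aggregation is maintained continuously instead of being computed once.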


Using the analogy "will diesel locomotives replace train tracks?", Pepperdata CEO Sean Suchter explains that "Will Spark replace Hadoop?" is the wrong question to ask and doesn't even make sense. He offers several alternative questions, such as "what jobs can I now run more effectively?", along with guidance for an organization exploring the adoption of Spark.

Cloudera had their Analyst Day this past week, and this post has coverage of the event. Hot topics included open source, business value, the Cloudera-Intel partnership, the cloud, and data science.

Sense, makers of a data science and analytics software platform, announced that they've been acquired by Cloudera.

Airflow, the workflow automation tool built at Airbnb, has been submitted to the Apache incubator.

"Making Sense of Stream Processing" is a new report from O'Reilly author Martin Kleppmann. Confluent is sponsoring a free download of the eBook, behind an email-wall.

DataEngConf is taking place in San Francisco April 7-8. The conference aims to bridge the gap between data engineers and data scientists, and it has tracks for those two areas.

dotScale takes place in Paris on April 25th. There are a number of talks from folks in the Hadoop and big data space.


BlueData has released a new version of their EPIC software platform for managing Hadoop infrastructure. The release focusses on QoS, security & data governance, fine-grained storage controls, and an app workbench.

Scio is a new Scala API for Google Cloud Dataflow from Spotify. Its API is inspired by Spark and Scalding, and is integrated with several Google Cloud projects as well as Algebird and Breeze.

The Google Cloud Platform made a number of announcements this week. Among them, Google Cloud Bigtable added support for hard disk drives (in addition to SSDs), Cloud Dataflow announced Python support, and BigQuery cut the price of storing historical data (data older than 90 days) in half. BigQuery also improved query performance by introducing a new storage engine, which can do some interesting things like evaluate filters without decompressing data.

Version 0.3 of KeystoneML, the framework for end-to-end machine learning pipelines built on Apache Spark, was released this week. The new version includes several new optimizations, a new linear system solver, and several new operators.


Curated by Datadog



Low-Latency Ingestion and Analytics with Kafka and Apex (San Jose) - Monday, March 28

Kafka and Data Science (San Francisco) - Monday, March 28

March Kafka Meetup (San Jose) - Tuesday, March 29

Apache Hadoop: The Next 10 Years, with Doug Cutting (San Jose) - Tuesday, March 29

Elasticsearch and Hadoop (Mountain View) - Tuesday, March 29

Taking Hadoop and Spark to the Cloud + Beer Tasting! (San Jose) - Tuesday, March 29

Spark Meetup at Strata (San Jose) - Tuesday, March 29

Building an ETL Pipeline from Scratch in 30 Mins (San Francisco) - Wednesday, March 30

Rapid Data Analytics @ Netflix (Los Angeles) - Wednesday, March 30

Stream Processing Using Heron + an Introduction to SparkR (San Francisco) - Thursday, March 31


Data Engineering Architecture at Simple with Rob Story (Portland) - Tuesday, March 29


Mesos and Big Data: Where Does the Rubber Meet the Road (Addison) - Monday, March 28


Discuss Spark Core, SparkSQL, and Interactive Notebooks with Spark (Madison) - Tuesday, March 29

North Carolina

MemSQL on "Building Real-Time Data Pipelines" (Charlotte) - Wednesday, March 30


Is Spark Replacing MapReduce? (Arlington) - Tuesday, March 29

District of Columbia

Moving from Microsoft SQL to Hive (Washington) - Wednesday, March 30

New Jersey

SnappyData + Spark = Real Time Analytics, Machine Learning, Streaming, OLTP (Hamilton) - Monday, March 28


March Presentation Night (Boston) - Tuesday, March 29


Toronto Apache Spark #7 (Toronto) - Wednesday, March 30


A Spark Tutorial (London) - Thursday, March 31


Decomposing the SMACK Stack Part One: Spark and Mesos (Stockholm) - Thursday, March 31


Spark and SPSS (Madrid) - Thursday, March 31


Spark Meetup (Paris) - Tuesday, March 29


Analytics with Cassandra and PySpark (Amsterdam) - Thursday, March 31


Basics of Spark Coding (Budapest) - Wednesday, March 30