Data Eng Weekly

Hadoop Weekly Issue #106

01 February 2015

The two main topics in this week’s newsletter both gained traction in 2014 and will likely be major topics for 2015: Apache Spark and security. On the security front, Cloudera has two posts this week and Hortonworks announced a new data governance initiative. Coverage of Spark includes interviews with Databricks co-founder Ion Stoica and a technical post on streaming k-means.


K Young, CEO of Mortar (disclosure: Mortar sponsors the events section of this newsletter), has a post on data lakes, data pipelines, and data directories. Although data lakes are a hot topic right now, K argues that it's better to invest in data pipelines, and he discusses how Luigi is a good solution for building a pipeline.

The Cloudera blog has a post about a new integration for CDH 5.3 between Sentry (the role-based access control layer for Hive) and HDFS ACLs. The post looks at how the integration allows Sqoop and Sentry to co-exist for the first time.

Hortonworks has the third post in a series on predicting airline delays with Hadoop. This post looks at using Scalding and R (previous posts covered Spark and Pig). Like the previous posts, there's an IPython notebook that walks through all the individual steps.

The Hortonworks blog has a post summarizing some recent improvements to YARN that are part of HDP 2.2. Topics include: support for long running applications (Apache Slider), new types of resource management (CPU in addition to RAM slots, node labeling), and improvements to operational support (including rolling upgrades).

"The Morning Paper" is a blog that recaps various computer science papers. This week, it looked at the ZooKeeper paper from 2010. It’s a good overview that serves as supplemental reading material or a refresher if it’s been a while since you read it.

Spark 1.2 introduced a streaming implementation of k-means with the ability to dynamically detect (and remove) clusters over time. The key to this feature is forgetfulness, which is implemented as a half-life parameter to decay old data. The Databricks blog has a post with more details on the algorithm, including several visualizations of it in action.
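To make the half-life idea concrete, here is a minimal sketch of a "forgetful" center update for a single cluster. It is an illustration only, not Spark's code: the function names, the per-batch time unit, and the plain-list point representation are all assumptions, and it omits the detection and removal of dying clusters.

```python
import math


def decay_factor(half_life):
    """Per-batch discount chosen so that old weight halves every `half_life` batches."""
    return math.exp(math.log(0.5) / half_life)


def update_center(center, weight, batch_points, half_life):
    """Return (new_center, new_weight) for one cluster after one batch.

    center       -- current center as a list of floats
    weight       -- effective number of points seen so far (already discounted)
    batch_points -- points assigned to this cluster in the new batch
    half_life    -- number of batches after which old data counts half as much
    """
    discounted = weight * decay_factor(half_life)
    m = len(batch_points)
    if m == 0:
        # No new data: the center stays put, but its weight keeps decaying.
        return center, discounted

    dims = len(center)
    batch_mean = [sum(p[d] for p in batch_points) / m for d in range(dims)]
    total = discounted + m
    # Weighted average of the discounted old center and the new batch mean.
    new_center = [
        (center[d] * discounted + batch_mean[d] * m) / total for d in range(dims)
    ]
    return new_center, total
```

With a short half-life the old center is discounted quickly, so the cluster tracks recent data closely; with a long half-life the update approaches an ordinary running mean.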

Cloudera has a post describing the enterprise security features that are part of CDH 5. Topics include Apache Sentry, integration with Active Directory and Kerberos, centralized audit logging, and encryption (plus key management). Not all of these features are available in the free version of CDH, but Cloudera claims many of the features aren't available in any other distribution, either.

The Confluent blog has a post from Martin Kleppmann, the author of the upcoming book “Designing Data-Intensive Applications.” The post is an edited transcript of a recent talk on stream processing. It covers a large number of topics, including streaming aggregation, relation to database systems, and several tools. The post is a great overview of important concepts in stream processing.

The Mortar blog has the transcript and video of a recent talk at the NYC Pig User Group. The talk describes the types of problems that Pig is really good for, its shortcomings, and the strengths and weaknesses of the user-facing APIs.

LinkedIn has written about their usage of Kafka and plans for the future. The post provides an insight into what they’re using Kafka for (including monitoring, messaging, analytics) and tools they’ve built around it (a REST API, schema registry, auditing service). Future plans include support for security, improved reliability/availability, and cost efficiency.

The AWS Big Data Blog has a tutorial describing how to set up an Elastic MapReduce cluster with Elasticsearch and Kibana.


The Apache Software Foundation has announced that Samza has graduated from the incubator and is now a top-level project. Samza, the distributed stream processing framework, uses YARN for fault tolerance and integrates with Kafka.

MapR announced an initiative this week to provide free on-demand Hadoop training for developers, analysts, and administrators. Currently available courses are “Hadoop Essentials,” “Hadoop Operations: Cluster Administration,” and “Developing Hadoop Applications.” Future courses will cover HBase, Drill, and Hive.

Typesafe and Databricks announced results of a recent survey of Scala and Spark developers. Among the highlights: 13% of respondents already have Spark in production and 20% plan to deploy to production in the coming year. ReadWrite has more coverage of the survey, and a follow-up interview with Typesafe’s architect for Big Data Products and Services, Dean Wampler.

TechRepublic has an interview with Ion Stoica, the co-founder of Databricks, about Spark. The post emphasizes Spark’s versatility: it supports batch, streaming, SQL, and machine learning. There are a few other interesting tidbits, including mention of Spark support for R in the future and the importance of libraries for Spark.

Hortonworks announced the Data Governance Initiative to develop software to meet enterprise requirements for data governance. Along with Hortonworks, Aetna, Merck, Target, and SAS will be working on the initiative, which will include further integrating Apache Falcon and Apache Ranger.

The Splice Machine RDBMS, which is built on top of HDFS and HBase, is now certified for Hortonworks HDP.


SequenceIQ has announced a new beta release of Cloudbreak, the cloud-agnostic Hadoop-as-a-Service framework. The new version includes user accounts, a usage explorer, support for heterogeneous clusters, support for OpenStack, and more.

HFactory is a platform for building HBase-based applications using Scala. This week, version 1.2 was released with a few enhancements and new features.

VoltDB announced version 5.0, which includes expanded Hadoop ecosystem support. Specifically, VoltDB is now integrated with HDFS, MapReduce, and Kafka. It also supports exporting data as Avro.

Cloudera announced bug fix releases of Cloudera Manager (5.2.2 and 5.3.1) and Cloudera Navigator (2.1.2 and 2.2.1).

Cloudera also announced a new release of the Impala ODBC and JDBC drivers. The new versions support HiveServer2 from CDH 4.1+ and Impala 1.0+.


Curated by Mortar Data



Interactive Session on Sparkling Water = Spark + H2O (Mountain View) - Tuesday, February 3

Bayesian Networks with R and Hadoop (Palo Alto) - Wednesday, February 4

Nonstop HBase: Making HBase Safe and Bulletproof by Ryan Rawson of WANDisco (Los Angeles) - Thursday, February 5

Building Real-World Machine Learning Applications with PredictionIO and Spark ML (Mountain View) - Friday, February 6


Hadoop-Based Big Data Analytics with Datameer (Bellevue) - Thursday, February 5


Oozie or Easy: Managing Hadoop Workflows the Easy Way (Tempe) - Wednesday, February 4


Hands-on Spark Workshop for Beginners (Boulder) - Saturday, February 7


Sean Busbey on Apache Accumulo (Austin) - Wednesday, February 4


Machine Learning and Data Ingestion with Apache Storm, Kafka (Oklahoma City) - Thursday, February 5


Performance Tuning Cassandra at Target (Minneapolis) - Monday, February 2


Intro to Hadoop Components and Distributions (Brentwood) - Monday, February 2


Introduction to Big Data Techniques for Cybersecurity (Rockville) - Monday, February 2

Introduction to Apache Accumulo: Architecture and Use Cases (Jessup) - Tuesday, February 3


Get Started with Hadoop Experts: Big Data for Social Good Challenge (Cambridge) - Tuesday, February 3


Greenplum Deep Dive (Toronto) - Tuesday, February 3


Primera Reunión de Apache Spark (Mexico City) - Friday, February 6


First Galway Data Meetup, with Michael Hausenblas of MapR (Galway) - Tuesday, February 3


Spark Meetup at Viadeo (Paris) - Wednesday, February 4

Batch on Hadoop with Cascading (Lyon) - Thursday, February 5


Hadoop and Data Warehouse–Friends, Enemies or Profiteers? What about Real-Time? (Cologne) - Wednesday, February 4


Cassandra: How It Works and What It's Good For! (Vienna) - Wednesday, February 4


Lessons I Learned Building a Big Data Startup (Tel Aviv) - Monday, February 2

Tez vs Spark (Tel Aviv) - Sunday, February 8


Introduction to Spark (Zagreb) - Tuesday, February 3


Big Data Integration Research (Canberra) - Tuesday, February 3