Data Eng Weekly

Hadoop Weekly Issue #79

20 July 2014

This week is full of releases and new products—ranging from Oracle’s new Hadoop-SQL product to a new CDH 5.1 release from Cloudera to new tools for transactions on HBase from Continuuity and deploying Hadoop-as-a-Service from SequenceIQ. There are also a number of quality technical articles covering Spark, Kafka, Luigi, and Hive.


This post covers using the Transformer class to manipulate data as it flows into Sqrrl Enterprise. It details loading the Enron email dataset and using a Transformer to build a graph of users sending email. It includes the code for this Transformer and also some examples of querying the dataset using tools found in Sqrrl Enterprise.

The Databricks blog has the first post in a series on some of the new features of MLlib in Spark 1.0. This post focuses on Spark’s improved support for sparse datasets (both storage and performance improvements). The post has some code examples for PySpark and suggestions for when sparse representations work best.
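As a rough illustration of why sparse representations can save both storage and computation, here is a minimal pure-Python sketch. It is not MLlib's API: the `to_sparse` and `sparse_dot` helpers and the (size, indices, values) layout are assumptions made up for this example. Only the nonzero entries are stored, and the dot product touches only those entries.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def sparse_dot(sparse, dense_weights):
    """Dot product that visits only the stored nonzero entries."""
    size, indices, values = sparse
    if size != len(dense_weights):
        raise ValueError("dimension mismatch")
    return sum(v * dense_weights[i] for i, v in zip(indices, values))


# Hypothetical feature vector with two nonzero entries out of six.
dense = [0.0, 3.0, 0.0, 0.0, 2.0, 0.0]
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sparse = to_sparse(dense)
result = sparse_dot(sparse, weights)
```

The benefit grows with the share of zeros in the data. For mostly dense vectors the extra index list costs more than it saves.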

Jay Kreps (LinkedIn, Kafka architect) recently spoke at Cloudera on Apache Kafka. The Cloudera blog has a summary of his talk, which describes the goals and design of Kafka. The slides for the presentation are also available.

Luigi, the open-source workflow engine from Spotify, is the dark horse in Hadoop workflow engines. This presentation provides a great introduction and overview of Luigi. If you're unhappy with your current engine, I suggest you give it a look.

The Databricks Cloud is a new product announced at the Spark Summit. This post motivates the product (e.g. deploying Hadoop can take a long time) and describes its components. In addition to hosted Spark clusters, the product includes notebooks, dashboards, and a job launcher. There is also a plan for integrating third-party applications.

This post describes how to use Apache Spark for Monte Carlo simulations. It uses the simulations to estimate a financial statistic called value at risk (VaR). The post describes VaR, Monte Carlo simulations, and the Spark program to calculate the value. It includes some example code (the Monte Carlo code is being added to Spark’s MLlib, but isn’t yet integrated).
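For readers new to the idea, the following is a minimal single-machine sketch of a Monte Carlo VaR estimate in plain Python. It is not the code from the post and does not use Spark. The `simulate_var` function, the normally distributed one-period returns, and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_var(initial_value, mean_return, volatility, confidence, trials, seed=0):
    """Estimate value at risk by simulating one-period portfolio returns.

    Assumes returns are normally distributed, which is a simplification.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    losses = []
    for _ in range(trials):
        simulated_return = rng.gauss(mean_return, volatility)
        losses.append(-initial_value * simulated_return)  # positive number = loss
    losses.sort()
    # VaR is the loss that is not exceeded with the given confidence.
    index = min(int(confidence * trials), trials - 1)
    return losses[index]


# Hypothetical portfolio: 1,000,000 in value, 0.05% mean return, 2% volatility.
var_95 = simulate_var(1_000_000, 0.0005, 0.02, 0.95, 10_000)
```

Each trial is independent of the others, so the loop is the part that a distributed job would split across workers before collecting and sorting the simulated losses.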

The Hortonworks blog has a post on supporting incremental updates for data stored in Hive. Rather than doing SQL UPDATE statements (which Hive does not yet support), the post describes using a base table and an incremental table, which contains updates to the base. These two tables are then reconciled with a Hive VIEW. The post has many more details on how to implement this scenario, including how to use Sqoop to load incremental data.
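The reconciliation idea can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the Hive view from the post: the `reconcile` function and the `id`, `value`, and `modified` column names are hypothetical. For each key, the newest row across the base and incremental tables wins.

```python
def reconcile(base_rows, incremental_rows):
    """Merge base and incremental rows, keeping the newest row per id."""
    latest = {}
    for row in list(base_rows) + list(incremental_rows):
        current = latest.get(row["id"])
        # ISO-formatted dates compare correctly as strings.
        if current is None or row["modified"] >= current["modified"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])


# Hypothetical data: id 2 is updated and id 3 is new in the incremental table.
base = [
    {"id": 1, "value": "a", "modified": "2014-07-01"},
    {"id": 2, "value": "b", "modified": "2014-07-01"},
]
incremental = [
    {"id": 2, "value": "b2", "modified": "2014-07-15"},
    {"id": 3, "value": "c", "modified": "2014-07-16"},
]
merged = reconcile(base, incremental)
```

Here `merged` holds one row per id, with the incremental version of id 2 replacing the base version.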

Another post on the Hortonworks blog covers integrating Kerberos for Hadoop with Active Directory. It details the steps to set up a Kerberos KDC, use Apache Ambari to enable security on the Hadoop cluster, enable the Kerberos domain and trust in Active Directory, and enable security in Hue.


SQL-on-Hadoop vendor Hadapt was acquired by Teradata. The deal is rumored to have been worth $50M, and Teradata is reportedly expanding its office in Boston, where Hadapt is located.

Cloudera announced that they’re starting a three-day course called “Cloudera Developer Training for Apache Spark.” The course kicks off in August and costs $2295.

A team of Cloudera employees is working together on a new book entitled “Hadoop Application Architectures.” The book is in early release, and the first two chapters, covering data modeling and data movement, are available via O'Reilly.

This post talks about some of the reasons that Spark is all the rage right now. Based on a talk by MapR CTO M.C. Srivas at Spark Summit, it covers some advantages of Spark and several use-cases that MapR is seeing for Spark. It also discusses some of the advantages that Spark has over MapReduce for real-time computation.

Videos from the talks at Spark Summit (which took place earlier this month) have been posted on the conference website. Talks cover three tracks—Applications, Developer, and Data Science. There are also a number of keynotes from both days.


Oracle announced Oracle Big Data SQL this week for running queries against data stored across an Oracle Database, a NoSQL data store, and Hadoop. A post on the DBMS2 blog has more details on the implementation (and how it isn't SQL-on-Hadoop as is commonly understood).

Another big vendor announced a SQL and Hadoop integration recently. Datanami has coverage of Trafodion, a recently announced ANSI-compatible SQL project from HP. Trafodion runs atop HBase, aims to support OLTP, and is open-source (at

Cloudera Enterprise 5.1 was released. CDH 5.1 includes HBase 0.98.1, Spark 1.0, Sentry 1.3, Impala 1.4.0, HUE 3.6, and more. A post on the Cloudera blog discusses some of the security-related improvements. Among them, Cloudera Manager now has an automated workflow for securing a non-secure cluster with Kerberos, HBase has gained cell-level access control, and HDFS has extended ACLs. The full post has more details on Cloudera's grand vision on security as well as how they've integrated the Gazzang offering into Cloudera Navigator.

spark-cassandra-csv is a command-line tool for loading CSV files into Cassandra using Spark.

Version 0.15.0 of the Kite SDK was released this week. The release contains updates to the Datasets API, several updates to the morphlines library, improved documentation, and more.

Cloudera announced support for Apache Accumulo 1.6.0. The release is compatible with both CDH 5 (5.1+) and CDH 4 (4.6+).

Continuuity announced a new open-source project called Tephra. Tephra is a distributed transaction engine for HBase and Hadoop (and is extensible to support other systems like MongoDB). Transactional secondary indexes for HBase are a key use-case that the introductory post highlights.

The SequenceIQ blog has been quite active discussing Hadoop and Docker. This week, they announced Cloudbreak, which provides a cloud-agnostic Hadoop-as-a-Service API using Docker to provision Hadoop. The system also uses Apache Ambari, Serf, and dnsmasq. Cloudbreak has a UI, API, CLI, and a REST client. Code is available on GitHub, and you can sign up for Cloudbreak on the SequenceIQ website.


Curated by Mortar Data



Meetup at Cloudera (Palo Alto) - Tuesday, July 22

Enterprise Security for Apache Hadoop: Finding and Filling the Gaps (Sunnyvale) - Wednesday, July 23

Accelerate Big Data Application Development with Cascading (San Francisco) - Tuesday, July 22

All-Day Event : "Foundations of Big Data" (San Diego) - Thursday, July 24

Datameer & Cloudera Presents the Big Data Analytics City Tour (San Francisco) - Thursday, July 24

Introduction to Apache Spark for Enterprise Architects (Mountain View) - Thursday, July 24


Impala: MPP SQL Engine for Apache Hadoop & Kite SDK: It's for Developers (Portland) - Wednesday, July 23


Introduction to Spark Course: Intro to Shark (3 of 7) (Austin) - Wednesday, July 23


Hadoop for Newbies (Saint Paul) - Thursday, July 24


Cloudera, Hortonworks, MapR, and Pivotal Come Together to Discuss Apache Spark (Arlington) - Tuesday, July 22


Hands-on Workshop on Distributed Machine Learning and Computing with Spark (Vancouver, B.C.) - Saturday, July 26


Interactive SQL-on-Hadoop: from Impala to Hive/Tez to Spark SQL to JethroData (Tel Aviv) - Monday, July 21


Hadoop Just Got a Lot Sexier - Spark on YARN (Shanghai) - Monday, July 21


Spark, the Most Active Apache Project in Big Data (Madrid) - Wednesday, July 23


Michael Hausenblas: Lambda Architecture with Spark (Berlin) - Thursday, July 24


How YARN Made Hadoop Better (Hyderabad) - Saturday, July 26