Data Eng Weekly

Hadoop Weekly Issue #81

03 August 2014

Apache Ambari, which is cluster management software for Apache Hadoop, is a big topic this week. In addition to news that Pivotal and Hortonworks are teaming up to collaborate on Ambari, there is a string of technical articles on it. Another hot topic this week is Apache Spark, specifically its machine learning library, MLlib. Finally, it’s also worth noting that this week’s issue includes links to a few papers from the Proceedings of the VLDB Endowment.


The Databricks blog has a post showing how to write concise Spark code (using the Python library, PySpark) to solve a complicated problem. Specifically, it shows how to use the Alternating Least Squares implementation in Spark’s machine learning library, MLlib, to build recommendations. The post also has some details on scaling the process across a 16-node cluster and how it compares to Apache Mahout’s implementation.
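
For readers who haven't met the algorithm before, here is a minimal pure-Python sketch of the idea behind alternating least squares. It is not the MLlib implementation or its API. It assumes a rank-1 model (one factor per user and one per item) so that each least-squares solve is a single division, and it uses plain L2 regularization. The names `als_rank1` and `objective`, the `ratings` dictionary layout, and the weight `lam` are illustrative choices, not taken from the post.

```python
def als_rank1(ratings, n_users, n_items, lam=0.1, iterations=10):
    """Fit ratings[(u, i)] ~ x[u] * y[i] by alternating least squares.

    ratings maps (user, item) index pairs to observed ratings.
    Rank 1 keeps every least-squares solve a scalar division.
    """
    x = [1.0] * n_users  # user factors
    y = [1.0] * n_items  # item factors
    for _ in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form solution.
        for u in range(n_users):
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = lam + sum(y[i] ** 2 for (uu, i) in ratings if uu == u)
            x[u] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for i in range(n_items):
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = lam + sum(x[u] ** 2 for (u, ii) in ratings if ii == i)
            y[i] = num / den
    return x, y


def objective(ratings, x, y, lam):
    """Regularized squared error over the observed ratings only."""
    err = sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())
    return err + lam * (sum(v * v for v in x) + sum(v * v for v in y))
```

Each half-step minimizes the objective exactly for one side while the other is held fixed, so the objective never increases from one pass to the next. A predicted rating for an unobserved pair is simply `x[u] * y[i]`.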

SequenceIQ’s blog also has a post on Spark’s MLlib, this time focusing on clustering with the K-Means algorithm. The post has example code for using the K-Means API and for running a job (with more source code on GitHub). The post also talks about how Apache Tez shows a lot of promise for iterative/machine learning algorithms such as K-Means.
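
To see why K-Means counts as an iterative algorithm, here is a small pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is an illustration only and does not use the MLlib API or the code from the post. The function names and the random choice of starting centroids are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids
```

Every pass re-reads the full set of points, which is why engines that avoid rewriting intermediate data between passes are attractive for this kind of workload.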

The first of several posts on Apache Ambari in this week’s newsletter covers Ambari’s tools for expanding a cluster. Specifically, it covers adding hosts to a cluster, adding components (e.g. DataNode) to an existing host, replacing hosts, and adding new services (e.g. HBase) to a cluster.

The Hortonworks blog has a post on the future of Apache Ambari, for which Hortonworks is trying to rally the community around three areas: operate, integrate, and extend. This includes Ambari Stacks, the Ambari REST API, and Ambari Views—each of which is described in more detail in the post.

The Kite SDK, which provides libraries, tools, and more for Hadoop development, recently released version 0.15.0. A post on the Cloudera blog highlights some of the new features and improvements in the release. Among them are enriched support for working with datasets by URI (which also cleans up the syntax used to invoke several command line tools), improved support for configuring MR and Crunch jobs, and a new parent POM for Maven projects using Kite.

This post discusses how container virtualization is a good fit for running applications in YARN. By providing isolation, security, consistency, and more, containers can solve a lot of operational issues (e.g. not polluting the host configuration with various versions of packages or configuration). In particular, the post describes how Docker, which supports distributing pre-built images, is a good solution for distributing YARN applications.

A post on the Cloudera blog highlights some of the changes and new features in Apache Spark 1.0. Specifically, it covers the API incompatibilities vs previous versions, the new spark-submit command, Avro support, the Spark History Server, and Spark SQL. There’s also a discussion of the stability of Spark, which the author admits is not as stable or mature as other Hadoop ecosystem components.

The August edition of the Proceedings of the VLDB Endowment was recently published online. There are several papers that could be interesting to readers of this newsletter, including an experimental analysis of graph-processing libraries (including Apache Giraph and GraphLab), a new similarity join framework for MapReduce called ClusterJoin, and an experimental analysis of Hive on Tez vs. Cloudera Impala from IBM Research.

SequenceIQ recently unveiled an open-source, Docker-based, Hadoop-as-a-Service framework called Cloudbreak. A guest post on the Hortonworks blog explores why they chose Ambari to provision Hadoop clusters in Cloudbreak and provides an introduction to the Ambari shell.


Pivotal and Hortonworks announced that they’re teaming up to collaborate on Apache Ambari. Pivotal’s distribution includes its own Pivotal Command Center, but the announcement suggests that they’ll be moving towards Apache Ambari instead. Pivotal will allocate engineering resources to Apache Ambari to help “strengthen Hadoop as an enterprise offering.”

The GigaOm Structure Show has an interview with Hortonworks CEO Rob Bearden in which he discusses the recent investment by HP in Hortonworks. The linked article has some excerpts from the interview, in which Bearden explains how Hortonworks’ HDP fits into the HP Haven architecture (which includes Vertica and Autonomy in addition to Hadoop).

TechRepublic has an article about the history of Cloudera, which was one of the first companies (possibly the first) to offer commercial support for Hadoop. It talks a lot about the early days of Cloudera, including Cloudera’s early hiring practices (which put an emphasis on the open-source community).

Datanami has an article on Guavus Reflex, a data processing and analytics system that’s powered by Hadoop and Spark. Reflex is used by telcos to identify and respond to network outages, among other things. The article talks about the hybrid in-memory streaming and batch processing system that powers Reflex and how it recently adopted Apache Spark.

IBM has published benchmarks of its SQL-on-Hadoop system, Big SQL, against Apache Hive 0.12. While vendor benchmarks should often be taken with a grain of salt, there are some interesting details in the post such as Big SQL’s support for Parquet and ORCFile formats.


Apache Ambari 1.6.1 was released this week. The new version includes a large number of bug fixes and improvements (over 500 Jiras closed!), several of which are highlighted on the Hortonworks blog. Those include performance improvements, additional DNS configuration checks, and simplified database connection setup.

Cloudera announced two new versions of Cloudera Manager, containing a number of bug fixes. Cloudera Manager 4.8.4 solves issues with directory quotas, activity monitors, Hive, and more. Cloudera Manager 5.1.1 fixes a bug with installing JCE policy files.


Curated by Mortar Data



Pepperdata Meetup at SQUARE in San Francisco (Sunnyvale) - Wednesday, August 6

Making Hadoop Realtime by Dr. William Bain, CEO of Scaleout Software (Los Angeles) - Wednesday, August 6


Spark at eBay - Troubleshooting the everyday issues (Bellevue) - Wednesday, August 6


Using HBase Co-Processors to Build a Transactional SQL-on-Hadoop RDBMS (Houston) - Wednesday, August 6

Introduction to Spark Course: SparkML - machine learning library (5 of 7) (Austin) - Wednesday, August 6


MPP Appliance or Hadoop/Hive? What's Your Enterprise Data Warehouse Solution? (Phoenix) - Wednesday, August 6


IBM - Big SQL with Hands-On Lab (Columbus) - Thursday, August 7


Securing Hadoop Deployments to Enterprise Compliance Regulations (Ballwin) - Saturday, August 9


HIVE Evolution - Latest on HIVE, SPARK, TEZ and YARN (Saint Petersburg) - Wednesday, August 6


Rethink Data - and what your business can do (Boston) - Tuesday, August 5


NoSQL for SQL Professionals (Toronto) - Tuesday, August 5

Pick and choose: Database and Big Data technologies in Bluemix (Toronto) - Wednesday, August 6

Introduction to Apache Hive (Kanata) - Tuesday, August 5


M.C. Srivas: Rethinking SQL for Big Data (Sydney) - Tuesday, August 5

'Rethinking SQL for Big Data' by another VIP speaker, M. C. Srivas (MapR) (Melbourne) - Thursday, August 7


Big Data Analytics in Scala (Hong Kong) - Monday, August 4


BigData.SG/HadoopSG August Meetup (Singapore) - Thursday, August 7


Discuss About the Big Data and How It impacting the Industries (Chandigarh) - Saturday, August 9