Data Eng Weekly


Hadoop Weekly Issue #38

06 October 2013

This week's issue has a lot of interesting content, but I'd like to highlight two things in particular: Hivemall and Hourglass, new frameworks for Hive and MapReduce (respectively) released this week. I hope to see more application-level library frameworks like these in the future to help minimize duplicated effort across organizations.

Technical

Alex Holmes, the author of Hadoop in Practice, presented at JavaOne on the next generation of Hadoop. His blog contains a link to his slides as well as an overview of several technologies that he covered in his talk. It features everything from YARN to ElephantDB to Summingbird to Spark. There's a description of each one and links to additional reading. The list is a great way to catch up on all the new components in the Hadoop ecosystem.

http://grepalex.com/2013/09/25/javaone-2013-nextgen-hadoop/

The Hortonworks blog has the fourth post in a series on Apache Tez, the generalized computation framework built on YARN. This post covers Tez's Input/Processor/Output APIs and data flow. In particular, it focuses on how data is passed between objects in the framework and how errors are handled. The post describes fairly low-level APIs that won't be used by the typical Hadoop user, but it's a good peek under the hood for those wanting to understand how Tez works.

http://hortonworks.com/blog/writing-a-tez-inputprocessoroutput-2/
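
For orientation, here is a minimal, self-contained sketch of the Input-Process-Output idea the post describes. The interfaces below are illustrative stand-ins, not Tez's actual API: the point is that a processor is handed a logical input and a logical output by the framework and only worries about the records flowing between them.

// Illustrative only -- these are NOT Tez's interfaces; they just mirror the
// Input -> Process -> Output wiring described in the post.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class InputProcessOutputSketch {

    interface Input {                     // hypothetical: a stream of records for the processor
        Iterator<String> reader();
    }

    interface Output {                    // hypothetical: where the processor writes results
        void write(String record);
    }

    interface Processor {                 // hypothetical: the "process" step wired between the two
        void run(Input input, Output output) throws Exception;
    }

    // A trivial processor: upper-cases every record it reads.
    static class UpperCaseProcessor implements Processor {
        public void run(Input input, Output output) throws Exception {
            Iterator<String> it = input.reader();
            while (it.hasNext()) {
                output.write(it.next().toUpperCase());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final List<String> source = Arrays.asList("a", "b", "c");
        final List<String> sink = new ArrayList<String>();

        Input in = new Input() {
            public Iterator<String> reader() { return source.iterator(); }
        };
        Output out = new Output() {
            public void write(String record) { sink.add(record); }
        };

        // In Tez the framework, not user code, wires inputs and outputs to a
        // processor and handles errors between them; here we do it by hand.
        new UpperCaseProcessor().run(in, out);
        System.out.println(sink);         // prints [A, B, C]
    }
}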

Joe Stein presented on the new features in Cassandra 2.0 at the NYC Cassandra User Group. His talk covers lightweight transactions, eager retries, and triggers. The slides have a lot of details about these new features as well as info on improvements to compaction and CQL.

http://www.slideshare.net/charmalloc/apache-cassandra-20
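
As a concrete taste of one of those features, here's a rough sketch of a lightweight transaction (a Paxos-backed compare-and-set) issued through the DataStax Java driver. The contact point, keyspace, and table are illustrative assumptions, not taken from the talk.

// A rough sketch of a Cassandra 2.0 lightweight transaction via the DataStax
// Java driver. Connection details and schema are assumptions for illustration.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class LwtExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");   // assumes a 'demo' keyspace with a users table

        // IF NOT EXISTS turns the insert into a conditional write: it only
        // succeeds if no row with this primary key is already present.
        ResultSet rs = session.execute(
            "INSERT INTO users (username, email) VALUES ('jsmith', 'jsmith@example.com') " +
            "IF NOT EXISTS");

        // The result includes an [applied] column indicating whether the condition held.
        Row row = rs.one();
        System.out.println("applied? " + row.getBool("[applied]"));

        cluster.close();
    }
}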

The DataStax blog features a post on JVM tooling. The post is aimed at Cassandra developers, but it is applicable to anyone working with disk- or memory-hungry JVM systems on Linux. Specifically, the post features tips for measuring max disk throughput, measuring throughput over time with iostat, and analyzing the memory of a JVM heap.

http://www.datastax.com/dev/blog/tools-for-testing-cassandra

Donald Miner presented at the Twin Cities Hadoop User Group on his book MapReduce Design Patterns. The presentation covers design patterns in general, MapReduce design patterns, and deep dives into two patterns: "Top Ten" and "Bloom Filtering."

http://www.slideshare.net/DonaldMiner/mapreduce-design-patterns
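
For those who haven't seen it, a rough sketch of the "Top Ten" pattern in Java MapReduce follows: each mapper keeps a local top-N in memory and only emits those records from cleanup(), and a single reducer merges the per-mapper candidates into the global top-N. The one-score-per-line input format (and the job driver, which is omitted) are assumptions for illustration.

// Sketch of the "Top Ten" pattern: mappers emit only their local top-N.
import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopTen {
    private static final int N = 10;

    public static class TopTenMapper
            extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private final TreeMap<Long, Long> topN = new TreeMap<Long, Long>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            long score = Long.parseLong(value.toString().trim());
            topN.put(score, score);
            if (topN.size() > N) {
                topN.remove(topN.firstKey());   // evict the smallest
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Long score : topN.values()) {
                context.write(NullWritable.get(), new LongWritable(score));
            }
        }
    }

    // All candidates share the single NullWritable key, so one reduce call
    // sees every mapper's local top-N and can pick the global top-N.
    public static class TopTenReducer
            extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            TreeMap<Long, Long> topN = new TreeMap<Long, Long>();
            for (LongWritable v : values) {
                topN.put(v.get(), v.get());
                if (topN.size() > N) {
                    topN.remove(topN.firstKey());
                }
            }
            for (Long score : topN.descendingMap().values()) {
                context.write(NullWritable.get(), new LongWritable(score));
            }
        }
    }
}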

Speaking of MapReduce design patterns, one common pattern is to compute some metric over a sliding window of data -- e.g. the past 30 days. Some folks at LinkedIn have released a new framework called Hourglass that generalizes this type of problem. Hourglass is being added to LinkedIn's open-source Hadoop library, DataFu. The post walks through key concepts in the framework, shows some example implementations in Java, and provides a walkthrough for running Hourglass on synthetic data.

http://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop
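
The core idea, as the post describes it, is to avoid re-reading the whole window of raw data every day by reusing previously computed per-day results. Here's a plain-Java sketch of that incremental sliding-window idea (not Hourglass's actual API), using a trailing 30-day sum as the metric.

// Plain-Java sketch of incremental sliding-window aggregation (not Hourglass's
// API): keep one aggregate per day and slide the window by adding the new day
// and subtracting the day that fell out, instead of re-aggregating 30 days of
// raw data every time.
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowSum {
    private final int windowDays;
    private final Deque<Long> dailyTotals = new ArrayDeque<Long>();
    private long windowTotal = 0;

    public SlidingWindowSum(int windowDays) {
        this.windowDays = windowDays;
    }

    // Called once per day with that day's aggregate (itself computable from a
    // single day of raw input, e.g. by a small daily MapReduce job).
    public long addDay(long dayTotal) {
        dailyTotals.addLast(dayTotal);
        windowTotal += dayTotal;
        if (dailyTotals.size() > windowDays) {
            windowTotal -= dailyTotals.removeFirst();   // drop the expired day
        }
        return windowTotal;                             // trailing-window total after this day
    }

    public static void main(String[] args) {
        SlidingWindowSum window = new SlidingWindowSum(30);
        for (int day = 1; day <= 60; day++) {
            long total = window.addDay(100 + day);      // synthetic per-day counts
            System.out.println("day " + day + " -> trailing-30-day total = " + total);
        }
    }
}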

Cassandra 2.0.2 has a new feature called "Rapid read protection," which is a mechanism to reduce latency in the case of node failure or degradation. By keeping track of response-time distributions from each node in the cluster, Cassandra's coordinator node can attempt a 'speculative retry' when a response from the first node takes longer than expected (the criterion is tunable). As noted in the article, the strategy is similar to MapReduce's speculative execution. This is an often sought-after feature in online systems, and a lot of companies end up implementing something like it themselves. Thus, it's really exciting to see this baked into Cassandra (and hopefully more systems will adopt it, too).

http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2
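
The mechanism is easier to see in miniature. Below is a conceptual sketch of the speculative-retry idea (not Cassandra's code): issue the read to one replica, and if no answer arrives within a threshold -- Cassandra derives its threshold from observed latency percentiles -- issue a duplicate read to another replica and take whichever response comes back first.

// Conceptual sketch of speculative retry; replica reads and delays are stand-ins.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SpeculativeRead {

    // Stand-in for a read against one replica; the delay is illustrative.
    static Callable<String> readFrom(final String replica, final long delayMs) {
        return new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(delayMs);
                return "value from " + replica;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ExecutorCompletionService<String> ecs = new ExecutorCompletionService<String>(pool);

        long speculateAfterMs = 50;                       // stands in for the tunable p99-style threshold

        ecs.submit(readFrom("replica-1", 200));           // the primary read is slow today
        Future<String> first = ecs.poll(speculateAfterMs, TimeUnit.MILLISECONDS);

        if (first == null) {
            // No answer within the threshold: speculatively retry on another replica.
            ecs.submit(readFrom("replica-2", 20));
            first = ecs.take();                           // whichever replica answers first
        }

        System.out.println(first.get());
        pool.shutdownNow();                               // abandon the straggler
    }
}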

Hadoop MapReduce is often used for building search engine indexes (this was the original use-case for which MapReduce and Hadoop were designed). This post covers integrating Pig and ElasticSearch by generating Lucene indexes offline and adding them to ElasticSearch's data directory. Such an integration has been available for a while using ElasticSearch's REST API, but this is a completely different approach that is likely more practical in certain situations. The post covers using Twitter's ElephantBird and Apache Pig to generate Lucene indexes and how to add them to ElasticSearch.

http://www.poudro.com/blog/building-an-elasticsearch-index-offline-using-hadoop-pig/
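
For readers who haven't built an index outside of ElasticSearch before, here is a plain-Java sketch of what writing a Lucene index offline looks like. It is not the ElephantBird/Pig integration from the post, and the Lucene version and field layout are assumptions; in the Hadoop setup, reduce tasks produce directories like this, which are then dropped into ElasticSearch's data directory.

// Writing a small Lucene index to local disk (Lucene 4.x API assumed).
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OfflineIndexBuilder {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/offline-index"));
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
        IndexWriter writer = new IndexWriter(dir, config);

        // Illustrative document; real jobs would emit one per input record.
        Document doc = new Document();
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        doc.add(new TextField("body", "hadoop weekly issue 38", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();   // the index directory is now ready to be shipped
    }
}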

News

Cloudera announced that they are supporting and integrating with Apache Accumulo. Accumulo is a distributed key-value store with design properties similar to HBase, plus support for fine-grained, cell-level authorization on the data it stores. Cloudera announced a private beta of Accumulo for CDH in July, and they are one of the first vendors to publicly announce support for Accumulo. There's no word yet on whether Accumulo will be bundled with CDH4/5 or whether Cloudera plans to support an Apache release (although evidence points to the latter).

http://www.marketwired.com/press-release/cloudera-announces-support-for-apache-accumulo-1835834.htm

Cassandra Summit EU starts in 10 days in London. The full schedule has been announced, and it includes a number of interesting talks across three tracks -- Architect, DataOps, and Developer.

http://www.datastax.com/cassandraeurope2013/sessions#h1

Cloudera CTO Amr Awadallah spoke with Silicon Angle about Cloudera's vision of a "single platform where you can store data of any kind." He remarks that Cloudera, in an effort to address customer needs, is focusing on stability and reliability of their platform rather than on adding new features. He also speaks about HBase vs. Accumulo and Cloudera's relationship with Splunk.

http://siliconangle.com/blog/2013/10/02/cloudera-wants-to-build-the-iphone-of-big-data-splunkconf/

Syncsort, makers of the high-performance mainframe-to-HDFS software DMX-h, have announced an acquisition of Circle Computer Group, which makes mainframe data migration software. In an article about the acquisition, there are a number of interesting quotes from Syncsort CEO Lonne Jaffe. In particular, he speaks about how they are trying to corner the mainframe-to-Hadoop market, something they stumbled upon a few years ago and that few others are paying attention to.

http://www.datanami.com/datanami/2013-09-30/syncsort_bolsters_mainframe-to-hadoop_play_with_circle_buy.html

Mike Olson, CSO of Cloudera, wrote a post about the role of open source at Cloudera. Mike has had a lot of first-hand experience with open source through his years working on Postgres and Berkeley DB. The post walks through those experiences and details a number of other commercial open-source offerings (like Red Hat Enterprise Linux). Mike explains that Cloudera believes that "platform software has to be open source," but that there's room for closed-source software that it will use to differentiate.

https://www.linkedin.com/today/post/article/20131003190011-29380071-the-cloudera-model

Releases

Version 1.3.2 of Snakebite, the Python HDFS client from Spotify, has been released. This update contains changes to command-line argument handling and a fix for fully-qualified HDFS URLs.

https://github.com/spotify/snakebite/blob/1.3.2/debian/changelog

Hivemall is a new library of Hive UDFs for machine learning. It contains a number of ML algorithm implementations across classification, regression, ensembles, loss functions, and feature engineering. In particular, Stochastic Gradient Descent, Min-Max Normalization, and Passive Aggressive Regression (among many others) are included. Hivemall targets Hive 0.9 or later and is licensed under the LGPL.

https://github.com/myui/hivemall
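
Hivemall's algorithms ship as Hive UDFs/UDAFs, so they can be called inline from HiveQL after an ADD JAR and CREATE TEMPORARY FUNCTION. As a reminder of what that packaging looks like, here is a trivial custom UDF (a one-value min-max scaling, not taken from Hivemall) built on the basic Hive UDF class that such functions extend.

// A trivial Hive UDF for illustration; Hivemall's functions use the same mechanism.
import org.apache.hadoop.hive.ql.exec.UDF;

public class MinMaxScaleUDF extends UDF {
    // Hive resolves evaluate() by reflection; this scales x from [min, max] into [0, 1].
    public Double evaluate(Double x, Double min, Double max) {
        if (x == null || min == null || max == null || max.doubleValue() == min.doubleValue()) {
            return null;
        }
        return (x - min) / (max - min);
    }
}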

Hortonworks has announced a beta of the Ambari SCOM Management Pack, an add-on to HDP for Windows. Ambari provides lifecycle management and monitoring for the Hadoop stack, while Microsoft's System Center Operations Manager (SCOM) is a popular monitoring system in Windows environments. Using Ambari's REST API to retrieve information about the cluster, the Ambari Management Pack exposes information and metrics about a Hadoop cluster within System Center.

http://hortonworks.com/blog/integrating-hadoop-operations-with-microsoft-system-center/
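
For the curious, pulling cluster information from Ambari's REST API is a plain HTTP GET with basic auth. The sketch below is illustrative -- the host, credentials, and cluster name are assumptions -- but it shows the kind of call the management pack makes on behalf of System Center.

// Fetch a cluster resource from Ambari's REST API (host/credentials assumed).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.xml.bind.DatatypeConverter;

public class AmbariClusterInfo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://ambari-server:8080/api/v1/clusters/MyCluster");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Ambari uses HTTP basic auth (admin/admin on a fresh install).
        String auth = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // JSON describing the cluster, its services, and metric links
        }
        in.close();
    }
}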

BMC, makers of the Control-M workflow automation system, have announced support for Hadoop within Control-M. The system supports Pig, Hive, MapReduce, Sqoop, and more in addition to all of the enterprise systems that Control-M has traditionally integrated with. Control-M offers some advanced features not typically seen in open-source Hadoop frameworks such as wide-ranging system integration and estimation of runtime based upon historic runs and the current state of the system.

http://news.techworld.com/applications/3471308/bmc-puts-hadoop-on-the-company-timesheet/

Events

Curated by Mortar Data ( http://www.mortardata.com )

Monday, October 7

Spark overview and PySpark demo (San Francisco, CA)
http://www.meetup.com/San-Francisco-PyData/events/142107482/

Tuesday, October 8

Discuss Twitter's use of Apache Mesos and Apache Aurora in Production (McLean, VA)
http://www.meetup.com/bigdatadc/events/143187322/

Wednesday, October 9

October SF Hadoop Meetup (San Francisco, CA)
http://www.meetup.com/hadoopsf/events/141368262/

Hadoop for Data Science (Laurel, MD)
http://www.meetup.com/Data-Science-MD/events/139054402/

Scalding the Crunchy Pig for Cascading into the Hive: Popular MapReduce tools (New York, NY)
http://www.meetup.com/Hadoop-NYC/events/142038762/

Thursday, October 10

Making R play well with Hadoop - David Champagne & Antonio Piccolboni, Revolution (Los Angeles, CA)
http://www.meetup.com/LA-HUG/events/140572752/

Friday, October 11

Algorithms for MapReduce - Jeffrey D. Ullman, Stanford University (Warsaw, Poland)
http://www.meetup.com/warsaw-hug/events/141413222/