Data Eng Weekly

Hadoop Weekly Issue #60

09 March 2014

Hadoop Weekly surpassed 2,000 subscribers this week, and I’d like to mark the occasion by again thanking everyone who writes fantastic articles which make this newsletter possible. This issue is full of really interesting technical articles including a number that discuss the ever-expanding Hadoop ecosystem. Enjoy!


The Hortonworks blog has a post describing an integration between ElasticSearch, Flume, and Hadoop. The post includes the technical details for deploying the system, including Kibana, an open-source UI for timestamped data stored in ElasticSearch. In addition to inserting data into ElasticSearch via Flume, the post covers the MapReduce-based libraries for batch indexing data into ElasticSearch.

Matei Zaharia, creator of Spark and CTO of Databricks, recently spoke at StrataConf about how companies are using Spark. Datanami has a recap of the presentation (the talk itself is short and worth watching). The recap covers how Yahoo is using Spark for homepage personalization and advertising analytics, how Conviva uses Spark for optimizing video delivery QoS, and how ClearStory uses Spark as an underpinning of their interactive product.

Debugging Hadoop can be a daunting task, especially if you don’t know where to start. It’s always useful to learn from other folks’ experience and debugging processes. The Altiscale blog has a post about debugging a JobTracker performance issue after a DistCp job. The post walks through examining various logs and metrics, and details the eventual solution to the problem.

Checkpointing is the process of combining the NameNode’s fsimage and edit log file to build a new fsimage. The Cloudera blog has an overview of checkpointing in both an HA NameNode and a traditional NameNode / SecondaryNameNode architecture. The post describes the motivation and implementation in great detail. This information is important and valuable for anyone who’s operating an HDFS cluster.
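
As a rough illustration of the idea (not the actual HDFS implementation or its on-disk formats), checkpointing can be sketched as replaying logged edits on top of the last saved image. The function name `checkpoint`, the operation names, and the dict layout below are all assumptions made up for this sketch.

```python
# Toy model of a checkpoint: replay the edit log on top of the last fsimage
# to produce a new fsimage. The operation names and data layout here are
# illustrative assumptions, not the real HDFS formats.

def checkpoint(fsimage, edit_log):
    """Return a new image with every edit newer than fsimage["txid"] applied."""
    namespace = dict(fsimage["namespace"])  # copy, so the old image is untouched
    last_txid = fsimage["txid"]
    for txid, op, *args in edit_log:
        if txid <= last_txid:
            continue  # this edit is already reflected in the image
        if op == "mkdir":
            namespace[args[0]] = "dir"
        elif op == "create":
            namespace[args[0]] = "file"
        elif op == "delete":
            namespace.pop(args[0], None)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        else:
            raise ValueError("unknown edit log operation: " + op)
        last_txid = txid
    return {"txid": last_txid, "namespace": namespace}


# Hypothetical image and edit log, for illustration only.
old_image = {"txid": 2, "namespace": {"/data": "dir", "/data/a": "file"}}
edits = [
    (2, "create", "/data/a"),            # already in the image, skipped
    (3, "create", "/data/b"),
    (4, "rename", "/data/a", "/data/c"),
    (5, "delete", "/data/b"),
]
new_image = checkpoint(old_image, edits)
print(new_image)
```

Once the new image is written, edits up to its transaction ID no longer need to be replayed at startup, which is the motivation for checkpointing regularly.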

Another post on the Cloudera blog talks about machine learning at scale. It highlights the disconnect between exploratory and small-scale analytics with R and Python vs. scaling up a problem to run via MapReduce. Apache Spark offers a lot of features to bridge this gap, such as a REPL, in-memory analytics, and large-scale distributed computation. In addition to exploring these concepts in detail, the post walks through using Spark to build recommendations on Stack Overflow data using alternating least squares.
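
For readers unfamiliar with alternating least squares, here is a minimal single-machine sketch of the technique, assuming a rank-1 model and a tiny hand-made ratings dictionary. It is not the code from the post and does not use Spark; the function name `als_rank1`, its parameters, and the sample data are all hypothetical.

```python
# Rank-1 alternating least squares on a sparse ratings dict.
# Each rating is modeled as user_factor[u] * item_factor[i]. With one side
# held fixed, the best value for each factor on the other side has a
# closed-form (regularized) least-squares solution, so we alternate.

def als_rank1(ratings, iterations=50, reg=0.01):
    """ratings maps (user, item) -> rating. Returns (user_f, item_f) dicts."""
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor.
        for u in users:
            obs = [(item_f[i], r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = sum(f * r for f, r in obs) / (reg + sum(f * f for f, _ in obs))
        # Fix the user factors and solve for each item factor.
        for i in items:
            obs = [(user_f[u], r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = sum(f * r for f, r in obs) / (reg + sum(f * f for f, _ in obs))
    return user_f, item_f


# Hypothetical data: user "b" rates everything twice as high as user "a",
# and the rating for ("b", "z") is missing.
ratings = {
    ("a", "x"): 1.0, ("a", "y"): 2.0, ("a", "z"): 3.0,
    ("b", "x"): 2.0, ("b", "y"): 4.0,
}
user_f, item_f = als_rank1(ratings)
predicted = user_f["b"] * item_f["z"]
print(round(predicted, 2))
```

A production recommender would use more than one latent factor and distribute the per-user and per-item solves across a cluster, which is where a system like Spark comes in.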

A new tutorial for the Hortonworks Sandbox demonstrates how to read from and write to HBase using the Java APIs. The tutorial provides an example implementation for storing information about doctors in an HBase table. The code is a good starting point if you’re new to HBase (although in a production system you’d most likely want to use a serialization framework like Avro, Thrift, or Protobuf to serialize data). It also includes some sample code to build a pie chart using the JFreeChart library.

In the second post in a series on the Apache HBase BlockCache, HBase committer Nick Dimiduk has done a performance comparison of BlockCache implementations under a random-read load. The sample code and an in-depth analysis have been posted, and there are a number of interesting conclusions. Specifically, if your dataset fits in RAM, LruBlockCache is the best, while the offheap and file configurations seem to be the best as heap size grows and HBase experiences cache churn. The full experiment and conclusions, which are a bit more nuanced, are well worth the read.


Several vendors have highlighted their scores in a recent analyst report by Forrester, but I haven’t included the content because the report itself is not free. CMSWire, however, has done a good job summarizing the report, which covers all of the major vendors in categories like current offering, strategy, and market presence.

Techopedia has a story about some of the challenges facing the Hadoop community and vendors as Hadoop becomes more mainstream. For example, it discusses the conflicts that result from rival vendors employing committers to the same projects, and the hurdles still facing customers getting started with Hadoop. The article frames these and other points in the context of open-source (Hadoop) vs. proprietary (Oracle and other established vendors).


Cloudera Impala 1.2.4 was released. The new version includes bug fixes and performance enhancements for startup times and metadata refresh. The release is compatible with CDH4 and requires Cloudera Manager 4.8.

Mortar Data has updated their Pig/Hadoop as a Service offering to include enhanced troubleshooting. As anyone who has debugged a MapReduce job or workflow knows, digging through the web UI and log files to find the root cause is a cumbersome process. The new Mortar UI highlights the jobs, tasks, and failed tasks for a Pig job, including information about which Pig aliases are associated with each MR job. Hopefully additional services (and Hadoop itself) can learn from these types of UX improvements.

Apache Accumulo 1.5.1 was released. This version includes a number of bug fixes and performance improvements. It also adds compatibility with Apache Hadoop 2.2+ (alongside Hadoop 1.x).


Curated by Mortar Data



HBase Meetup @ (San Francisco) - Wednesday, March 12

Special Event: Hadoop Developer Day -- Let's get hands-on with Hadoop (San Jose) - Thursday, March 13

Pig User Meetup (Sunnyvale) - Friday, March 14

Big Data Science Meetup Event (Fremont) - Saturday, March 15

Washington State

Seattle Spark Meetup Kick Off with Databricks (Bellevue) - Thursday, March 13


Houston Hadoop Meetup Series (Houston) - Wednesday, March 12

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, March 12


Trends in Big Data (Mason) - Tuesday, March 11


Next-Gen Cloud Computing and DevOps with Docker Containers (Arlington) - Wednesday, March 12


The Art and Application of Hadoop (Atlanta) - Tuesday, March 11


Vermont Hadoop Meetup (Burlington) - Wednesday, March 12


Big Data Barcelona (Barcelona) - Monday, March 10


Hadoop Fundamentals I (Zurich) - Wednesday, March 12


Lyon Hadoop Meetup #2 (Lyon) - Thursday, March 13