Data Eng Weekly

Hadoop Weekly Issue #74

15 June 2014

With Hadoop Summit in recent memory, this week’s newsletter has several posts from or summarizing the summit. Technical articles cover a wide range of topics, from Hive and Pig tips to logging infrastructure at Loggly. SQL-on-Hadoop was also a big topic this week, with discussions about the need for it to drive Hadoop adoption.


The Mortar blog has a post with some tips for using Apache Pig. It covers some lesser-known features of Pig, such as writing UDFs in JavaScript, data sampling, and casting a relation to a scalar. If you use Pig and are looking to level up your game, this is a great place to start.

HDFS RAID is a mechanism to use erasure codes instead of replicas in HDFS. Glossing over the technical details (which are covered in this article), you can do 2.2x or 1.4x replication instead of 3x, which makes for huge savings on large clusters. Facebook has posted about their experience deploying HDFS RAID to save petabytes of storage. There are a lot of tips and details on problems they faced on the road to reclaiming that storage space.
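To make the 2.2x and 1.4x figures concrete, here is a rough back-of-the-envelope sketch in Python. The stripe layouts are assumptions for illustration (10 data blocks plus 1 XOR parity block with two copies of each, and 10 data blocks plus 4 Reed-Solomon parity blocks with one copy of each); the linked article is the authority on the real configuration. The helper name `effective_replication` is hypothetical.

```python
def effective_replication(data_blocks, parity_blocks, data_replicas, parity_replicas):
    """Raw blocks stored per logical data block for one stripe."""
    stored = data_blocks * data_replicas + parity_blocks * parity_replicas
    return stored / data_blocks


# Plain HDFS replication: no parity blocks, three copies of every data block.
plain = effective_replication(10, 0, 3, 0)

# Assumed XOR layout: 10 data blocks + 1 parity block, two copies of each.
xor = effective_replication(10, 1, 2, 2)

# Assumed Reed-Solomon layout: 10 data blocks + 4 parity blocks, one copy of each.
reed_solomon = effective_replication(10, 4, 1, 1)

for name, factor in [("3x replication", plain), ("XOR", xor), ("Reed-Solomon", reed_solomon)]:
    saved = 1 - factor / plain  # fraction of raw storage saved relative to 3x
    print(f"{name}: {factor:.1f}x raw storage, {saved:.0%} saved vs. 3x")
```

Under these assumptions, the savings come from storing a small number of parity blocks per stripe instead of full extra copies of every block.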

Loggly, which makes a log management service, has written about their usage of Apache Kafka. Kafka let them simplify their deployment, which processes hundreds of thousands of events per second. They also talk about some of the technical details and operational concerns of their deployment (such as what machines they use on AWS and how they control resource utilization).

Apache Flink (incubating), formerly known as Stratosphere, is a next-generation processing framework with similar goals to other frameworks like Apache Tez and Apache Spark. This post explains the philosophy and design behind Flink, which is heavily influenced by relational database optimizers. Essentially, Flink will try to rearrange or rewrite the pipeline you've described in order to improve performance based on statistics and other knowledge of the underlying data.

There are quite a few options for running SQL queries against data stored in Hadoop (HDFS, HBase, or API-compatible file systems). This post covers a number of them: Apache Hive, Impala, Presto, Shark, Apache Drill, Pivotal HAWQ, IBM BigSQL, Apache Phoenix, and Apache Tajo. For each one, there’s an overview of the tool and a recommendation for when to use it.

The MapR blog has a tutorial on deploying Apache Accumulo 1.5 on MapR 3.1. The tutorial walks through the various MapR FileSystem and Accumulo configuration settings.

The Hortonworks blog has a post on using Cascading to build a flow for parsing log files, grouping by IP, and generating counts per IP. The post has the code and a full walkthrough of how the code works.
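For readers who want to see the shape of that flow before diving into the Cascading code, here is a minimal plain-Python sketch of the same three steps: parse each log line, group by IP, and count per IP. It does not use the Cascading API, and it assumes the client IP is the first whitespace-separated field of each line (as in common access-log formats). The function name and sample lines are made up for illustration.

```python
from collections import Counter


def count_requests_per_ip(lines):
    """Parse log lines, group them by client IP, and count each group."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ip = fields[0]  # assumption: the IP is the first field on the line
        counts[ip] += 1
    return counts


# Hypothetical sample data, not taken from the post.
sample_log = [
    '10.0.0.1 - - [15/Jun/2014:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [15/Jun/2014:10:00:01 +0000] "GET /about HTTP/1.1" 200 128',
    '10.0.0.1 - - [15/Jun/2014:10:00:02 +0000] "GET /blog HTTP/1.1" 404 0',
]

for ip, count in sorted(count_requests_per_ip(sample_log).items()):
    print(ip, count)
```

In a Cascading flow, each of these steps would be a separate pipe operation running across the cluster rather than a loop on one machine.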

The Apache Accumulo Summit was this week, and there were a number of great presentations. This one on scaling Accumulo clusters has lots of details on Accumulo's underpinnings, which help it support large datasets at high throughput.

The SF Data Mining meetup recently featured a presentation entitled "Mining Big Data for Apache Spark." Hakka Labs has a video of the presentation, which features the MLlib library from Spark and a live demo of the tools.

This post shows how to use Apache Spark to classify the Reuters 1987 dataset. The code for the tutorial is written in Scala and features XML parsing (using SAX), stemming/tokenization using Lucene, computing TF-IDF, and building a Naive Bayes model. The code for the example is on GitHub, and there are instructions for building the example in the post.
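As a refresher on the TF-IDF step, here is a small plain-Python sketch. It uses the textbook definition (term frequency times log(N / document frequency)), which is an assumption on our part: the post's Scala code, or any library it relies on, may use a smoothed variant. Tokenization and stemming are assumed to have happened already, and the tiny documents below are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per already-tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical, pre-tokenized documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "fall"],
    ["grain", "export", "rise"],
]

scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, {term: round(weight, 3) for term, weight in doc_scores.items()})
```

Terms that appear in only one document (like "price") get the highest weight, which is what makes TF-IDF vectors useful as input features for a classifier such as Naive Bayes.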

The Cloudera blog has a post on rolling upgrades, which have been a feature of Cloudera Manager since version 4.6. While most native packages like RPMs and debs don’t allow the simultaneous install of multiple versions of a package, Cloudera Manager can distribute binaries as ‘parcels.’ This, along with the highly available NameNode, facilitates rolling restarts. The post has more details on the process.

Hadoop Internals has a number of details on various parts of Hadoop. It covers Hadoop architecture, the anatomy of a MapReduce job, the various daemons in a Hadoop cluster, a list of key configuration parameters (and what they affect), and more.

This post has five tips for working with Hive. They cover two important configuration parameters, a tip on writing queries, and two built-in UDFs: percentile_approx() and histogram_numeric(). There are several example queries illustrating the tips.


The Hortonworks blog has a post with some key takeaways from Hadoop Summit. They include momentum (highlighting the number of attendees at the summit), the rise of YARN and all the tooling around it, and enterprise Hadoop.

Datanami has a post on Hadoop-as-a-service and hosted Hadoop, which seem to be gaining steam. The post includes interviews with Qubole and Altiscale. There are also some numbers from these and other companies that back up the growth of managed Hadoop.

Big Data and Brews has a conversation with Ovum’s Tony Baer about SQL and Hadoop. The conversation, for which there is both a video and a transcript, contains a lot of interesting points about scaling Hadoop within an organization and across many enterprises. This is where SQL comes in, because many BI tools and applications (which are driving forces for scaling Hadoop) expect to pull back data via SQL.

ScalingData is a new company founded by several Cloudera veterans to build tools using Hadoop to help companies with IT operations. This week, they announced $4.4M in funding to build their platform.

Forbes has a contributor article from SiliconANGLE founder John Furrier on SQL, Open Source, and Security on Hadoop. The piece highlights some of the recent advancements and new tools in the SQL-on-Hadoop market, the oft-discussed spectrum of open-source strategies for Hadoop vendors, and the role of security in enterprise adoption (which recently picked up steam with acquisitions by Hortonworks and Cloudera).


Spring for Apache Hadoop 2.0 reached GA this week. The new release includes support for a number of distributions, including Apache Hadoop 1.x/2.2/2.4, Pivotal HD 2.0, Cloudera CDH 5, and Hortonworks 2.1. Spring for Hadoop has tools for developing YARN applications, abstractions for reading from/writing to HDFS, and POJO support for Hadoop datasets using the Kite SDK.

Cloudera announced Cloudera Enterprise 5.0.2 (which includes CDH 5.0.2 and Cloudera Manager 5.0.2). The new release of CM includes a fix for Impala query monitoring, and CDH includes fixes for Hadoop, HBase, HDFS, Hive, Pig, and YARN.

Hortonworks announced HDP Security, which includes some new features as a result of their XA Secure acquisition. The new features include a centralized security tool, fine-grained access control for HBase, Hive, and HDFS, and audit logging.

Continuuity Loom 0.9.7 was released this week. Loom is a cluster provisioning and management suite for private and public clouds. The new release includes a number of changes, including cluster reconfiguration and service addition. There are more details of the release on the Continuuity blog.


Curated by Mortar Data



Productionizing Spark Streaming, Tableau Spatial Queries, Spark Search Indexing (Mountain View) - Tuesday, June 17

45th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) - Wednesday, June 18

Washington State

Scalable Analytics with R and Hadoop (Seattle) - Monday, June 18


Big Data Technologies - Apache Spark with MapR (Portland) - Wednesday, June 18


UHUG - Can a Pig Wear Lipstick? (Salt Lake City) - Wednesday, June 18


Leverage what you already know with BigSQL 3.0 on Hadoop (Scottsdale) - Wednesday, June 18


Hortonworks Educational Workshop (Fort Worth) - Thursday, June 19


St. Louis Hadoop Users Group Meetup (Saint Louis) - Tuesday, June 17


Hello Hadoop, meet Apache Spark (Chicago) - Wednesday, June 18

North Carolina

SQL for Hadoop (Durham) - Monday, June 16

This Ain't Your Father's Search Engine (Durham) - Thursday, June 19

First meeting of the Triad Hadoop Users Group (Winston Salem) - Thursday, June 19

New Jersey

Princeton Tech Meetup w/ Gilt Groupe (Princeton) - Wednesday, June 18

New York

YARN Tech Talk: The Data Operating System for Hadoop 2.0 (New York) - Tuesday, June 17


June Hadoop Meetup (London) - Tuesday, June 17


R & Hadoop (Singapore) - Wednesday, June 18


Let’s Discuss Hortonworks Bigdata, Its Significance, Future & Training (Bangalore) - Saturday, June 21

Hadoop Ecosystem (Hyderabad) - Saturday, June 21


Bigdataeverywhere Conference - MAPR & VERTICA (Herzeliyya) - Sunday, June 22