Data Eng Weekly

Hadoop Weekly Issue #99

14 December 2014

It was quite a busy week for funding news, with the Hortonworks IPO and VC funding for Altiscale and Qubole. There were also quite a few new releases (Oozie, Phoenix, Samza) as well as several posts describing recent releases in more detail. And given that we’re halfway into December, the first 2014 retrospectives and 2015 predictions for Hadoop are in this week’s issue.


Kafka is rapidly becoming a key component of the Hadoop ecosystem and integrating with more and more systems. As more use cases arise, folks are looking to use Kafka in new ways. This post looks at how to model a few different types of data processing jobs with Kafka to ensure exactly-once delivery. It's not a comprehensive solution, but it introduces several key concepts to help think about avoiding duplication.
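To make the idea of avoiding duplication concrete, here is a minimal sketch of one common approach: record the offset of the last applied message together with the result, so that redelivered messages are skipped. This is not taken from the linked post and does not use the Kafka client API. The class and method names are hypothetical, and the log is simulated with an in-memory array whose index stands in for the offset.

```java
// Hypothetical sketch: an in-memory stand-in for a consumer that keeps
// its result and its last applied offset together.
class ExactlyOnceSketch {
    private long total = 0;              // the derived result (a running sum)
    private long lastAppliedOffset = -1; // offset of the last message applied

    // Applies a message only if its offset is newer than the last one applied.
    // Returns true when the message changed the state, false on a redelivery.
    boolean apply(long offset, long value) {
        if (offset <= lastAppliedOffset) {
            return false; // already applied: skip the duplicate
        }
        total += value;
        lastAppliedOffset = offset;
        return true;
    }

    long total() {
        return total;
    }

    long lastAppliedOffset() {
        return lastAppliedOffset;
    }

    // Delivers every message from fromOffset to the end of the simulated log.
    static void consume(ExactlyOnceSketch sink, long[] log, int fromOffset) {
        for (int offset = fromOffset; offset < log.length; offset++) {
            sink.apply(offset, log[offset]);
        }
    }

    public static void main(String[] args) {
        long[] log = {10L, 20L, 30L};
        ExactlyOnceSketch sink = new ExactlyOnceSketch();
        consume(sink, log, 0); // first pass applies offsets 0..2
        consume(sink, log, 1); // simulated restart redelivers offsets 1..2
        System.out.println("total=" + sink.total());
    }
}
```

In a real system the result and the offset would have to be written atomically (for example, in the same transaction) for this to hold across crashes; the sketch only shows the skip-on-redelivery logic.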

This post provides several resources if you're looking to get started with Kafka. In addition to the official docs, it suggests several blog posts.

This article about Kafka (last one, I promise!) has a recap of the O'Reilly Data Show Podcast. The conversation with Jay Kreps covers the origins of Kafka at LinkedIn and Kafka's surprising popularity.

The Hortonworks blog has the second post in a series of articles on data science with Hadoop. The first post looked at predicting airline delays using Pig and Python, and this time they do the same exercise with Spark and Spark’s machine learning library, MLlib. There’s an IPython notebook walking through the code, which does some data preparation and munging before evaluating Logistic Regression, Support Vector Machines, and Decision Trees.

The Cloudera blog has a post describing several new Apache HBase features related to multi-tenancy that have shipped as part of CDH 5.2. These include throttling via quotas and execution queues (e.g. separate queues for read/write and de-prioritizing long-running scans). There’s also a discussion of future work to further improve multi-tenancy.

Apache Knox Gateway is a perimeter security system for Hadoop. The latest release, 0.5.0 (from early November), includes support for HDFS HA, installation via Ambari, authorization via Apache Ranger, and YARN REST API access. The Hortonworks blog has more details on these features and future plans for Knox.

The Hortonworks blog also has a post about the recently released Apache Ranger 0.4.0. Ranger is a system for enterprise security in Hadoop, which is based on the technology that Hortonworks acquired as part of XA Secure. Highlights of the new release include support for Apache Storm and Knox, REST APIs for the policy manager, and storage of audit events in HDFS.

This post looks at the HDFS protocol by way of monitoring system calls made when fetching a file. By running the Python snakebite library under strace, one can see the contents of protobuf RPCs to the namenode and datanode to locate and fetch the data.

Although Spark is written in Scala, a lot of folks are using it via Java (which is a far more widely used language). This post looks at the benefits in syntax and improvements in usability Java users will see as part of Java 8’s lambdas. There’s an analysis of implementing collaborative filtering in Spark with Java 7 and Java 8 (the latter is about half the lines of code). You can find the actual code on the companion GitHub repo.
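For a sense of where the savings come from, here is a small sketch comparing the anonymous-inner-class style that Java 7 users had to write with the equivalent Java 8 lambda. It is not the code from the post and does not use Spark: `java.util.function.Function` stands in for Spark's function interfaces, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: the same map-style transformation written in
// the pre-lambda style and with a Java 8 lambda.
class LambdaComparison {
    // Applies fn to each element; stands in for a map operation on a dataset.
    static <A, B> List<B> mapAll(List<A> input, Function<A, B> fn) {
        List<B> out = new ArrayList<>();
        for (A a : input) {
            out.add(fn.apply(a));
        }
        return out;
    }

    // Pre-lambda style: the function is an anonymous inner class.
    static List<Integer> lengthsJava7Style(List<String> words) {
        return mapAll(words, new Function<String, Integer>() {
            @Override
            public Integer apply(String w) {
                return w.length();
            }
        });
    }

    // Java 8 style: the same function as a lambda expression.
    static List<Integer> lengthsJava8Style(List<String> words) {
        return mapAll(words, w -> w.length());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "lambda");
        System.out.println(lengthsJava7Style(words));
        System.out.println(lengthsJava8Style(words));
    }
}
```

Both methods return the same result; the lambda version simply drops the boilerplate around the single `apply` method.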

Sqoop 1.99.4 was recently released with a number of new features. The blog has a tour of the architectural changes, the new intermediate data format, and command line tools. It also covers improvements to batch execution of shell commands, the REST API, and filesystem URI configuration.

The Cloudera Impala team has written a guide with a number of tips and best practices for building an Impala cluster. The tips are shared as a presentation, and they cover things like schema design, hardware recommendations, and multi-tenancy best practices.

Hortonworks HDP 2.2 shipped earlier this month with features from the Apache Hive 0.14 release. This post on the Hortonworks blog covers a number of common questions about the ACID transaction support added in 0.14. Specifically, it discusses the expected number of concurrent transactions, transaction throughput, and overhead. It also mentions example use cases and some in-progress features that’ll be added soon.

The Amazon Web Services blog has a post describing how to load data from Amazon Kinesis into HBase. The post describes how to use the AWS Java SDK to start an Amazon EMR HBase cluster, create an HBase table that’ll hold the data from Kinesis, implement a Kinesis connector pipeline (full source code on GitHub), and run the Kinesis connector to load data.


As 2014 comes to an end, I expect to see lots of Hadoop year-in-review posts. This one looks at five big data industry trends from 2014. They are: vendors integrating SQL with Hadoop, maturing platforms (security and other enterprise features), new educational options including MOOCs, lots of new cloud options, and Spark (which seems to be integrating with just about every platform).

The AWS Activate blog has a post on Hadoop startup Mortar Data, which is built on the AWS cloud (disclosure: Mortar syndicates Hadoop Weekly and curates the events section). The post talks about their platform, which is built with Amazon EMR, Apache Pig, and the Luigi workflow engine.

Folks from Cloudera, MapR, and Twitter have written about the state of the community of the Apache Parquet (incubating) project. Parquet is a columnar storage format for Hadoop which is integrated with a number of different systems. They credit the community and the ubiquity of Parquet (thanks in part to the community for building many integrations) with making Parquet a successful project.

Hadoop as a Service (HaaS) vendor Altiscale has announced $30 million in Series B funding (bringing total funding to $42 million). Unlike many other HaaS vendors, Altiscale has their own hardware running in datacenters on the east and west coasts of the US. WSJ has more details on the recent funding, the company, and how the company plans to use the new money.

Another HaaS vendor, Qubole, also raised a Series B round this week. Qubole, which was founded by two creators of Apache Hive, raised $13 million (total funding is now $20 million). Qubole's offering includes an optimized version of Hive for the cloud, supports the Presto SQL-on-Hadoop engine, and more. GigaOm has more details on the round and the company.

The Qubole blog, in a post unrelated to their funding, has curated a list of several industry articles covering predictions and expectations for Hadoop in 2015.

It’s always interesting to hear how Hadoop is being used outside of the industries that made it popular (such as ads and search). This post describes the problems that were solved by introducing Hadoop at a hospital to analyze historic patient monitoring data.

The DBMS2 blog has three posts related to Hadoop. The first covers the future of Hadoop (especially highlighting some architectural issues with HDFS which the Tachyon distributed file system could solve) as “the new OS.” The second offers analysis of some news out of MapR this week on their customer growth and numbers. The last article includes notes covering a wide range of topics from Wibidata to Scaling Data to Hortonworks.

Hortonworks became a publicly traded company on Friday on NASDAQ under the ticker HDP. The IPO raised $100 million, and the stock jumped as high as $24.35 on the first day of trading (from an initial price of $16). CNBC has more coverage of the IPO.


Apache Oozie 4.1.0 was released this week. Highlights include a new Sharelib service, improvements to Oozie HA with secure clusters, support for Sqoop via the oozie command, and many bug fixes.

Apache Samza (incubating) version 0.8.0 was released this week. Samza is a stream processing framework built on Apache Kafka and Apache YARN. Highlights of the release include: a new YARN AM UI, support for RocksDB for state management, and major performance improvements.

Apache Phoenix, the relational db layer for Apache HBase, released version 4.2.2 (for HBase 0.98.x) and 3.2.2 (for HBase 0.94.x) this week. The new releases include >100 bug fixes, statistics collection, correlated subqueries, and more.

Version 0.17.1 of the Kite SDK was released. This version has bug fixes and enhancements for kite data and morphlines.

MapR has added formal support for Apache Storm to their distribution. MapR is supporting version 0.9.3 of Storm, which is the most current release.

SequenceIQ has released a new version of Cloudbreak, their cloud-agnostic, Docker-based Hadoop as a Service solution. This post describes one of the major new features of the release: autoscaling of Hadoop clusters via Periscope. Cloudbreak currently supports AWS, Google Cloud, Azure, and OpenStack (in private beta). See the post for more details on Periscope’s alarms, scaling policies, and scaling configurations.


Curated by Mortar Data



Elastic Scaling in Spark 1.2 and Beyond (San Francisco) - Monday, December 15

From 0 to Streaming: Spark and Cassandra (San Francisco) - Wednesday, December 17

Interactive Session on Sparkling Water = Spark + H2O (Mountain View) - Wednesday, December 17

Meet Apache Spark: A Faster and More Flexible Compute Engine (Santa Barbara) - Thursday, December 18


Making Big Data Analytics Simple for Everyone: Datameer (Portland) - Thursday, December 18

Washington State

Getting Started with Pig and Hive on Hadoop in Azure HDInsight (Redmond) - Wednesday, December 17


Spark and Cassandra (Austin) - Wednesday, December 17


Introducing the R Language / About Hue (Saint Louis) - Tuesday, December 16


An Introduction to Hadoop for Big Data Analysis (Louisville) - Thursday, December 18

North Carolina

Hortonworks: How YARN Enables Multiple Data Processing Engines (Charlotte) - Wednesday, December 17

New York

What's New in Neo4j 2.2 (New York) - Monday, December 15

Data Driven NYC: Yann LeCun & Mike Olson (New York) - Tuesday, December 16

Building Data Pipelines Using Luigi, with Erik Bernhardsson of Spotify (New York) - Tuesday, December 16

BDW Meetup: Spark SQL (New York) - Wednesday, December 17

Scaling Apache Storm Pipeline and Building Real-Time Market Data Platform (New York) - Wednesday, December 17

Etsy on Migrating to Kafka in Three Short Years (New York) - Thursday, December 18

New Hampshire

Big Data Hadoop MapReduce Hands-On with Google Cloud (Nashua) - Thursday, December 18


Holiday Guest Speakers: Mark Grover & Reza Zadeh (Toronto) - Thursday, December 18


SQL on Hadoop: SQL+NoSQL in One Place (Stockholm) - Tuesday, December 16


NoSQL in Real-time Architectures (Herzliya) - Tuesday, December 16


BigData and Analytics: Why to Learn Hadoop (Hyderabad) - Wednesday, December 17