Data Eng Weekly

Hadoop Weekly Issue #110

01 March 2015

There were a lot of releases this week, including the first project out of Confluent (version 1.0 of the Confluent Platform), version 1.0 of the Kite SDK, and version 1.0 of Apache HBase. And as has been the trend over the past 6 months, there's a lot of coverage of Apache Spark and Apache Kafka in this week's issue, including some details on Databricks (the company founded by Spark's creators).


Apache Flink is introducing a new feature called Flink Streaming, a real-time stream processing library. Flink Streaming is currently available on master but is not yet part of a release. It doesn't rely on micro-batching, which enables advanced windowing techniques like sliding windows and data-driven windows. The introductory blog post has several examples of the API.
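The sliding windows mentioned above can be illustrated in plain Python (a toy sketch of the concept, not the Flink API; the window and slide sizes are hypothetical). Each window covers `window_size` time units and a new window starts every `slide` units, so a single event can fall into several overlapping windows:

```python
def sliding_window_counts(events, window_size, slide):
    """Count events per sliding window. `events` are (timestamp, value)
    pairs, assumed sorted by timestamp. Returns (start, end, count) tuples."""
    if not events:
        return []
    results = []
    t = events[0][0]
    end = events[-1][0]
    while t <= end:
        lo, hi = t, t + window_size
        # an event belongs to every window whose [lo, hi) interval covers it
        count = sum(1 for ts, _ in events if lo <= ts < hi)
        results.append((lo, hi, count))
        t += slide
    return results
```

With `window_size=3` and `slide=2`, an event at time 2 is counted in both the `[0, 3)` and `[2, 5)` windows, which is exactly the overlap that micro-batch boundaries make awkward.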

This article describes how to deploy a Flume master and agents to stream data from a file on the local filesystem to HDFS.

This post describes the notion of "moving the code to the data" in big data processing and how it applies to Apache Spark. It discusses Spark's Resilient Distributed Datasets (RDDs), how programs are translated to operations on RDDs (at a high level), data shuffling, execution planning, and fault tolerance.
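The difference between transformations that stay within a partition and those that shuffle data can be sketched in plain Python (a toy model of the idea, not Spark's actual machinery; all names here are made up):

```python
def map_partitions(partitions, fn):
    # "narrow" transformation: fn runs inside each partition,
    # so no data moves between partitions (no shuffle)
    return [[fn(record) for record in part] for part in partitions]

def group_by_key(partitions, n_out):
    # "wide" transformation: every record is routed to the output
    # partition that owns its key -- this movement is the shuffle
    out = [{} for _ in range(n_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % n_out].setdefault(key, []).append(value)
    return out
```

In Spark, chains of narrow transformations run together in one stage, while a wide transformation forces a stage boundary and a shuffle of data across the network.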

A recent release of the Kite SDK added support for data stored as JSON-encoded text. This post describes how to automatically infer a schema from JSON data (based on the first 20 records in a dataset) and import JSON data into Hive via the json-input command.
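This style of schema inference can be sketched in a few lines of Python (an illustration of the idea, not Kite's implementation; the 20-record sample size mirrors the post):

```python
import json

def infer_field_type(value):
    # order matters: bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"

def infer_schema(lines, sample=20):
    """Derive a field -> types map from the first `sample` JSON records,
    widening a field to multiple types when records disagree."""
    fields = {}
    for line in lines[:sample]:
        record = json.loads(line)
        for name, value in record.items():
            fields.setdefault(name, set()).add(infer_field_type(value))
    return {name: sorted(types) for name, types in fields.items()}
```

A field that is sometimes null ends up with both "null" and its concrete type, which maps naturally onto an Avro union.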

Spark was a hot topic at last week's Strata+Hadoop World. The team from Databricks gave a number of presentations on topics like Spark Streaming, debugging and tuning Spark, and the future of Spark. All of the slides from these presentations have been posted on slideshare (linked to from the following post).

This post describes five ways to make Apache Hive queries faster. The recommendations are mostly around new features that are disabled by default or that require additional processing. These include: enabling Tez, writing data using ORCFile, enabling vectorization, enabling the cost-based optimizer (and computing statistics), and some tips for writing good SQL for Hive.
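Several of these switches are per-session Hive settings; a hedged sketch of what enabling them looks like (the table name is hypothetical, and defaults vary by Hive version):

```sql
-- run queries on Tez instead of MapReduce
SET hive.execution.engine=tez;

-- turn on vectorized query execution
SET hive.vectorized.execution.enabled=true;

-- enable the cost-based optimizer and feed it statistics
SET hive.cbo.enable=true;
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- ORCFile is selected when the table is created
CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales;
```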

Flume has a number of plugins for interacting with Kafka. Along with the morphlines library, which can be used to parse and process data, one can build powerful data pipelines with Flume and Kafka without writing any code. This post on the Cloudera blog describes how to build one such integration: Apache httpd log data is sent to syslog, which is forwarded to Flume, then to Kafka, and finally to Solr (via Flume/Morphlines).

This post covers rolling upgrades in YARN. Once side-by-side installations of multiple versions of YARN are available on all nodes, the Job History & Timeline Service, ResourceManagers, and NodeManagers can be restarted with the new version in a rolling fashion. Any YARN applications should survive this process, although there are some best practices to build compatible YARN applications (e.g. using the distributed cache rather than local files for config). There are many more details about YARN HA in the following post from the Hortonworks blog.

This post describes how to configure Spark to emit metrics to Graphite. There are some examples of the types of data you can collect using these metrics, and a script and template for working with Grafana (a web front-end to Graphite) dashboards for Spark.
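Spark's metrics system is configured through a metrics.properties file; a minimal sketch for a Graphite sink might look like this (the host, port, and prefix are placeholders):

```
# metrics.properties: report metrics from all instances to Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
```

Pointing spark.metrics.conf at this file (or placing it in the conf directory) makes the configured sinks pick it up at startup.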

The Amazon Big Data blog has a post on using IPython to run a Python (streaming) MapReduce job on Amazon Elastic MapReduce. The instructions are general purpose, but the ipython-notebook bootstrap action makes it quite easy to get started with IPython on EMR.
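Hadoop Streaming jobs are just programs that read stdin and write tab-separated key/value lines to stdout; a minimal Python word-count pair (illustrative, not the post's exact job) looks like:

```python
from itertools import groupby

def mapper(lines):
    # map phase: emit "word<TAB>1" for every word seen
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # reduce phase: Hadoop delivers the mapper output sorted by key,
    # so equal words are adjacent and can be summed with groupby
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"
```

In a real job these would be two scripts reading sys.stdin, passed to Hadoop via the -mapper and -reducer flags; the framework's shuffle performs the sort between the two phases.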

Cloudera has announced a beta version of Hive-on-Spark that includes all major functionality. For folks anxious to try it out, there is a custom CDH parcel, which supports a subset of the regular CDH components.

The Confluent blog has an excellent two-part series on building a streaming data platform with Apache Kafka. The first post is foundational—it describes a stream-centric data architecture, events and event streams, the requirements of a stream data platform, and more. The second post is more practical—it contains advice for building a Kafka-based platform, covering topics such as Kafka topologies, data formats and schemas, various types of data (event streams, application logs, machine metrics, etc), and stream processing.


Datanami has an article on the Open Data Platform (ODP) announcement from last week. They've collected a number of opinions from folks in industry. In particular, independent software vendors (ISVs) seem to think the goal of standardizing is important in order to ease the pain of integrating with the plethora of distributions. But the ISVs are unsure that the ODP should be the organization responsible for this—the Apache Software Foundation could also potentially provide the standard.

VentureBeat has coverage of Hortonworks' first quarterly report since becoming a public company. In short, the company announced $12.7 million in revenue for Q4 2014 and a net loss of $2.19/share (analysts had expected a loss of $2.04/share). The good news is that Hortonworks saw billings grow 148% year-over-year after adding 99 new customers in Q4.

ReadWrite recently spoke with Hadoop creator Doug Cutting about the growing Hadoop ecosystem (comparing Hadoop to Linux), Hadoop's warts ("It used to be a lot worse [..] It's gotten a lot better"), Hadoop's limitations (enterprise features and integrations), and how to get started with Hadoop (usually a new project lowers the barrier to entry).

Qubole has rounded up a number of articles from last week, including several on the Open Data Platform. In addition, there are a few articles based on happenings at Strata, the Hadoop skills gap, and more.

The DBMS2 blog has a post on Databricks and Spark. The post covers several topics, including Databricks Cloud, the main product offering from Databricks. Databricks Cloud runs independently of Hadoop on Amazon Web Services, and it is not yet generally available. The post also discusses the company (which is over 50 people), Spark DataFrames, SparkR, and Spark Streaming.


Apache HBase released version 1.0 this week. The 1.0 version number marks the maturity of the feature set and the stability of the release. Since version 0.98, several new features have landed: timeline-consistent read replicas, online configuration changes, revamped documentation, and API improvements. The Apache Software Foundation blog has a press release and a technical post on the release.

The Kite SDK has announced the 1.0.0 release, which promises API stability in coming versions. The new release also removes all deprecated classes/methods, improves Avro schema handling, and contains some other minor changes.

Google has announced a new version of bdutil, their tool for running Hadoop clusters on Google Cloud Platform. The new release includes bug fixes, improvements for Spark, new support for Apache Flink, improvements to bdutil's Ambari plugin, a plugin for Apache Storm, a basic HBase plugin, and a basic CDH plugin.

Cloudera has released Cloudera Enterprise 5.2.4 and 5.3.2. The 5.2.4 release fixes an edit log corruption issue in HDFS and issues with Search, Impala, Hive, Hue, and Cloudera Navigator. The 5.3.2 release includes a fix for edit log corruption as well as fixes for nearly every component in CDH. Cloudera Manager also has several fixes.

Confluent, makers of a stream data platform powered by Apache Kafka, announced version 1.0 of the Confluent Platform. Components of the platform include: Apache Kafka (with a few patches), a schema management layer, a REST API for Kafka, and Camus for sending data from Kafka to HDFS.


Curated by Datadog



Bayesian Networks with R and Hadoop (San Francisco) - Monday, March 2

Cassandra & Hadoop Explained in Real-World Applications (Palo Alto) - Tuesday, March 3

Samza Meetup Kickoff (Mountain View) - Wednesday, March 4

Long-Running Services on YARN Using Slider (Santa Clara) - Wednesday, March 4

What Makes Machine Learning Easy to Program, and What Makes It Fast? (Palo Alto) - Friday, March 6


Cassandra at Instagram, Twitter, and Facebook: Industry Best Practices (Minneapolis) - Monday, March 2


R Intro and YARN: A Game Changer (Oklahoma City) - Thursday, March 5


First Spark Meetup: Bringing This Group Alive (Alpharetta) - Thursday, March 5

New York

Apache Ignite: Introducing the Future of Fast Data (New York) - Tuesday, March 3


Big Data, The Next Generation: Hands-on with Hadoop YARN, Impala, and Spark (Cambridge) - Wednesday, March 4

Spark-as-a-Service Demo by Qubole, and Big Data Applications by DataXu (Boston) - Thursday, March 5

England

Spark London Meetup (London) - Thursday, March 5


End-to-End Twitter Analysis with Hadoop and Watson in 10 Minutes (Stuttgart) - Friday, March 6


Cassandra Spark Connector Architecture and API (Rotterdam) - Thursday, March 5


Lessons Learned Building a Spark Distribution (Lausanne) - Tuesday, March 3


Introduction to Apache Spark + Error Handling in Scala (Athens) - Wednesday, March 4


Adam Drake of Skyscanner and Jeff Markham of Hortonworks (Singapore) - Wednesday, March 4