Data Eng Weekly

Hadoop Weekly Issue #124

08 June 2015

Hadoop Summit is this week in San Jose, so I expect there to be a lot of great technical content and lots of announcements. If you see something interesting, please send it my way for next week's issue. For those who aren't attending the conference, there are many great articles this week to hold you over.


The Cask blog has a post describing how the Cask Data Application Platform uses Apache HBase as a queue for their real-time stream processing framework. The post details how data is partitioned across the key space in HBase, how HBase and Tephra are used to provide exactly-once delivery semantics, how producers and consumers interact with the queue (including during a rebalance), and performance numbers (they see 100k events/sec on a ten-node HBase cluster).
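To make the partitioning idea concrete, here is a minimal sketch (not Cask's actual scheme) of salted row keys in Python: a bucket byte spreads queue entries across the HBase key space while a sequence number keeps FIFO order within each partition. The partition count and key layout are assumptions for illustration.

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical number of consumer instances

def queue_row_key(event_id: bytes, seq: int) -> bytes:
    """Build a row key with a partition-bucket prefix so that each
    consumer instance can scan a disjoint, contiguous key range."""
    bucket = int(hashlib.md5(event_id).hexdigest(), 16) % NUM_PARTITIONS
    # bucket byte first -> entries spread across region servers;
    # big-endian sequence number next -> FIFO order within a partition
    return bytes([bucket]) + seq.to_bytes(8, "big") + event_id

def partition_of(row_key: bytes) -> int:
    """Recover the partition bucket from a row key's first byte."""
    return row_key[0]
```

Keys for the same event id land in the same bucket and sort by sequence number, which is what lets a consumer do an ordered range scan per partition.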

This post on the Cloudera blog describes some common patterns and solutions for near real-time processing. First, it describes a Flume and Kafka-based architecture for streaming ingestion and some light-weight event processing (e.g. adding some attributes to events as they flow in). Second, it looks at using Spark streaming with Kafka to perform more complex computations like sessionization and building ML models.
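For readers new to sessionization, a minimal single-machine sketch of the computation (plain Python rather than Spark Streaming, with a hypothetical 30-minute inactivity timeout) looks like this:

```python
from typing import Iterable

SESSION_GAP = 30 * 60  # hypothetical inactivity timeout, in seconds

def sessionize(events: Iterable[tuple[str, int]]) -> dict[str, list[list[int]]]:
    """Group (user, timestamp) events into sessions: a new session
    starts whenever the gap since the user's previous event exceeds
    SESSION_GAP."""
    sessions: dict[str, list[list[int]]] = {}
    for user, ts in sorted(events, key=lambda e: e[1]):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= SESSION_GAP:
            user_sessions[-1].append(ts)  # continue the current session
        else:
            user_sessions.append([ts])    # start a new session
    return sessions
```

In a streaming job the same gap logic runs over windowed state per key instead of a sorted in-memory list, but the grouping rule is the same.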

The upcoming Spark 1.4 release will add a number of useful features and functions for working with DataFrames. These include random data generation, summary statistics, cross tabulation, and more. The Databricks blog has an overview of these and other key new features.
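As a rough illustration of what cross tabulation computes (a plain-Python stand-in, not the Spark 1.4 `DataFrame.stat.crosstab` API itself):

```python
from collections import Counter

def crosstab(rows, col1, col2):
    """Pairwise frequency table of two columns: for each value of col1,
    count how often each value of col2 co-occurs with it."""
    counts = Counter((r[col1], r[col2]) for r in rows)
    keys1 = sorted({k for k, _ in counts})
    keys2 = sorted({k for _, k in counts})
    return {k1: {k2: counts.get((k1, k2), 0) for k2 in keys2}
            for k1 in keys1}
```

Spark performs the same aggregation distributed across partitions and returns the result as a DataFrame.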

The Hortonworks blog has the first in a series of posts on Apache Zeppelin (incubating), which is a web-based notebook for Spark, Hive, and more. This post describes how to build, run, and configure Zeppelin (there are a few HDP-specific instructions, but they should mostly work across distros). It also has brief examples of using Spark, Spark SQL, and the built-in visualization directives.

Version 1.99.5 of Sqoop2 introduces integration with the Kite SDK. For folks using Kite, this will greatly simplify importing data from (or exporting data to) an RDBMS. A blog post describes how to configure and execute the integration.

Twitter has written about their next-generation stream processing framework. Called Heron, the system is API compatible with Apache Storm (which was also developed at Twitter before open-sourcing) but provides better scalability, throughput, and congestion/load spike controls. The Twitter blog has an overview of the system and contains a link to a research paper presented at the SIGMOD '15 conference.

Metamarkets has written about how they use Druid, Apache Samza, and Apache Kafka both as a platform for real-time data and for their monitoring stack. Each component of the stack periodically emits timestamped and tagged metrics to a separate metrics cluster, which allows both for dashboards and exploratory analytics.

CDH 5.4 and Cloudera Manager 5.4 have support for redacting sensitive information from log files (across all daemons) and sources of queries (Impala/Hive/HUE). The Cloudera blog has an overview of how to configure this new feature via Cloudera Manager.
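The mechanism behind log redaction is regex-based find/replace applied to every line before it is written. A minimal sketch of the idea (the rule below is a hypothetical example, not Cloudera Manager's actual configuration format):

```python
import re

# Hypothetical redaction rule in the spirit of CM's policies:
# mask anything that looks like a 16-digit credit-card number.
RULES = [
    (re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),
]

def redact(line: str) -> str:
    """Apply each (pattern, replacement) rule to a log line in order."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line
```

In CDH the rules are configured centrally in Cloudera Manager and pushed to all daemons, so the masking happens before sensitive values ever reach disk.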

This post describes how Taboola monitors their Spark cluster, which processes 10B+ user events daily. The article has a number of details about their deployment, such as the specs of their Spark nodes and how they monitor metrics and log files. It also describes the common types of problems that they see (both application and machine-layer).

The dataArtisans blog has a guest post about how Bouygues Telecom uses Apache Flink with Apache Kafka to do streaming transformation of data. Using Apache Camel, raw data is loaded into Kafka. Flink consumes the data, enriches it, and feeds it back to Kafka. Logic was also added to the Flink streaming job to monitor the flow of data and fire alerts.

KDnuggets has an interview with James Taylor of Salesforce and the Apache Phoenix project, which provides SQL on HBase. Topics covered include the motivation and creation of Phoenix, a comparison of HBase's data model with Phoenix's, some of Phoenix's optimizations, and plans for upcoming releases (including support for transactions).

If you haven't yet taken a serious look at Apache Spark (or don't know what it is), this post gives a good overview of Spark. It describes the fundamental data structures, describes the shell, and gives an overview of the included APIs.

"Scalability! But at what COST?" is a paper that looks for the Configuration that Outperforms a Single Thread. General-purpose distributed compute engines introduce coordination overhead and typically require recasting an algorithm into a particular form (e.g. the vertex-centric model of many graph libraries), which often adds extra work. This post includes graphs showing how much faster a single-threaded implementation can be: it outperforms most distributed systems running on 128 cores for PageRank and connected components.
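The single-threaded baseline the paper argues for is genuinely simple. A minimal PageRank over an edge list (an illustrative sketch, not the paper's actual Rust code) fits in a few lines:

```python
def pagerank(edges, iterations=20, d=0.85):
    """Plain single-threaded PageRank over an edge list -- the kind of
    laptop baseline the COST paper measures distributed systems against.
    Dangling nodes simply leak rank mass; fine for a sketch."""
    nodes = {n for e in edges for n in e}
    out_deg = {n: 0 for n in nodes}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_deg[src]
        rank = {n: (1 - d) / len(nodes) + d * incoming[n] for n in nodes}
    return rank
```

There is no partitioning, shuffling, or serialization here, which is exactly the overhead the paper asks distributed systems to justify.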

There are some subtleties to how checksums work for files created by HBase in conjunction with HDFS's checksums. This post describes the types of checksums, how they're used, and how to optimize the computation by using native code. The post also shows the effects of switching to native checksums by including flame graph figures.
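For context, HDFS checksums data in small chunks (512 bytes by default) rather than whole files. A simplified sketch of per-chunk checksumming in Python (using zlib's CRC32 as a stand-in, since the stdlib doesn't expose the CRC32C that HDFS actually uses):

```python
import zlib

BYTES_PER_CHECKSUM = 512  # HDFS's default chunk size for checksums

def chunk_checksums(data: bytes) -> list[int]:
    """Compute one CRC per 512-byte chunk, as HDFS does for block data.
    zlib.crc32 runs as native code, which is the point of the post's
    optimization: native checksums are far cheaper than Java/pure ones."""
    return [
        zlib.crc32(data[off:off + BYTES_PER_CHECKSUM])
        for off in range(0, len(data), BYTES_PER_CHECKSUM)
    ]

def verify(data: bytes, checksums: list[int]) -> bool:
    """Recompute and compare; any corrupted chunk flips its CRC."""
    return chunk_checksums(data) == checksums
```

Chunk-level checksums let a reader detect corruption without re-checksumming the whole block, which matters when HBase layers its own HFile checksums on top.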


King, makers of Candy Crush and other games, use Hadoop and EXASOL (an in-memory analytics db) to analyze the player experience. This post describes a few ways that the setup has brought value to the company—giving loyal players better customer support and discovering which levels were too difficult. The post also describes why they added an in-memory db to their analytics infrastructure.

The Platform has an interview with Concurrent CEO Gary Nakamura about Hadoop adoption in the enterprise. Concurrent has an interesting perspective on things since they have clients running Cascading across many different distributions. Of note, many companies keep Hadoop in an experimental phase for up to two years before rolling it out.

There are two new free massive open online courses on Apache Spark on the edX website. The first, "Introduction to Big Data with Apache Spark," started last Monday. The second, "Scalable Machine Learning," starts on June 29th.

The WSJ blog has a post about how Under Armour/MyFitnessPal use Apache Spark to analyze data. The post describes, at a high-level, how they've built the Verified Foods feature by normalizing and aggregating calorie and nutrition information. They are also using Spark for BI and to power a recommendation engine.

This week's O'Reilly Data Show Podcast has an interview with Databricks co-founder Patrick Wendell. There's a partial transcript of the show online in which Patrick describes several industry usages: Spark SQL at Microsoft, MLlib at Airbnb, and MLlib at Huawei. There is also a discussion of features in upcoming Spark releases.

The blog has an interview with Subash D'Souza, who is co-organizer of the Big Data Day LA conference on June 27th. The interview notes that many companies in LA have adopted big data technologies.


Apache Flume 1.6.0 was recently released. Highlights include a Kafka source and sink, a Kafka channel, a Hive Sink, end-to-end authentication, and a regex-based find/replace interceptor.

Apache Slider 0.80.0 was released last week. The Hortonworks blog has a summary of the key new features in the release. These include support for containerized applications, zero-downtime upgrades, and more.

Apache Oozie 4.2.0 was released. The new version includes support for HiveServer2 and Spark. It also improves re-running of workflow and coordinator actions.

The LinkedIn team has open-sourced Burrow, a system for monitoring consumer lag in a Kafka cluster. Burrow automatically finds all consumers using Kafka for committing offsets, provides status checks via HTTP, and can trigger email alerts. Burrow is written in Go and the code and documentation are on GitHub.
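Consumer lag itself is a simple quantity: the broker's log-end offset minus the consumer's last committed offset, per partition. A toy sketch of the computation and a naive status rule (not Burrow's actual evaluation, which examines windows of recent commits rather than a fixed threshold):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: how far the consumer's committed offset
    trails the broker's log-end offset."""
    return {
        tp: log_end_offsets[tp] - committed_offsets.get(tp, 0)
        for tp in log_end_offsets
    }

def status(lag_history, max_lag=1000):
    """A toy status rule (hypothetical threshold): ERROR if lag is over
    the threshold and still growing, WARN if merely over it, else OK."""
    latest = lag_history[-1]
    if latest > max_lag:
        return "ERROR" if latest > lag_history[0] else "WARN"
    return "OK"
```

Monitoring the trend rather than a single snapshot is the key idea: a large but shrinking lag usually means the consumer is catching up, not stuck.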

Impyla, which provides Python bindings to Cloudera Impala, has released version 0.10.0. The new version has a number of new features, including Python 3 support for the DB API, Kerberos support, a SQLAlchemy type compiler, bug fixes, and more.

Qubole, which provides a Hadoop-as-a-Service offering in the cloud, has announced support for Apache Hadoop 2.6.0 (the latest stable release).

Apache Pig 0.15.0 was released. It includes many improvements to the Pig on Tez implementation (stabilization and improved parallelism) and support for running Hive UDFs from Pig.


Curated by Datadog



Soar Higher with HAWQ: MPP SQL & Advanced Analytics on Hadoop from Pivotal (San Jose) - Monday, June 8

Hadoop Summit Meetup: Apache Atlas (San Jose) - Monday, June 8

Pre-Hadoop Summit Meetup: Apache ORC and Apache Ambari (San Jose) - Monday, June 8

Big Data & The Open Data Platform Initiative (San Jose) - Monday, June 8

Anomaly Detection and NLP Using Hadoop (San Jose) - Monday, June 8

Machine Learning Pipelines, Spark Packages, and Getting to Production (San Jose) - Monday, June 8

Hadoop Summit Birds of a Feather Sessions (San Jose) - Thursday, June 11. Topics: Zeppelin, Data Science & Spark; Women in Hadoop; Accumulo; YARN; HDFS; Hive/Tez/Pig; Ranger/Knox; Atlas/Falcon; Hadoop Engineering Best Practices

Getting Started with Apache Cassandra & Spark (Newport Beach) - Thursday, June 11


Introduction to Streaming Distributed Processing Using Storm (Bellevue) - Wednesday, June 10


Learn about NiFi (Houston) - Tuesday, June 9


Cloudera Search, Hadoop and SQL! (Grand Rapids) - Wednesday, June 10

District of Columbia

Visualizing Data in a Hadoop Environment (Washington) - Thursday, June 11


The Three Dimensions of Scalable Machine Learning (Cambridge) - Monday, June 8


Spark Meetup with Cloudera, Xebia, and Influans (Paris) - Thursday, June 11


Data Science at Scale and Graph Analysis with Dean Wampler and Paco Nathan (Amsterdam) - Monday, June 8


First Apache Mesos Meetup (Hamburg) - Tuesday, June 9


Tachyon Meetup (Herzliya Pituach) - Monday, June 8