Data Eng Weekly

Hadoop Weekly Issue #104

18 January 2015

We have a lot of great technical posts this week covering security in Hadoop, recent improvements to Spark, HDFS’ new heterogeneous disk support, and the relatively new Apache incubator project NiFi. In addition, Apache Flink graduated from the incubator, Etsy open-sourced a new tool for profiling Hadoop jobs, and much more.


An important feature for an enterprise storage system is data protection. This post looks at Hadoop’s data protection features including replication, checksums, HDFS snapshots, and distcp for cross-cluster replication. It also looks at some enterprise software from EMC that can integrate via Hadoop’s S3 API support, and the tools that come from major vendors. There’s also an overview of what protections exist for different types of events (data corruption, accidental deletion, etc.).

Spotify has written another post about their personalization pipeline, which is powered by Kafka, Hadoop, and Cassandra. While the post from last week focussed on the recommendation-building layer, this post looks at the database layer, which is powered by Cassandra. The article describes things like schema design, cross-datacenter replication, and bulk transfer. If you’re using or thinking of using Cassandra in production, there are several highly relevant details.

Apache NiFi (incubating) has support for a number of different data transfer and file formats. This post looks at its support for HTTP and Kafka—it walks through building a pipeline to download a zip-compressed file over HTTP, extract messages from files within the zip, and send those messages to Kafka.

Another post on Apache NiFi discusses using the system for ETL and gives a tour of the data provenance features. The post argues for combining data from multiple sources in order to simplify common operations against data, and it describes how you can track the source of data in such a setup using the built-in data provenance tools.

The eBay tech blog has a post on HDFS’ (relatively new) support for tiered storage. The post describes the feature, which allows a cluster to define different types of storage, and various storage policies that describe how to map replicas to tiers. eBay uses this feature to move much of their data to machines with a small amount of compute power but a large disk density (220TB each).
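
As a rough illustration of how a storage policy maps replicas to tiers, here is a minimal Python sketch. The policy names (HOT, WARM, COLD) and their placements follow the archival storage policies documented for HDFS, but the `POLICIES` table and the `place_replicas` helper are hypothetical and are not part of any HDFS API.

```python
# Hypothetical model of HDFS-style storage policies (not the HDFS API).
# Each policy maps a replication factor to the storage type of every replica.
POLICIES = {
    "HOT": lambda n: ["DISK"] * n,                       # all replicas on DISK
    "WARM": lambda n: ["DISK"] + ["ARCHIVE"] * (n - 1),  # one on DISK, rest on ARCHIVE
    "COLD": lambda n: ["ARCHIVE"] * n,                   # all replicas on ARCHIVE
}


def place_replicas(policy, replication=3):
    """Return the storage type chosen for each replica of a block."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if policy not in POLICIES:
        raise KeyError("unknown storage policy: " + policy)
    return POLICIES[policy](replication)


if __name__ == "__main__":
    for name in ("HOT", "WARM", "COLD"):
        print(name, place_replicas(name))
```

In a setup like the one eBay describes, rarely read data would sit under a COLD-style policy so that its replicas land on the dense, low-compute archive machines.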

The Hortonworks blog also has a post on HDFS’ heterogeneous storage support. The post describes archival storage (similar to the eBay setup) as well as SSD storage and (available in technical preview) in-memory storage.

Spark has been criticized for not being as scalable as MapReduce on large clusters and datasets. Some folks from Cloudera and Intel have been working on improving the scalability and performance of one of the bottlenecks—the shuffle phase of Spark jobs. This post looks at those changes in detail and provides some performance comparisons of the new shuffle implementation.

The Databricks blog has a post on journaling for fault-tolerance in Spark streaming. The post motivates the feature (which is necessary when consuming data from systems like Kafka and Flume), describes the implementation (which uses a write-ahead log on a distributed file system like HDFS), and describes how to use and configure journaling for Spark streaming.

The AWS blog has a post in which BloomReach describes several techniques that they use to manage costs of Elastic MapReduce. Tips include using spot instances, shared clusters for smaller jobs, tags for tracking cost, and a simple algorithm for computing the best instance type given fluctuations in spot pricing.
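
The summary does not spell out BloomReach's algorithm, so the following Python sketch is only one plausible way to pick an instance type under fluctuating spot prices: choose the type whose worst recent price per unit of capacity is lowest. The function name, the instance type names, the prices, and the capacity figures are all made up for illustration.

```python
def best_instance_type(price_history, capacity):
    """Pick the type with the lowest worst-case spot price per capacity unit.

    price_history: dict of instance type -> list of recent spot prices ($/hour)
    capacity: dict of instance type -> compute units offered by that type
    Both inputs are hypothetical; this is not BloomReach's actual method.
    """
    def worst_cost_per_unit(instance_type):
        # Using the maximum recent price penalizes types with price spikes.
        return max(price_history[instance_type]) / capacity[instance_type]

    return min(price_history, key=worst_cost_per_unit)


if __name__ == "__main__":
    # Illustrative numbers only, not real spot prices.
    prices = {
        "type_a": [0.04, 0.05, 0.20],  # cheap on average, but spiky
        "type_b": [0.12, 0.13, 0.12],  # pricier on average, but stable
    }
    units = {"type_a": 4, "type_b": 4}
    print(best_instance_type(prices, units))
```

Ranking by the maximum recent price instead of the mean favors stable types; a mean-based ranking would prefer `type_a` for the sample numbers above.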

The Cloudera blog has a post about setting up a Hadoop cluster. It covers key areas like networking, DNS, OS configuration, storage, and hardware specs. There are a few CDH-specific recommendations, but overall this is a great resource for anyone building a Hadoop cluster.

This is a quick post describing how to interface with Avro data from Spark. The post shows how to convert CSV data to Avro and then write a small Spark program in Scala for reading the data.

This post describes some of the tools that are popular when analyzing scientific data with Hadoop. A lot of popular tools use non-JVM languages (R, Python), which can make it difficult to interface with Hadoop. This post looks at the current state of support (and what can be improved) across three key areas - storing open data, web-based notebooks (popularized by IPython), and distributed data frames.


Apache Flink, a distributed data processing framework, graduated from the Apache incubator this week. Originally called Stratosphere, the project has some similarities to Apache Spark (e.g. it aims to be faster than MapReduce for iterative algorithms).

Qubole’s weekly roundup of Hadoop happenings has coverage of ten industry articles. These include Hadoop for enterprise, big data for banking, and Hadoop at Autodesk. There are also several industry prediction/retrospective articles.

Databricks and O’Reilly have announced a new online Spark Certified Developer exam. The exam takes about 90 minutes and costs $300.

Twitter has created an online course at Udacity for Apache Storm. It’s a free course at an intermediate level, and is expected to take about 2 weeks.

Fortune has an interview with MapR CEO John Schroeder, where he talks about the company’s milestones, IPO plans, and priorities for 2015.

ComputerWorld UK has coverage of an event in London during which Expedia announced plans to expand their Hadoop cluster and spoke about their experiences with Apache Falcon. The article describes a few ways in which they’re using Falcon at Expedia.


Wibidata, makers of the open-source Kiji Framework, have announced a new feature called Experiments by Wibi. The system allows retailers to set up experiments and split traffic between experiment and control groups in order to reach statistically significant conclusions.

rabbitmq-flume-plugin is a Flume plugin providing a source and sink for RabbitMQ. Version 1.0.3 was released this week.

Etsy has open-sourced their tool for profiling Hadoop jobs. The system, statsd-jvm-profiler, is a Java agent which outputs data using the StatsD protocol. They include scripts for visualizing the data using flame graphs, and have an overview of the implementation and some potential pitfalls.

Cloudian has announced a new version of its HyperStore software, which is fully compliant with the Amazon S3 API and offers integration with Apache Hadoop.


Curated by Mortar Data



Apache Ambari for Hadoop Clusters, with Newton Alex and Alexander Denissov (Palo Alto) - Tuesday, January 20

January SF Hadoop Users Meetup (San Francisco) - Wednesday, January 21

Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, January 21

Ingesting HDFS Data into Solr Using Spark (Palo Alto) - Wednesday, January 21


Kubernetes and Mesos Power 90 (Seattle) - Wednesday, January 21


Surfing Data with Apache Drill and SQL-on-Hadoop Technologies (Boulder) - Thursday, January 22


Target's Journey with Hadoop (Minneapolis) - Wednesday, January 21

The Best of the Bunch for SQL-on-Hadoop (Saint Paul) - Thursday, January 22


Revolution Analytics (Saint Louis) - Tuesday, January 20


Intro to Spark (Nashville) - Wednesday, January 21


Brandon Newell from MapR Presents on Apache Drill (Huntsville) - Tuesday, January 20


Intro to Hadoop & Analytics (Atlanta) - Wednesday, January 21


Hadoop & Hive/Pig (Harrisburg) - Tuesday, January 20

New Jersey

Streaming and CEP: DataTorrent vs Storm vs Spark (Hamilton Township) - Tuesday, January 20

New York

Apache Mesos for Apache Kafka and Apache Accumulo (New York) - Wednesday, January 21

Self-Service Data Exploration and Nested Data Analytics on Hadoop (New York) - Friday, January 23

Spark on/with Hadoop (Amherst) - Thursday, January 22


Introduction to Big Data Techniques for Cybersecurity (Rockville) - Wednesday, January 21


Get the Most Out of Spark on YARN (Boston) - Tuesday, January 20


Deploying and Using Spark on AWS (Vancouver) - Monday, January 19


ElasticSearch and Apache Spark (Bath) - Thursday, January 22


Conoce Spark Streaming (Madrid) - Tuesday, January 20

Fifth Barcelona Spark Meeting (Barcelona) - Thursday, January 22


Big Data and Real-time Analytics with Spark (Bangalore) - Friday, January 23