Hadoop Weekly Issue #104

18 January 2015

We have a lot of great technical posts this week covering security in Hadoop, recent improvements to Spark, HDFS’ new heterogeneous disk support, and the relatively new Apache incubator project NiFi. In addition, Apache Flink graduated from the incubator, Etsy open-sourced a new tool for profiling Hadoop jobs, and much more.

Technical

An important feature for an enterprise storage system is data protection. This post looks at Hadoop’s data protection features including replication, checksums, HDFS snapshots, and distcp for cross-cluster replication. It also looks at some enterprise software from EMC that can integrate via Hadoop’s S3 API support, and the tools that come from major vendors. There’s also an overview of what protections there are for different types of events (data corruption, accidental deletion, etc.).

http://www.beebotech.com.au/2015/01/data-protection-for-hadoop-environments/
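
As a rough illustration of how checksums catch the data corruption case mentioned above, here is a minimal Python sketch. It is not HDFS code: the 512-byte chunk size is chosen for illustration, `zlib.crc32` stands in for whatever checksum a real system uses, and both function names are hypothetical.

```python
import zlib

CHUNK_SIZE = 512  # illustrative chunk size, chosen for this sketch


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC32 per fixed-size chunk of `data`."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

Storing the checksums at write time and recomputing them at read time pinpoints which chunk went bad, so only that chunk has to be fetched again from a healthy replica.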

Spotify has written another post about their personalization pipeline, which is powered by Kafka, Hadoop, and Cassandra. While the post from last week focused on the recommendation-building layer, this post looks at the database layer, which is powered by Cassandra. The article describes things like schema design, cross-datacenter replication, and bulk transfer. If you’re using or thinking of using Cassandra in production, there are several highly relevant details.

https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/

Apache NiFi (incubating) has support for a number of different data transfer and file formats. This post looks at its support for HTTP and Kafka—it walks through building a pipeline to download a zip-compressed file over HTTP, extract messages from files within the zip, and send those messages to Kafka.

https://blogs.apache.org/nifi/entry/integrating_apache_nifi_with_apache
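
The flow described in the NiFi post (fetch a zip archive, unpack it, publish each message) can be sketched outside NiFi in a few lines of Python. This only illustrates the data flow and is not how NiFi implements it: the archive is built in memory instead of being downloaded, one line is assumed to equal one message, and `send` is a hypothetical stand-in for a Kafka producer call.

```python
import io
import zipfile


def extract_messages(zip_bytes):
    """Yield one message per non-empty line of every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                for raw_line in member:
                    line = raw_line.decode("utf-8").strip()
                    if line:
                        yield line


def run_pipeline(zip_bytes, send):
    """Pass each extracted message to `send` (hypothetical producer hook)."""
    count = 0
    for message in extract_messages(zip_bytes):
        send(message)
        count += 1
    return count


def make_sample_zip():
    """Build a small archive in memory as a stand-in for the HTTP download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "first\nsecond\n")
        archive.writestr("b.txt", "third\n")
    return buffer.getvalue()
```

Calling `run_pipeline(make_sample_zip(), print)` prints the three sample messages. In a real setup `send` would wrap a producer from a Kafka client library.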

Another post on Apache NiFi discusses using the system for ETL and gives a tour of the data provenance features. The post argues for combining data from multiple sources in order to simplify common operations against data, and it describes how you can track the source of data in such a setup using the built-in data provenance tools.

https://blogs.apache.org/nifi/entry/basic_dataflow_design

The eBay tech blog has a post on HDFS’ (relatively new) support for tiered storage. The post describes the feature, which allows a cluster to define different types of storage, and various storage policies that describe how to map replicas to tiers. eBay uses this feature to move much of their data to machines with a small amount of compute power but a large disk density (220TB each).

http://www.ebaytechblog.com/2015/01/12/hdfs-storage-efficiency-using-tiered-storage/
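
To make the idea of a storage policy mapping replicas to tiers concrete, here is a small Python sketch. It is a simplified model, not HDFS code: it assumes the HOT, WARM, and COLD policies behave as commonly documented (all replicas on DISK, one on DISK with the rest on ARCHIVE, and all on ARCHIVE), it ignores fallback behaviour when a tier is full, and `place_replicas` is a hypothetical name.

```python
def place_replicas(policy, replication=3):
    """Return the storage type chosen for each replica under a policy."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if policy == "HOT":
        # Frequently read data: every replica on regular disk.
        return ["DISK"] * replication
    if policy == "WARM":
        # Cooling data: keep one replica on disk, archive the rest.
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if policy == "COLD":
        # Rarely read data: every replica on dense archival nodes.
        return ["ARCHIVE"] * replication
    raise ValueError("unknown policy: " + policy)
```

Under this model, moving a dataset from HOT to COLD shifts all of its replicas onto the dense, low-compute machines of the kind eBay describes.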

The Hortonworks blog also has a post on HDFS’ heterogeneous storage support. The post describes archival storage (similar to the eBay setup) as well as SSD storage and (available in technical preview) in-memory storage.

http://hortonworks.com/blog/heterogeneous-storage-policies-hdp-2-2/

Spark has been criticized for not being as scalable as MapReduce on large clusters and datasets. Some folks from Cloudera and Intel have been working on improving the scalability and performance of one of the bottlenecks—the shuffle phase of Spark jobs. This post looks at those changes in detail and provides some performance comparisons of the new shuffle implementation.

http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

The Databricks blog has a post on journaling for fault-tolerance in Spark streaming. The post motivates the feature (which is necessary when consuming data from systems like Kafka and Flume), describes the implementation (which uses a write-ahead log on a distributed file system like HDFS), and describes how to use and configure journaling for Spark streaming.

http://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
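
The write-ahead log idea behind this feature can be shown with a generic Python sketch: each record is written durably before it is acknowledged, so it can be replayed after a crash. This is not Spark's implementation. It writes to a local temporary file rather than a distributed file system, and `WriteAheadLog`, `receive`, and `demo` are hypothetical names.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log: a record is made durable before it is acknowledged."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the write to disk before returning

    def replay(self):
        """Return every logged record, in order; used after a restart."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def receive(wal, buffer, record):
    """Log first, then buffer in memory, then acknowledge."""
    wal.append(record)
    buffer.append(record)
    return "ack"


def demo():
    """Simulate a crash: the in-memory buffer is lost, the log is not."""
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    buffer = []
    for offset in (1, 2):
        receive(WriteAheadLog(path), buffer, {"offset": offset})
    buffer.clear()  # the process dies and its memory is gone
    return WriteAheadLog(path).replay()
```

The ordering in `receive` is the whole point: acknowledging the source only after the log write means a crash can never lose a record that the source believes was delivered.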

The AWS blog has a post in which BloomReach describes several techniques that they use to manage costs of Elastic MapReduce. Tips include using spot instances, shared clusters for smaller jobs, tags for tracking cost, and a simple algorithm for computing the best instance type given fluctuations in spot pricing.

http://blogs.aws.amazon.com/bigdata/post/Tx3L1N4PH3MPPIF/Strategies-for-Reducing-Your-Amazon-EMR-Costs
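
One simple way to choose an instance type as spot prices move is to compare price per unit of compute. The Python sketch below is an assumption about what such a calculation could look like, not BloomReach's actual algorithm. The instance names, prices, and capacity figures are made up, and `cheapest_instance_type` is a hypothetical helper.

```python
def cheapest_instance_type(spot_prices, compute_units):
    """Pick the type with the lowest spot price per unit of compute."""
    candidates = {
        name: spot_prices[name] / compute_units[name]
        for name in spot_prices
        if name in compute_units and compute_units[name] > 0
    }
    if not candidates:
        raise ValueError("no instance type has both a price and a capacity")
    return min(candidates, key=candidates.get)


# Made-up capacity figures and hourly spot prices, for illustration only.
UNITS = {"small": 2, "large": 8}
CALM_MARKET = {"small": 0.10, "large": 0.30}
LARGE_SPIKE = {"small": 0.10, "large": 0.50}
```

With these numbers, `cheapest_instance_type(CALM_MARKET, UNITS)` selects "large", and after the price spike in `LARGE_SPIKE` the choice flips to "small".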

The Cloudera blog has a post about setting up a Hadoop cluster. It covers key areas like networking, DNS, OS configuration, storage, and hardware specs. There are a few CDH-specific recommendations, but overall this is a great resource for anyone building a Hadoop cluster.

http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/

This is a quick post describing how to interface with Avro data from Spark. The post shows how to convert CSV data to Avro and then write a small Spark program in Scala for reading the data.

http://matthieulieber.blogspot.com/2015/01/how-to-load-some-avro-data-into-spark.html

This post describes some of the tools that are popular when analyzing scientific data with Hadoop. A lot of popular tools use non-JVM languages (R, Python), which can make it difficult to interface with Hadoop. This post looks at the current state of support (and what can be improved) across three key areas - storing open data, web-based notebooks (popularized by IPython), and distributed data frames.

http://www.tom-e-white.com/2015/01/hadoop-for-science.html

News

Apache Flink, a distributed data processing framework, graduated from the Apache incubator this week. Originally called Stratosphere, the project has some similarities to Apache Spark (e.g. it aims to be faster than MapReduce for iterative algorithms).

http://www.datanami.com/2015/01/12/apache-flink-takes-route-distributed-data-processing/

Qubole’s weekly roundup of Hadoop happenings has coverage of ten industry articles. These include Hadoop for enterprise, big data for banking, and Hadoop at AutoDesk. There are also several industry prediction/retrospective articles.

http://www.qubole.com/re-framing-hadoop/

Databricks and O’Reilly have announced a new online Spark Certified Developer exam. The exam takes about 90 minutes and costs $300.

http://databricks.com/blog/2015/01/16/spark-certified-developer-exams-available-online.html

Twitter has created an online course at Udacity for Apache Storm. It’s a free course at an intermediate level, and is expected to take about 2 weeks.

https://www.udacity.com/course/ud381

Fortune has an interview with MapR CEO John Schroeder, where he talks about the company’s milestones, IPO plans, and priorities for 2015.

http://fortune.com/2015/01/14/mapr-eyes-late-2015-ipo/

ComputerWorld UK has coverage of an event in London during which Expedia announced plans to expand their Hadoop cluster and spoke about their experiences with Apache Falcon. The article describes a few ways in which they’re using Falcon at Expedia.

http://www.computerworlduk.com/news/it-business/3593977/expedia-to-double-its-apache-hadoop-cluster-investment-this-year/

Releases

Wibidata, makers of the open-source Kiji Framework, have announced a new feature called Experiments by Wibi. The system allows retailers to set up experiments and send traffic to the experiment and control groups in order to reach statistically significant conclusions.

http://venturebeat.com/2015/01/12/this-new-wibidata-software-could-help-retailers-sell-far-more-stuff-online/

rabbitmq-flume-plugin is a Flume plugin providing a source and sink for RabbitMQ. Version 1.0.3 was released this week.

https://github.com/aweber/rabbitmq-flume-plugin/releases/tag/1.0.3

Etsy has open-sourced their tool for profiling Hadoop jobs. The system, statsd-jvm-profiler, is a Java agent which outputs data using the StatsD protocol. They include scripts for visualizing the data using flame graphs, and have an overview of the implementation and some potential pitfalls.

https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/

Cloudian has announced a new version of its HyperStore software, which is fully compliant with the Amazon S3 API and offers integration with Apache Hadoop.

http://siliconangle.com/blog/2015/01/16/cloudian-integrates-apache-hadoop-to-turn-big-data-into-smart-data/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Apache Ambari for Hadoop Clusters, with Newton Alex and Alexander Denissov (Palo Alto) - Tuesday, January 20
http://www.meetup.com/Pivotal-Open-Source-Hub/events/219339139/

January SF Hadoop Users Meetup (San Francisco) - Wednesday, January 21
http://www.meetup.com/hadoopsf/events/219049168/

Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, January 21
http://www.meetup.com/hadoop/events/167785202/

Ingesting HDFS Data into Solr Using Spark (Palo Alto) - Wednesday, January 21
http://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/219786571/

Washington

Kubernetes and Mesos Power 90 (Seattle) - Wednesday, January 21
http://www.meetup.com/Seattle-Mesos-Meetup/events/219684171/

Colorado

Surfing Data with Apache Drill and SQL-on-Hadoop Technologies (Boulder) - Thursday, January 22
http://www.meetup.com/CU-Leeds-Business-Analytics/events/219688951/

Minnesota

Target's Journey with Hadoop (Minneapolis) - Wednesday, January 21
http://www.meetup.com/Skyway-Software-Symposium/events/219393925/

The Best of the Bunch for SQL-on-Hadoop (Saint Paul) - Thursday, January 22
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/219661224/

Missouri

Revolution Analytics (Saint Louis) - Tuesday, January 20
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/219101118/

Tennessee

Intro to Spark (Nashville) - Wednesday, January 21
http://www.meetup.com/Nashville-Hadoop-Meetup/events/218911783/

Alabama

Brandon Newell from MapR Presents on Apache Drill (Huntsville) - Tuesday, January 20
http://www.meetup.com/Huntsville-Big-Data-Meetup/events/219152975/

Georgia

Intro to Hadoop & Analytics (Atlanta) - Wednesday, January 21
http://www.meetup.com/Atlanta-Society-for-Business-Intelligence/events/219394022/

Pennsylvania

Hadoop & Hive/Pig (Harrisburg) - Tuesday, January 20
http://www.meetup.com/Central-PA-Hadoop-and-Big-Data-User-Group/events/217628962/

New Jersey

Streaming and CEP: DataTorrent vs Storm vs Spark (Hamilton Township) - Tuesday, January 20
http://www.meetup.com/nj-hadoop/events/219357759/

New York

Apache Mesos for Apache Kafka and Apache Accumulo (New York) - Wednesday, January 21
http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/219073804/

Self-Service Data Exploration and Nested Data Analytics on Hadoop (New York) - Friday, January 23
http://www.meetup.com/New-York-Apache-Drill-Meetup/events/219730994/

Spark on/with Hadoop (Amherst) - Thursday, January 22
http://www.meetup.com/buffalolab/events/219349543/

Maryland

Introduction to Big Data Techniques for Cybersecurity (Rockville) - Wednesday, January 21
http://www.meetup.com/Capital-Area-Cyber-Security/events/219333009/

Massachusetts

Get the Most Out of Spark on YARN (Boston) - Tuesday, January 20
http://www.meetup.com/bostonhadoop/events/219194392/

CANADA

Deploying and Using Spark on AWS (Vancouver) - Monday, January 19
http://www.meetup.com/Vancouver-Spark/events/197407432/

UNITED KINGDOM

ElasticSearch and Apache Spark (Bath) - Thursday, January 22
http://www.meetup.com/South-West-ElasticSearch-Community/events/219311668/

SPAIN

Conoce Spark Streaming (Madrid) - Tuesday, January 20
http://www.meetup.com/Madrid-Apache-Spark-meetup/events/219689767/

Fifth Barcelona Spark Meeting (Barcelona) - Thursday, January 22
http://www.meetup.com/Spark-Barcelona/events/186862692/

INDIA

Big Data and Real-time Analytics with Spark (Bangalore) - Friday, January 23
http://www.meetup.com/Spark_big_data_analytics/events/208197842/
