Data Eng Weekly


Hadoop Weekly Issue #75

22 June 2014

Security in Hadoop is a big topic in this week’s issue—there’s coverage from Accumulo Summit, and posts from both Cloudera and Hortonworks on the topic. This week’s issue also covers technical posts on the Kiji Framework, Hadoop + Docker, and Etsy’s predictive modeling software, Conjecture.

Technical

Slides from the recent Accumulo Summit have been posted online. There are 17 presentations from folks at Cloudera, Hortonworks, Sqrrl, and more. Topics include Accumulo and YARN, the Accumulo community, and security for Accumulo.

http://www.slideshare.net/AccumuloSummit

The DataStax Cassandra SF users group hosted a talk on building applications with the Kiji Framework. Kiji used to be tied to HBase but recently added Cassandra as a storage backend. Hakka Labs has a video of the presentation, which covers integrating Cassandra and Kiji.

http://www.hakkalabs.co/articles/building-big-data-applications-kiji-framework

There have been plenty of articles and presentations on the differences between Hadoop 1 & Hadoop 2, but there are so many new features that it's good to reiterate them. These slides do a good job of summarizing the major changes without diving too deep into implementation details, and they also provide a good overview of what's next for Hadoop.

http://www.slideshare.net/bigdatajoe/hadoop-past-present-and-future-v11-36085792

I've been spending a lot of time with Docker recently, and it seems to open up a whole lot of new possibilities. In this case, SequenceIQ is deploying Apache Ambari as a Docker container, which in turn spins up a full Hadoop cluster. They're doing some interesting things with Serf for discovering the nodes in the cluster and setting up DNS. The ultimate goal is a Docker-based, cloud-independent system for deploying Hadoop clusters.

http://blog.sequenceiq.com/blog/2014/06/17/ambari-cluster-on-docker/

In the first of two posts by Cloudera on Project Rhino and Hadoop security, the Cloudera blog covers some of the technical details of implementing encryption at rest on HDFS. As a quick recap, Project Rhino is a push started by Intel to bring hardware-accelerated encryption support to Hadoop. As part of the Cloudera/Intel deal, Cloudera has joined Intel in working on Project Rhino. This post covers the role of a key server and hardware acceleration in implementing HDFS encryption at rest.

http://blog.cloudera.com/blog/2014/06/project-rhino-goal-at-rest-encryption/

Etsy has open-sourced Conjecture, their predictive modeling pipeline. The framework uses Scalding for parallelized model training. The post unveiling Conjecture contains more background on the Scalding jobs as well as other parts of the system.

http://codeascraft.com/2014/06/18/conjecture-scalable-machine-learning-in-hadoop-with-scalding/

Kickstarter has posted a presentation on their data pipeline. While they’re using Redshift for the majority of the analysis, they make use of Sqoop and Hadoop Streaming on Amazon’s EMR as well. There are some interesting ideas around their workflow for data requests using Trello.

https://www.kickstarter.com/backing-and-hacking/data-kickstarter-a-guided-tour

The MSDN Blog has a post on using a JSON SerDe with Hive running on Microsoft Azure HDInsight. The post is fairly Windows-specific (the Hortonworks blog had a similar post a few months ago targeted towards Linux), but it is full of details that should be helpful regardless.

http://blogs.msdn.com/b/bigdatasupport/archive/2014/06/18/how-to-use-a-custom-json-serde-with-microsoft-azure-hdinsight.aspx

This post covers tuning Cloudera Search (which is built on SolrCloud). It covers the HDFS cache configuration and tuning parameters as well as the SolrCloud settings.

http://techkites.blogspot.com/2014/06/performance-tuning-and-optimization-for.html

News

GigaOm has an article about how Yahoo tests and drives new features for Hadoop at scale. The post talks about how Hortonworks, which was spun off from the Hadoop team at Yahoo, benefits from the close relationship between the two teams. The post mentions HDP 2 and Storm on YARN as examples of the close partnership.

http://gigaom.com/2014/06/16/when-it-comes-to-hadoop-yahoo-is-still-hortonworks-secret-weapon/

The reliability of the Hadoop platform has increased by leaps and bounds in recent years via various high-availability and scalability initiatives. But the reliability and ease of debugging of complex jobs running on Hadoop haven’t really gotten any better. A Forbes contributor article explores this question from a business/competitor perspective. The article asks if this is hurting Hadoop adoption, discusses some of the main causes of job failure, and talks about what some of the Hadoop-as-a-Service vendors are doing to help reliability.

http://www.forbes.com/sites/danwoods/2014/06/16/solving-the-mystery-of-hadoop-reliability/

The Hortonworks blog has a post with details on the security offering from the recently acquired XA Secure product. XA Secure supplements Hadoop's current security with centralized administration, authorization across Hive, HDFS, and HBase, and an audit system. Hortonworks has previously said that this software will be open-sourced in the future.

http://hortonworks.com/blog/hortonworks-offers-holistic-comprehensive-security-hadoop/

Cloudera has also been talking a lot about security, and this post summarizes the state of security on Hadoop. This post explores in detail all of the goals of Project Rhino as well as how it fits together with Apache Sentry (incubating). The post also mentions that some of the employees Cloudera gained as part of the Gazzang acquisition will be working on Project Rhino.

http://vision.cloudera.com/project-rhino-and-sentry-onward-to-unified-authorization/

Radoop, makers of analytics and visualization software for data stored in Hadoop, were acquired this week by RapidMiner. RapidMiner and Radoop were previously partners, and this will help give RapidMiner better support for Hadoop.

http://techcrunch.com/2014/06/16/rapidminer-scoops-up-radoop-for-hadoop-analytics-processing/

The Ovum blog has a post on the Actian Analytics Platform-Hadoop SQL Edition. Yet another entry in the SQL-on-Hadoop space, Actian takes a different approach. Unlike Hive, Impala, and other solutions, Actian doesn’t use the Hive metastore, and it supports full ANSI SQL 2003 and ACID (via version control). Interestingly, Actian is using Hadoop to scale out their existing solution, which was previously limited to a single node (about 10TB).

http://ovum.com/2014/06/17/actian-plants-its-sql-on-hadoop-stake/

Datanami has a post about stealth startup BlueData, which aims to be “VMWare for big data.” The software aims to ease the complexity of installing Hadoop as well as offering elasticity in the data center. It’s currently in beta.

http://www.datanami.com/2014/06/17/bluedata-eyes-market-hadoop-vms/

Releases

Google has added support for automatic provisioning of Apache Spark and Shark for clusters running in Google Compute Engine. Version 0.34.3 of bdutil for the Google Cloud Platform contains these changes. The new version also contains support for configuring a SOCKS proxy to access the cluster's web UI.

https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE

ElasticSearch announced version 2 of their Hadoop connector, which is certified for CDH 5 (in addition to supporting Hortonworks and MapR distributions).

http://www.marketwired.com/press-release/elasticsearch-now-certified-on-cloudera-enterprise-5-releases-new-hadoop-connector-1922422.htm

MapR has announced support for Hive 0.13, Hue 3.5, Sqoop 2, Oozie 4, and HBase 0.94.17 for their distribution. They support multiple versions of Hive, including 0.11 and 0.12, and each of the new versions can run across MapR 3.x releases (3.0.3, 3.1.0, and 3.1.1).

http://www.mapr.com/blog/june-release-ecosystem-packages-mapr-include-hive-13-sqoop-2-and-more

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

An Intro to Apache Spark (with Hadoop) for Plain Old Java Geeks (Palo Alto) - Tuesday, June 24
http://www.meetup.com/Pivotal-Open-Source-Hub/events/184781662/

Unraveling Hadoop Meltdown Mysteries (Palo Alto) - Tuesday, June 24
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/188261382/

YARN Over MR1: An Operational Win, Presented by Karthik Kambatla (Santa Monica) - Tuesday, June 24
http://www.meetup.com/LA-HUG/events/184628072/

Pepperdata Meetup: War Stories from the Hadoop Trenches (Sunnyvale) - Wednesday, June 25
http://www.meetup.com/Pepperdata/events/183257052/

Rubicon's Hadoop Program (Irvine) - Wednesday, June 25
http://www.meetup.com/OC-HUG/

Apache Sentry: Enterprise-Grade Security for Hadoop (San Jose) - Wednesday, June 25
http://www.meetup.com/BigDataGurus/events/188283272/

Application Architectures with Apache Hadoop: Putting the Pieces Together (Oakland) - Wednesday, June 25
http://www.meetup.com/eastbayjug/events/187925272/

Open Space, Including "Test-Driven Hadoop Workshop" (Westlake Village) - Wednesday, June 25
http://www.meetup.com/codecraftgroup/events/185879572/

LOPSA SD Monthly Meeting: An Introduction to Hadoop (San Diego) - Thursday, June 26
http://www.meetup.com/LOPSA-SD/events/185478552/

Cancer Genomics and Data Sciences (Milpitas) - Sunday, June 29
http://www.meetup.com/Big-Data-Science/events/166815862/

Colorado

Apache Spark: Resilient Distributed Datasets (Denver) - Thursday, June 26
http://www.meetup.com/Distributed-Computing-Denver/events/188620662/

Illinois

Apache Spark as Cross-Over Hit for Data Science (Chicago) - Wednesday, June 25
http://www.meetup.com/Data-Science-Chicago/events/187556982/

North Carolina

June CHUG: Enterprise-Grade Security on Hadoop, with Johndee Burks of Cloudera (Charlotte) - Wednesday, June 25
http://www.meetup.com/CharlotteHUG/events/167111782/

Pennsylvania

Apache Spark (Philadelphia) - Tuesday, June 24
http://www.meetup.com/PhillyDB/events/188406532/

New Jersey

Intro to Big Data and Hadoop with a Cloudera Architect (Upper Montclair) - Tuesday, June 24
http://www.meetup.com/NJ-Cloud-Architects/events/186025182/

New York

Network Design Considerations and Challenges for Hadoop 'Big Data' Environments (New York) - Wednesday, June 25
http://www.meetup.com/Arista-Networks-NYC-Area-Group/events/186710762/

CANADA

Ontario

Ottawa Hadoop User Group Kick-Off Meetup, with Guest Speaker from Hortonworks (Ottawa) - Tuesday, June 24
http://www.meetup.com/HadoopOttawa/events/187358662/

YARN Roadmap, with Guest Speaker Joseph Niemiec (Toronto) - Wednesday, June 25
http://www.meetup.com/TorontoHUG/events/176006282/

NETHERLANDS

Bol.com's Multifactor Hadoop-Based Recommender + Hadoop Warehousing with Impala (Utrecht) - Wednesday, June 25
http://www.meetup.com/Netherlands-Hadoop-User-Group/events/177087352/

SWEDEN

SHUG 12: Improving Developer Productivity in the Big Data World, with David Whiting (Stockholm) - Wednesday, June 25
http://www.meetup.com/stockholm-hug/events/189592672/