Data Eng Weekly


Hadoop Weekly Issue #88

21 September 2014

There are a number of posts covering the recently released Apache Spark 1.1, Apache Drill 0.5.0-incubating, and Apache Tez 0.5.0. In addition, there’s a look at Hadoop in the healthcare industry, a look at ORCFile for non-Hive workloads, instructions for building a Hadoop setup on Mac, and more. The amount of content this week shows that we’re past the summer lull, and I expect to see lots more great content this fall.

Technical

The first of several posts on Apache Spark 1.1.0, this one covers Spark Streaming. New features of Spark Streaming in the 1.1.0 release include integration with Amazon Kinesis and high availability for Apache Flume. The post gives an intro to Spark Streaming, the new features, and talks through several example use cases.

http://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html

Spark 1.1 also includes improvements to PySpark, which provides Python access to Spark. The main improvement is API support for arbitrary Hadoop InputFormats. This post includes an example of using SequenceFiles from PySpark and a walk-through of the new Converter trait, which is used to map custom types to POJOs.

http://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html

This tutorial has instructions for setting up a single-node Hadoop cluster including HDFS, YARN, HBase, and Flume on a Mac. The instructions are for CDH but most of the content should be useful for any distribution.

http://blog.cloudera.com/blog/2014/09/how-to-install-cdh-on-mac-osx-10-9-mavericks/

This post is an introduction and overview of stream processing frameworks. It introduces use-cases (including a closer look at fraud detection), gives an overview of stream processing architecture, and surveys several systems (Apache Storm, Apache Spark, IBM InfoSphere Streams, and TIBCO StreamBase). The article also has a section on integrating with Hadoop and other data warehouses.

http://www.infoq.com/articles/stream-processing-hadoop

The Cloudera blog has a post summarizing a research project out of Zurich University to evaluate Cloudera Impala for mixed workloads. The post describes the use case, the report conclusions (namely that Impala scales linearly with more users), and includes a link to the full evaluation.

http://blog.cloudera.com/blog/2014/09/how-impala-supports-mixed-workloads-in-multi-user-environments/

This post looks at using Apache Spark in two different ways. First, it shows how to compute summary statistics and other aggregations on a data stream. Second, it explores generating Parquet files from Spark. After a few false starts (trying to use Scala libraries for Avro, Thrift, and Protobuf), the post shows how to use Java Protobufs as the write support for Parquet.

http://arnon.me/2014/09/apache-spark-parquet/
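
For readers new to the idea of summary statistics over a stream, here is a minimal plain-Python sketch of the one-pass approach (Welford's method). It is not the Spark code from the post; the class name RunningStats and its methods are made up for illustration. The merge step shows why this kind of aggregate can be computed per batch or per partition and then combined.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass count, mean, and population variance (Welford's method).

    Hypothetical helper for illustration; not an API from Spark or the post.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Update the aggregate with a single new observation.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial aggregates (e.g. from two batches).
        if other.count == 0:
            return self
        if self.count == 0:
            return other
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 when nothing has been observed.
        return self.m2 / self.count if self.count else 0.0


# Usage: feed two batches separately, then combine them.
first, second = RunningStats(), RunningStats()
for value in [1.0, 2.0]:
    first.add(value)
for value in [3.0, 4.0]:
    second.add(value)
combined = first.merge(second)
```

Because only three numbers are kept per aggregate, the state stays constant in size no matter how long the stream runs.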

Apache Samza (incubating) is a stream processing framework that integrates well with Apache Kafka. Samza apps run on Apache YARN. This post on the LinkedIn blog describes how they use Kafka and Samza as a platform for distributed tracing (looking at the interaction between services in a service-oriented architecture). They describe a number of improvements to Samza that they have made to scale distributed tracing. The post also includes a description of using Samza to implement some foundational operations such as a cogroup.

http://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza
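
If the term cogroup is unfamiliar: it groups two keyed datasets by key, so that each key maps to the values from both sides. The following is a small in-memory Python sketch of that idea only. It is an assumption-laden illustration (the function name and sample data are hypothetical) and does not reproduce how the post implements the operation on Samza streams.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, object]


def cogroup(left: Iterable[Pair],
            right: Iterable[Pair]) -> Dict[Hashable, Tuple[List, List]]:
    """Group two (key, value) collections by key.

    Returns {key: (values_from_left, values_from_right)}. A key that
    appears on only one side gets an empty list for the other side.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


# Usage with made-up data: pair up request and response events by trace id.
requests = [("t1", "GET /feed"), ("t2", "GET /profile"), ("t1", "GET /ads")]
responses = [("t1", 200), ("t3", 500)]
grouped = cogroup(requests, responses)
```

Unlike a join, a cogroup keeps keys that appear on only one side, which is useful when some events may be missing or late.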

These slides are from a presentation walking through the evolution of the data pipeline at Tapad. They first describe Tapad’s data challenges and then walk through the data pipeline, which eventually converged on Avro and Kafka as the core components. The slides include details about how Tapad moves data from Kafka to HDFS (rewriting as Parquet along the way) and uses Scalding to build MapReduce jobs.

http://www.slideshare.net/tobym/data-pipeline-at-tapad

ORCFile is a columnar storage format which comes with Hive. The format can be used with other systems, such as Cascading and Apache Crunch. This article provides an introduction to ORCFile, provides examples of using it with Cascading and Crunch, and provides some example benchmarks demonstrating impressive performance improvements.

http://hortonworks.com/blog/using-orcfile-cascading-apache-crunch/

Altiscale, as a Hadoop as a Service provider, has seen customers write and deploy Spark applications. They’ve put together some guidelines for when it’s worth considering Spark instead of MapReduce. Among the cases they list: when you need one of Spark’s machine learning or graph algorithm implementations, or when existing MapReduce jobs are slow or implement iterative algorithms.

https://www.altiscale.com/apache-spark/

News

A post on SiliconANGLE looks at a number of surveys and reports about Hadoop adoption and usage. It tries to answer questions about who is using Hadoop, how it’s being used (e.g. SQL vs search), and in which industries Hadoop adoption is the strongest.

http://siliconangle.com/blog/2014/09/19/the-state-of-hadoop-2014-whos-using-it-and-why/

Apache Spark has gained a lot of momentum over the past year, and a lot of folks see it as an evolutionary replacement of MapReduce. A post on Datanami suggests three areas that Spark needs to improve in order for that to happen. The areas are high-end scalability (thousands or tens of thousands of nodes), publication of successful case studies, and short- and long-term backwards compatibility (learning from Hadoop’s mistakes).

http://www.datanami.com/2014/09/15/three-things-apache-spark-needs-hadoop-hadoop/

Databricks and O’Reilly have announced a new certification program for Apache Spark developers. The program includes an exam that validates technical expertise in Spark. The first set of certifications will be done around Strata NY + Hadoop World in October.

http://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html
http://radar.oreilly.com/2014/09/announcing-spark-certification.html

This article looks at the role that Hadoop plays in the healthcare industry. Notably, while centralizing data in Hadoop has a lot of advantages for analysis and generating insight, it also adds security risk (since all your data is in one place). The article includes a brief look into how the Children’s Hospital Los Angeles is using Hadoop.

http://www.virtual-strategy.com/2014/09/15/why-hadoop-just-what-doctor-ordered

The Amplify Partners Data & Analytics Fellowship covers the cost of travel and registration to the upcoming Strata Conference. Applicants should “demonstrate passion and potential to meaningfully contribute to the field.” Amplify is giving extra consideration to individuals who do not typically have access to these types of opportunities. Applications are due on September 30th.

http://www.amplifypartners.com/fellowships/amplify-partners-data-analytics-fellowship/

Cloudera and Dell announced that Dell has joined the Cloudera systems integrator program. More details on the program and Dell’s offering are in the press release.

http://cloudera.com/content/cloudera/en/about/press-center/press-releases/2014/09/15/cloudera-names-dell-a-systems-integrator-partner.html

Releases

Apache Tez 0.5.0 was recently released. It includes a number of improvements across APIs, documentation, security, and more. More details on the improvements of the release can be found in the original release announcement on the Tez mailing list as well as in a post on the Hortonworks blog.

http://mail-archives.apache.org/mod_mbox/tez-user/201409.mbox/%3C8127cda3931c8d61b61a43a2aeaff28a%40mail.gmail.com%3E
http://hortonworks.com/blog/introducing-apache-tez-0-5/

Adding to recent buzz around Apache Tez, SequenceIQ has announced a new docker image to run Tez. It builds on the Ambari docker image, and there’s a script to deploy a multi-node cluster.

http://blog.sequenceiq.com/blog/2014/09/19/apache-tez-cluster/

A new beta release of Apache Drill, version 0.5.0, was announced. Drill is a SQL query system for data in Hadoop and in other data stores. The new version uses Hadoop 2.4.1, improves sorting when data doesn’t fit in memory, and includes several other improvements. The release resolves over 100 issues, and the Drill team is aiming to do monthly releases as they march towards GA.

https://blogs.apache.org/drill/entry/apache_drill_beta_release_see

After the recent release of Apache Spark 1.1.0, the folks at SequenceIQ have published a new docker image for that project. Running Spark in a docker container can be a great way of getting started, and the post has a few examples of running Spark jobs with it.

http://blog.sequenceiq.com/blog/2014/09/17/spark-1-1-0-docker/

MapR announced that they’re including the recently released Apache Drill 0.5 in their distribution. An introductory post on the MapR blog provides a number of reasons why you might want to adopt Drill. At the top of the list is Drill’s ANSI SQL support, which eases integration with other systems. The post also highlights several features that (as a whole) differentiate Drill, such as query without centralized schema, support for nested data, and out-of-the-box ease of use.

http://www.cio.com/article/2683676/big-data/mapr-aims-to-take-sql-on-hadoop-to-next-level.html
https://www.mapr.com/blog/top-10-reasons-using-apache-drill-now-part-mapr-distribution-including-hadoop

In addition to adding support for Apache Drill, the 4.0.1 release of the MapR Distribution includes updates to several ecosystem projects. Updates include Hadoop core 2.4.1, Spark 1.0.2, and HBase 0.98.4. Also, Apache Storm is certified with MapR 4.0.1 and Tez is in a developer preview.

https://www.mapr.com/blog/apache-open-source-package-updates-september-include-apache-drill-apache-spark-102-and-more

Amazon Web Services announced a new EMR File System, which replicates metadata about the state of data in S3 to DynamoDB. S3 on its own is only eventually consistent (particularly for prefix listings), and by tracking this metadata in DynamoDB the EMR File System can provide a consistent view of the data. The post has some details on getting started with the EMR File System, which requires an initial sync command to initialize the data in DynamoDB.

http://aws.amazon.com/blogs/aws/emr-consistent-file-system/

EPIC is a new product from BlueData for provisioning Hadoop. It bundles KVM, RHEL, and cloud management from OpenStack. EPIC is certified with the Hortonworks distribution (HDP) and also supports Cloudera’s CDH. EPIC One, a single node version, is available now. The enterprise edition is still in beta but expected in Q4.

http://www.datanami.com/2014/09/17/self-provision-hadoop-five-clicks-bluedata-says/

SequenceIQ has also released new docker images for Apache Hadoop 2.5.1. Their post features some examples for interacting with a container running the image.

http://blog.sequenceiq.com/blog/2014/09/15/hadoop-2-5-1-docker/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Intro to Hadoop: Hype or Reality? You Decide, with Kevin Crocker (Palo Alto) - Tuesday, September 23
http://www.meetup.com/Pivotal-Open-Source-Hub/events/203775512/

Women in Analytics September Event: Hadoop and Other DB Technology (San Francisco) - Thursday, September 25
http://www.meetup.com/Women-in-Analytics-Bay-Area/events/205884332/

How to Offload the ELT SQL in Your Data Warehouse into Hadoop Automatically (Mountain View) - Thursday, September 25
http://www.meetup.com/Hadoop-Talks/events/197024412/

Discussion of Hadoop Use Cases vs. Runtime Environments, by Tom Phelan of BlueData (Los Angeles) - Thursday, September 25
http://www.meetup.com/LA-HUG/events/200568132/

Hadoop: Where Did It Come from and What's Next? by Eric Baldeschwieler (Pasadena) - Thursday, September 25
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/

Washington

Getting Started with SQL on Hadoop (Seattle) - Tuesday, September 23
http://www.meetup.com/Big-Data-Developers-in-Seattle/events/198462582/

Seattle Scalability Meetup: Google, HWX, Zulily (Seattle) - Wednesday, September 24
http://www.meetup.com/Seattle-Scalability-Meetup/events/174605492/

Utah

EMR, S3, and Hadoop Use Cases (South Jordan) - Thursday, September 25
http://www.meetup.com/BigDataUtah/events/204733282/

Minnesota

Large-Scale Analytics with Apache Spark (Saint Paul) - Monday, September 22
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/203324672/

Michigan

Tech Talk with Eddie Garcia, Info Security Architect at Cloudera (Southfield) - Tuesday, September 23
http://www.meetup.com/greatlakes_cug/events/203300692/

Ohio

NKU Big Data Series: Hadoop 101 (Cincinnati) - Thursday, September 25
http://www.meetup.com/CincinnatiBI/events/201721372/

Alabama

Jeff Holoman Presents on Cloudera Distribution of Apache Hadoop (Huntsville) - Wednesday, September 24
http://www.meetup.com/Huntsville-Big-Data-Meetup/events/200988162/

Florida

Centralized Logging: Industry First Approach to HBase Fans (Jacksonville) - Thursday, September 25
http://www.meetup.com/HUGNOFA/events/184997382/

North Carolina

Making Business Decisions with SAS & Hadoop (Charlotte) - Wednesday, September 24
http://www.meetup.com/CharlotteHUG/events/167352382/

Pennsylvania

YARN (Pittsburgh) - Tuesday, September 23
http://www.meetup.com/HUG-Pittsburgh/events/202171882/

New Jersey

Tableau Deep Dive: Big Data Visualization (Hamilton Township) - Tuesday, September 23
http://www.meetup.com/nj-hadoop/events/205597122/

Massachusetts

Full-Day Hadoop MapReduce Hands-On Workshop (Cambridge) - Friday, September 26
http://www.meetup.com/FREE-Big-Data-Hands-On-Workshops/events/206050702/

FRANCE

Hadoop User Group: YARN, Falcon, HBase... (Paris) - Monday, September 22
http://www.meetup.com/Hadoop-User-Group-France/events/204787122/

SPAIN

Third Spark Barcelona Meeting (CSIC) (Barcelona) - Monday, September 22
http://www.meetup.com/Spark-Barcelona/events/186861962/

ENGLAND

Real-Time Analytics Using Indexed MapReduce (London) - Thursday, September 25
http://www.meetup.com/Scale-Warriors-of-London/events/207630082/

CANADA

Intro to Lambda Architectures & Development (Toronto) - Friday, September 26
http://www.meetup.com/TorontoHUG/events/202031062/

INDIA

SolrCloud, Solr + Hadoop 2 & Nutch Integration (Bangalore) - Saturday, September 27
http://www.meetup.com/Bangalore-Baby-Apache-Solr-Group/events/200146762/

RUSSIA

HadoopKitchen (Moscow) - Saturday, September 27
http://www.meetup.com/Hadoop-Moscow/events/195025402/