Data Eng Weekly

Hadoop Weekly Issue #117

19 April 2015

Hadoop Summit Europe was this week in Brussels, and there were a number of announcements and presentations related to that event (and I'm sure we'll see even more in the next few weeks). Among the announcements, Hortonworks has acquired SequenceIQ (makers of tools for Hadoop in the cloud), and HDP now fully supports Apache Spark (in case it wasn't yet clear that Spark is booming). Speaking of Spark, there are technical articles on Spark's GraphX, MLlib, and Catalyst (the Spark SQL optimizer), as well as a few great posts on distributed systems that touch on Hadoop.


AppNexus has written about their experiences with Parquet. Compared to Snappy-compressed sequence files, Snappy-compressed Parquet files use substantially less storage and improve performance (fewer and faster map tasks) across a number of Hive queries.

This post looks at how to use Spark's GraphX to process RDF data. Specifically, it runs GraphX's connected components implementation on the graph defined by the related values field of the Library of Congress Subject Headings dataset.
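To make "connected components" concrete, here is a minimal single-machine sketch in plain Python. It uses union-find rather than GraphX's distributed implementation, and the subject names in `related` are made-up sample data, not values from the dataset.

```python
from collections import defaultdict


def connected_components(edges):
    """Group vertices into connected components using union-find."""
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        # Walk up to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    for a, b in edges:
        union(a, b)

    groups = defaultdict(set)
    for v in parent:
        groups[find(v)].add(v)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical "related" links between subject headings (made-up sample data).
related = [
    ("Cooking", "Baking"),
    ("Baking", "Bread"),
    ("Astronomy", "Telescopes"),
]
components = connected_components(related)
print(components)
```

Each inner list is one group of subjects that are reachable from one another through "related" links.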

This post describes how to integrate Luigi, the workflow engine, with Google Cloud's BigQuery. The author shares some code for running BigQuery tasks and describes the improvements in throughput and cluster utilization seen after introducing Luigi.

The Databricks blog has an interesting look at how the Spark SQL Catalyst optimizer works. It discusses Trees, which are the main data type manipulated by the optimizer; Rules, which optimize a query by transforming one tree into another; the logical optimization phase; physical planning; and code generation (which makes use of Scala's quasiquotes). The post is based on a paper that's to appear at SIGMOD 2015.
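The tree-and-rule idea can be illustrated with a toy sketch in Python. This is not Catalyst's actual (Scala) API: the classes `Literal`, `Attribute`, and `Add` and the helpers `transform_up` and `fold_constants` are hypothetical stand-ins written for this example. The rule shown is constant folding, which rewrites an addition of two literals into a single literal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Apply `rule` to every node, children first, and return a new tree."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: replace Add(Literal, Literal) with one Literal; leave other nodes alone."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return node


# The expression x + (1 + 2), written as a tree.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(tree, fold_constants)
print(optimized)
```

The trees are immutable, so a rule never edits a node in place; it returns a replacement, and the traversal assembles a new tree from the results.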

The Hortonworks blog has a look at the new security-related features in Apache Ambari 2.0. These include setting up Kerberos and deploying/configuring Apache Ranger.

This presentation from Hadoop Summit covers some of the recent and ongoing work related to SQL with Hive and HBase. It looks at Hive's "Live Long And Process" daemons for sub-second queries, work on a new HBase-backed Hive metastore (to replace an RDBMS), and some thoughts on how Apache Phoenix (which adds SQL atop HBase) could leverage some of Hive's components.

The morning paper has coverage of two Hadoop-related papers this week. The first, "Cross-layer scheduling in cloud systems," looks at how a network-layer aware scheduler can improve throughput. For MapReduce, the authors show improvement during the shuffle phase. The second post looks at "ApproxHadoop: Bringing Approximations to MapReduce Frameworks," which uses sampling, task dropping, and user-defined approximation to reduce execution time when approximation is acceptable.

The CAP theorem is the basis of a lot of distributed system research and applications. Unfortunately, it's often misunderstood—particularly when it comes to availability. This post describes the various parts of the CAP theorem and gives real-world examples of several CAP trade-offs (HDFS is used as the example of a CP system).

This post looks at using Spark ML to analyze network data. It describes frequent pattern mining, and how to use Spark 1.3's implementation of Parallel FP-growth to compute it. The authors describe how Spark's implementation scales in comparison to Mahout. The post also describes MLlib's Power Iteration Clustering.
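For readers new to frequent pattern mining, the sketch below shows what is being computed: every itemset that appears in at least a given fraction of the transactions. It is a brute-force illustration in plain Python, not FP-growth and not Spark's API. It enumerates every subset of each transaction, which is exponential in transaction size; FP-growth avoids that by building a compact prefix tree. The `sessions` data is made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least
    min_support (a fraction) of the transactions."""
    min_count = min_support * len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        # Count every non-empty subset of this transaction once.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Made-up sample data: protocols observed in four network sessions.
sessions = [
    ["dns", "http", "tls"],
    ["dns", "http"],
    ["dns", "tls"],
    ["http", "tls"],
]
result = frequent_itemsets(sessions, min_support=0.5)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```

With a minimum support of 0.5, each single protocol and each pair qualifies, while the full three-protocol set (seen in only one of four sessions) does not.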


When I got started with Hadoop for a university project, "Hadoop: The Definitive Guide" was my best resource. Hadoop has changed a lot since then, and the book is now on its fourth edition. The Cloudera blog has a post describing what's new in the latest version.

Hortonworks announced that they've acquired SequenceIQ, the makers of Cloudbreak (Docker-based, cloud-agnostic provisioning of Hadoop clusters) and Periscope (autoscaling of Hadoop clusters in the cloud). The Hortonworks blog has more details on Cloudbreak and Periscope and the plans for these projects (make them available as a Tech Preview and incubate them under the Apache Software Foundation).

In January, Hortonworks established the Data Governance Initiative (DGI) with several industry partners (Aetna, Merck, Target, and SAS). This week, the members of the DGI submitted a new project called Atlas to the Apache Incubator. Apache Atlas aims to address governance requirements like data classification, centralized auditing, search and lineage, and a security/policy engine.

CMSWire has coverage of an announcement by three members of the Open Data Platform (ODP): Hortonworks, IBM, and Pivotal. The trio announced that they've standardized on Apache Hadoop 2.6 and Apache Ambari for the ODP. The article covers the announcement and discusses Cloudera's and MapR's choice not to join the ODP.

"My data is bigger than your data!" is a blog that tracks the size of Hadoop and other big data systems that folks mention at public venues. It was recently updated with new numbers from Hadoop Summit, including the size of Yahoo!'s Hadoop deployment (600PB and 43K nodes).

Enterprise Apps Today has an interview with Tom White, author of "Hadoop: The Definitive Guide." The interview discusses a number of new and growing parts of the Hadoop ecosystem, including Spark, Crunch, Flume, Parquet, YARN, and Kafka.

Apache Zeppelin (incubating) is a relatively new project (added to the incubator last December) for building web-based notebooks (a la IPython) for Spark, SQL, and more. There isn't yet a release, but there are instructions for building from source.


JethroData, which is a SQL-on-Hadoop solution, recently announced that version 1.0 is generally available. JethroData combines full indexing with a columnar store to power BI tools like Qlik, Tableau, and MicroStrategy. It is compatible with Cloudera's CDH, Hortonworks' HDP, MapR, and Amazon EMR.

Seagate has open-sourced their implementation of the Hadoop FileSystem API that's backed by Lustre. It's been tested with both CDH 5.3.1 and HDP 2.2.

Hortonworks has announced that Apache Spark 1.2.1 is now generally available as part of HDP 2.2.4. Hortonworks has integrated Spark with the ORC file format, Apache Ambari, and more.

Cloudera announced a number of releases this week. Cloudera Enterprise 5.0.6, 5.1.5, and 5.2.5 all fix key bugs in HDFS, YARN, and Hive (there are slightly different fixes in each release, though). Cloudera Director 1.1.2 includes bug fixes and support for new instance types.

Apache Spark 1.2.2 and 1.3.1 were released. Version 1.2.2 has a number of fixes to Spark Core and PySpark. Version 1.3.1 includes fixes for Spark SQL, Spark Streaming, PySpark, and Spark Core.


Curated by Datadog



If You Don't Know the "Apache Way," You Really Don't Know Open Source (Palo Alto) - Tuesday, April 21

April 2015 Hive Contributors Meetup (Santa Clara) - Wednesday, April 22

Why Is My Spark Job Failing? by Sandy Ryza of Cloudera (Santa Monica) - Thursday, April 23


Apache Spark Tutorial with Paco Nathan (Boulder) - Thursday, April 23


How McGyver Learned to Leave Duct Tape Behind and Use Spark Instead (McLean) - Wednesday, April 22

Elastic Analytics with Spark, Mesos and Docker (Vienna) - Thursday, April 23

District of Columbia

Scaling with Couchbase, Kafka, and Apache Spark (Washington) - Tuesday, April 21

New Jersey

Real-Time Data Warehouse: Hadoop on SQL and Hive (Hamilton Township) - Tuesday, April 21

New York

Couchbase, Kafka, Spark, Hadoop: Polyglot Persistence and the Big Data Pipeline (New York) - Monday, April 20


Hadoop and R, with Brent Sitterly of KSV (Burlington) - Wednesday, April 22

IRELAND

The Data Lake Use Case: Insights in Days or Weeks Rather Than Months (Dublin) - Wednesday, April 22


Time-Series Analysis with Spark and Cassandra (London) - Monday, April 20


Hadoop ELT Lab with Hive, Sqoop, Pig, HBase, and Hue (Oslo) - Monday, April 20


Recommendation Engines in the Cloud with HDInsight/Hadoop in Azure (Aarhus) - Monday, April 20


NoSQL in a Hadoop World (Munich) - Wednesday, April 22


An Introduction to Real-Time Spark (Bangalore) - Saturday, April 25


Apache Spark: Introducing DataFrames for Large Scale Data Science (Sydney) - Thursday, April 23
