Data Eng Weekly

Hadoop Weekly Issue #130

19 July 2015

This week's newsletter has several articles that cover the broad topic of distributed systems—not directly Hadoop but certainly good background. Articles on Spark, Kafka, Pig, and Hadoop security round out the technical articles, and there are quite a few releases. These include Spark, Atlas, and Druid. Finally, be sure to see the details below to enter a raffle for a free ticket to Hadoop World in NYC—submissions are accepted through end of day July 22nd.


Automated testing of distributed systems is starting to get some attention as the industry realizes that our current tools and tests aren't sufficient. This post looks at this problem and discusses three scenarios for testing—end-to-end, asynchronous data delivery, and node failure.

For testing the node failure scenario, the Eventuate chaos testing utilities look interesting. The introductory post describes using boot2docker (on Mac OS X) and docker to run several Cassandra nodes. After setting up custom routes on the host, a client can be tested as the tool randomly stops and starts containers, which simulates node failure in a distributed system.
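
To make the idea concrete, here is a minimal sketch of the stop/start loop in plain Python. It does not use the Eventuate utilities or docker; the Cluster class, the quorum_write client call, and the node names are all hypothetical stand-ins chosen for illustration.

```python
import random


class Cluster:
    """Toy stand-in for a set of containers (hypothetical, not the Eventuate tool)."""

    def __init__(self, names):
        # Every node starts in the running state.
        self.up = {name: True for name in names}

    def toggle_random_node(self, rng):
        # Pick one node at random and flip it: stop it if running, start it if stopped.
        name = rng.choice(sorted(self.up))
        self.up[name] = not self.up[name]
        return name

    def live_nodes(self):
        return [name for name, alive in self.up.items() if alive]


def quorum_write(cluster, replicas_needed):
    """Hypothetical client call: succeeds only if enough nodes are reachable."""
    return len(cluster.live_nodes()) >= replicas_needed


def run_chaos(rounds, seed=0):
    """Toggle a random node each round and record whether the client call succeeded."""
    rng = random.Random(seed)  # seeded so that a failing run can be replayed
    cluster = Cluster(["node1", "node2", "node3"])
    results = []
    for _ in range(rounds):
        cluster.toggle_random_node(rng)
        results.append(quorum_write(cluster, replicas_needed=2))
    return results
```

Seeding the random generator is a deliberate choice in this sketch: a failure found during a chaos run is much easier to debug if the same sequence of stops and starts can be reproduced.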

To continue on the distributed systems theme, this post (and the response) contain a discussion of the CAP theorem. The article recaps some key conversations about the CA category, describes CA, discusses the overlap between CP and CA, and discusses some practical considerations related to network partitions.

The Altiscale blog has a post with some tips for working with Spark. The post covers verifying Spark job configuration, configuring log levels, and where to find logs for Spark jobs.

This post accompanies slides from a recent presentation on the various mechanisms for deploying Spark. It covers the commonalities independent of deployment mechanism and the major deployment options: standalone mode, YARN, Mesos, DataStax Enterprise, and Spark in the cloud. The post also introduces the Spark Job Server, which is a REST API for managing Spark jobs.

ComputerWorld has coverage of a recent research project at MIT on building an FPGA- and flash-based server network as an alternative for distributed applications that require all data in RAM. The researchers have shown that their setup can store and process the same amount of data as an all-memory cluster at a much lower cost.

The Databricks blog has a post on Spark 1.4's new support for window functions in Spark SQL and the DataFrames API. The article describes the basics of window functions, gives several examples, and enumerates all the supported functionality.
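
For readers new to the concept, the following plain-Python sketch shows what a ranking window function computes. It is not the Spark API; the function name and the sample column names are hypothetical. The key property is that every input row comes back with an extra value computed over its partition, whereas a group-by aggregation would collapse each partition to a single row.

```python
from collections import defaultdict


def rank_within_partition(rows, partition_key, order_key):
    """Return a copy of each row with a 'rank' field added.

    Rows are ranked inside their partition, highest order_key first.
    Tied rows share a rank and the next rank is skipped, like SQL's RANK().
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    ranked = []
    for members in partitions.values():
        ordered = sorted(members, key=lambda r: r[order_key], reverse=True)
        rank = 0
        previous = None
        for position, row in enumerate(ordered, start=1):
            if row[order_key] != previous:
                # A new value starts a new rank equal to the row's position.
                rank = position
                previous = row[order_key]
            ranked.append({**row, "rank": rank})
    return ranked


sales = [
    {"product": "a", "category": "phone", "revenue": 6000},
    {"product": "b", "category": "phone", "revenue": 6000},
    {"product": "c", "category": "phone", "revenue": 3000},
    {"product": "d", "category": "tablet", "revenue": 5500},
]
ranks = {r["product"]: r["rank"]
         for r in rank_within_partition(sales, "category", "revenue")}
```

In this sample, products a and b tie for rank 1 in the phone partition, c receives rank 3, and d is rank 1 in its own partition.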

The Hortonworks blog has a post describing another one of Spark 1.4's new features: support for Apache ORC. The integration includes support for ORC's column pruning, predicate push-down, and partition pruning. The introductory post describes these features and gives examples of using ORC via Scala and Java.
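
As a rough illustration of two of these features, the sketch below models predicate push-down and column pruning in plain Python. It does not read ORC files or call Spark; the in-memory stripe layout, the scan function, and the sample data are assumptions made for the example. The only ORC behavior it relies on is that stripes carry min/max statistics per column, which lets a reader skip stripes that cannot match a filter.

```python
def scan(stripes, wanted_columns, filter_column, lower_bound):
    """Return (matching rows, number of rows actually read).

    Selects rows where row[filter_column] > lower_bound and keeps
    only the columns listed in wanted_columns.
    """
    rows_read = 0
    matches = []
    for stripe in stripes:
        # Predicate push-down: if even the stripe's maximum value fails
        # the filter, no row inside it can match, so skip it unread.
        if stripe["max"][filter_column] <= lower_bound:
            continue
        for row in stripe["rows"]:
            rows_read += 1
            if row[filter_column] > lower_bound:
                # Column pruning: materialize only the requested columns.
                matches.append({c: row[c] for c in wanted_columns})
    return matches, rows_read


# Hypothetical data: two stripes with per-stripe max statistics for "age".
stripes = [
    {"max": {"age": 30},
     "rows": [{"name": "ann", "age": 10, "city": "x"},
              {"name": "bob", "age": 30, "city": "y"}]},
    {"max": {"age": 60},
     "rows": [{"name": "cat", "age": 40, "city": "z"},
              {"name": "dan", "age": 60, "city": "x"}]},
]
matches, rows_read = scan(stripes, ["name"], "age", 35)
```

With a filter of age > 35, the first stripe is skipped entirely based on its statistics, so only the two rows of the second stripe are read.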

Apache Mesos is a system for managing resources across a fleet of servers in a data center (with some similarities to YARN). A post on the Confluent blog describes the Kafka integration with Mesos, including some of the special considerations for running a stateful service inside of Mesos. Details include the Kafka Mesos scheduler and executor, a tutorial for using the integration, and the admin REST API for the Kafka Mesos framework.

This post discusses several considerations for tuning MapReduce jobs. It focuses on the various configuration options for Pig, but the diagnostic steps and suggestions are generally applicable to several frameworks.

Apache Hadoop committer Steve Loughran has a review of the new Hadoop Security book. It concludes that "If you are trying to secure a Hadoop cluster, invest in a copy." Steve notes, though, that the book doesn't cover Hadoop and Kerberos for developers. To help fill that gap, he's started a new gitbook ebook "Kerberos and Hadoop: The Madness beyond the Gate."


O'Reilly is offering readers of Hadoop Weekly a 20% discount on any pass to the upcoming Strata + Hadoop World with discount code HADOOPW. The conference takes place September 29 - October 1st in New York. See the link below for the agenda and speaker lineup.

In addition, Hadoop Weekly subscribers can enter a raffle for a free Bronze pass to Strata+Hadoop World. This pass gives access to all sessions, all keynotes, and more. To enter, visit the link below by July 22nd and provide your email address.


Cloudera has posted a list of upcoming milestones for Impala. The article lists features added in recent releases (2.0, 2.1, and 2.2), and the features planned for 2015 and 2016. Among the highlights, Impala will gain support for nested types and EMC Isilon in the near term, and the team is predicting >20x performance gains due to some updates planned for 2016.

MapR has announced that their bookings and billings grew 100% in Q2 2015 year over year. They're also touting multiple customers with $1 million+ in bookings.

At Strata + Hadoop World NYC, O'Reilly Media and Cloudera are introducing a new Developer Showcase for new open source projects. Applications are being accepted through August 14th.

Apache NiFi, which provides a UI for building dataflows for data routing and transformation, has graduated from the Apache incubator.

ZDNet has an update on some news related to Apache Software Foundation projects. Specifically, this week there were new releases of Apache Atlas and Apache Parquet, and Apache Whirr was retired to the Apache Attic.

Crunch pegs itself as the "Practical Big Data Conference." It takes place in Budapest on October 29-30th and speakers include folks from Cloudera, Pinterest, SurveyMonkey, and Spotify. The lineup looks great, and it's one of the more reasonably priced big data conferences.


Apache Atlas (incubating), which provides data governance for Hadoop, released version 0.5.0-incubating. The release adds a project website and the ability to parameterize schema API queries.

On the heels of SparkR, which was released as part of Spark 1.4, Databricks has added support for R notebooks to its hosted Spark offering. The integration supports installation of custom R libraries and ggplot2 for visualizations.

Cloudera Enterprise 5.2.6 was released with fixes to YARN, HDFS, and Cloudera Manager.

Amazon Redshift has added support for loading data stored as Apache Avro files.

Qubole has open-sourced a Hive JDBC storage handler, which provides the ability to back a Hive table by a JDBC data source. The introductory blog post describes how the handler implements predicate pushdown, describes how it splits the input, and provides an example Hive table declaration.
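
The benefit of predicate pushdown for a database-backed table can be sketched with Python's built-in sqlite3 module. This is not Qubole's handler and says nothing about how it is implemented; the events table, its columns, and both fetch functions are hypothetical. The sketch only contrasts filtering after transfer with sending the filter to the database as a WHERE clause.

```python
import sqlite3


def make_source():
    """Create an in-memory database standing in for the remote JDBC source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, country TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, "US"), (2, "DE"), (3, "US"), (4, "FR")],
    )
    return conn


def fetch_without_pushdown(conn, country):
    # Pull every row across the connection, then filter on the client side.
    rows = conn.execute("SELECT id, country FROM events ORDER BY id").fetchall()
    transferred = len(rows)
    return [r for r in rows if r[1] == country], transferred


def fetch_with_pushdown(conn, country):
    # Send the filter to the database so only matching rows are transferred.
    rows = conn.execute(
        "SELECT id, country FROM events WHERE country = ? ORDER BY id",
        (country,),
    ).fetchall()
    return rows, len(rows)


conn = make_source()
slow_rows, slow_transferred = fetch_without_pushdown(conn, "US")
fast_rows, fast_transferred = fetch_with_pushdown(conn, "US")
```

Both calls return the same two rows, but the first transfers all four rows from the source while the second transfers only the two that match.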

Version 0.8.0 of Druid, the columnar distributed data store, was released. The new release includes new compression support, new aggregators, various improvements and bug fixes, and more. The release notes include upgrade instructions.

Apache Spark 1.4.1 was released this week. It includes a large number of stability fixes in areas such as the DataFrames and Data Sources APIs, MLlib, PySpark, and SparkR.

Qubole, makers of a Hadoop/Spark/Hive/Presto as a Service platform for Amazon Web Services, have open-sourced a connector for Presto and Amazon Kinesis (which is a managed service, in some ways similar to Kafka, for streaming data sets). A post on the AWS big data blog describes the integration and gives a walkthrough for trying it out.

Apache Parquet MR 1.8.0 was released this week. It fixes some important bugs (including one that could lead to corruption), and it includes some backwards incompatible changes to the APIs.

Apache Lens 2.2.0-beta-incubating was released. Lens provides a unified system for doing data analytics across many data sources including Hadoop. The new release includes bug fixes, improvements, integration with Apache Zeppelin, CLI improvements, and more.

In other Apache NiFi news, version 0.2.0-incubating was released this week. This version includes a number of bug fixes and improvements, including a NiFi Storm Spout, MongoDB processors, and a database connection polling controller service.


Curated by Datadog



Jay Kreps: Apache Kafka and the Rise of the Stream Data Platform (Newark) - Monday, July 20

Building Machine Learning Applications with Sparkling Water (Mountain View) - Tuesday, July 21

July Samza Meetup (Mountain View) - Tuesday, July 21

A New Way to Run Cassandra (San Francisco) - Wednesday, July 22

Fluentd, Docker and All That: Logging Infrastructure 2015 Edition (Redwood City) - Thursday, July 23


11 Almost-Truisms about Data, with Paco Nathan (Seattle) - Friday, July 24


Scaling Spark Workloads on YARN (Denver) - Wednesday, July 22

New York

Lifting the Hood on Spark Streaming (New York) - Thursday, July 23


Apache Spark, Spark + Cassandra, Cassandra 2.2/3.0 (London) - Wednesday, July 22


Machine Learning in Scala with Spark (Milan) - Thursday, July 23


Yarn vs Mesos Deathmatch & Scalding over Hadoop (Tel Aviv-Yafo) - Thursday, July 23