Data Eng Weekly

Hadoop Weekly Issue #122

24 May 2015

This week's edition has great technical content describing best practices for Apache Cassandra and Apache HBase, testing in distributed systems, how YARN uses Linux cgroups, and much more. There were also a number of high-profile releases this week: Apache Drill 1.0, Apache Hive 1.2, Cascading 2.7, and a new project, dplyr-spark. There's lots to catch up on for those folks with a long weekend in the US!


This blog post describes several hard-won lessons learned deploying Cassandra for time-series data at scale. These include understanding CQL, COMPACT STORAGE, Cassandra counters, row sizes, and more. The end of the post has pointers to documentation of Cassandra best practices aimed at anyone just getting started.
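For readers new to CQL, a minimal sketch of a time-series table of the kind the post discusses (table and column names are hypothetical): the composite partition key bounds row sizes by splitting each sensor's data by day, and COMPACT STORAGE trades schema flexibility (at most one non-key column) for a denser on-disk layout.

```sql
-- Hypothetical time-series table: one partition per (sensor, day),
-- clustered by event time within the partition.
CREATE TABLE metrics (
    sensor_id  text,
    day        text,
    event_time timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH COMPACT STORAGE;
```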

Testing a distributed system is difficult, and it's a problem that hasn't gotten a lot of attention. This post looks at some of the work that's been done, defines what it means for a distributed system to "work," describes tools for testing distributed systems, and details the types of tests that are part of the Apache Slider framework (among them the Slider integration Chaos Monkey).

This presentation from Strata & Hadoop World in London describes the architecture of Apache Flink. Topics covered include the Flink engine, data streaming analysis, Flink streaming (APIs, windowing, checkpointing), Flink for batch processing, memory management, and Flink's ML and Graph libraries.

The Dynamic Yield engineering blog has a post on Apache HBase that enumerates four best practices: ensuring good key distribution, generating HBase-friendly IDs, working with HBase snapshots, and when to use (or not use) HBase for real-time analytics.

This post, the first in a series, looks at the advantages of using schemas for data (vs CSV, JSON, XML, etc). It also describes why a serialization format, such as JSON, isn't enough by itself.
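The argument can be made concrete with a minimal Avro schema (a sketch with a hypothetical record and fields): unlike bare JSON, it pins down field types, marks which fields are required, and carries a default value that supports schema evolution.

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```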

With traditional Linux packaging, only a single version of a package is installed at a time. But when doing rolling upgrades, it's best to have multiple versions installed in order to minimize downtime. The Hortonworks blog describes how HDP 2.2 installs multiple versions simultaneously by using RPMs and Debs that include the version number in the package name. HDP also includes a tool to update symlinks to activate a specific version.
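The symlink pattern is easy to see with plain shell. This is a toy sketch, not HDP's actual tooling; the directory names and version strings are made up.

```shell
# Toy sketch of side-by-side versioned installs with a "current" symlink
# (the pattern behind HDP's versioned packages; paths are hypothetical).
set -e
base=$(mktemp -d)
mkdir -p "$base/hadoop-2.2.0.0-2041" "$base/hadoop-2.2.4.2-2"   # two versions coexist
ln -s "$base/hadoop-2.2.0.0-2041" "$base/current"               # activate the old version
readlink "$base/current"
ln -sfn "$base/hadoop-2.2.4.2-2" "$base/current"                # flip the link on upgrade
readlink "$base/current"
```

Because only the symlink changes, the old version stays on disk and a rollback is just another `ln -sfn`.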

The Altiscale blog has kicked off a blog series containing tips for running Spark on Hadoop with an overview of the three common ways to invoke Spark. These are local mode (single JVM), YARN cluster (spark-submit), and YARN client (distributed spark-shell). The post describes when each mode is most appropriate.
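The three invocation forms look roughly like the following (a sketch with hypothetical jar and class names; the commands are printed rather than executed, since they require a Spark installation).

```shell
# Sketch of the three ways to invoke Spark circa Spark 1.x
# (application jar/class names are hypothetical).
LOCAL="spark-submit --master local[4] --class com.example.App app.jar"
YARN_CLUSTER="spark-submit --master yarn-cluster --class com.example.App app.jar"
YARN_CLIENT="spark-shell --master yarn-client"
printf '%s\n' "$LOCAL" "$YARN_CLUSTER" "$YARN_CLIENT"
```

In local mode everything runs in one JVM; in yarn-cluster mode the driver runs inside the YARN application master; in yarn-client mode the driver (here, an interactive shell) stays on the local machine while executors run on the cluster.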

This post looks at using Sqoop to import data from MySQL into Hive and export it back out. It includes examples and describes several caveats related to metadata (e.g. an export can only work on files in HDFS, not on a Hive table as described in the Hive metastore).
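Typical invocations look like the following (a sketch with hypothetical host, database, and path names; the commands are printed rather than executed, since they require a Sqoop installation). Note that the export points at an HDFS directory, not at a Hive table name.

```shell
# Sketch of Sqoop import/export (connection details are hypothetical).
IMPORT="sqoop import --connect jdbc:mysql://db.example.com/shop \
 --username etl --table orders --hive-import --hive-table orders"
# Export works on files in HDFS, not on the Hive metastore's view of a table:
EXPORT="sqoop export --connect jdbc:mysql://db.example.com/shop \
 --username etl --table orders_summary \
 --export-dir /user/hive/warehouse/orders_summary"
printf '%s\n' "$IMPORT" "$EXPORT"
```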

The Hortonworks blog has a guest post about benchmarking Hive 0.11, 0.13, and 0.14 against two vendors' systems. There are a number of interesting things about this post: seeing the speedup on a real-world use case from Hive 0.11 to 0.13 to 0.14, how Hive holds its own against multiple vendors, and how the author collects metrics to evaluate query performance and bottlenecks.

The Financial Information eXchange (FIX) is a delimited, key-value pair format. This post describes how to query data in the format using Hive and Impala. There are several tricks and advanced Hive features demonstrated in the post, such as defining a table with various "TERMINATED BY" declarations and building a view.
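The core trick looks roughly like this (a hedged sketch, not the post's exact DDL; table, view, and location names are made up): treat each FIX message as a single map column by declaring the pair and key delimiters, then build a view that pulls out named tags.

```sql
-- Sketch: model each FIX message as one map<string,string> column.
-- FIX separates key=value pairs with '\001' (SOH); FIELDS TERMINATED BY
-- is set to a character absent from the data so the whole line is one field.
CREATE EXTERNAL TABLE fix_raw (message map<string, string>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '\001'
  MAP KEYS TERMINATED BY '='
LOCATION '/data/fix';

-- A view exposing selected tags by number (35 = MsgType, 55 = Symbol).
CREATE VIEW fix_messages AS
SELECT message['35'] AS msg_type, message['55'] AS symbol
FROM fix_raw;
```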

The Hortonworks blog has a post describing how YARN uses Linux cgroups to ensure CPU isolation when vcore resource allocation is enabled. The post describes how (at a high-level) cgroups work, the basic configuration, advanced configuration (e.g. hard vs. soft limits), and provides some examples of the cgroup hierarchy.
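The basic configuration amounts to switching the NodeManager to the cgroups-aware container executor (a sketch of the relevant yarn-site.xml properties; exact paths and additional settings vary by distribution and version).

```xml
<!-- Sketch of yarn-site.xml settings for cgroups-based CPU isolation. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- Hard limit: containers cannot borrow idle CPU beyond their vcore share. -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```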

This presentation from the recent Gluecon describes the data platform at FullContact. The platform has moved from a batch-only system to also have a real-time component powered by Apache Kafka and Apache Crunch. Given that they already had implementations of their algorithms in Crunch, they are using Crunch's in-memory runner to process data in micro-batch directly from Kafka.


This post provides an overview of themes from the recent Strata + Hadoop World London and the BigAnalytics Israel conferences. Real-time and stream processing seem to be big topics, as does a longing for Hadoop to be more featureful.

Apache Drill 1.0 and Apache Hive 1.2, which are just two of the many SQL-on-Hadoop engines, were both released this week (more details on those below). ZDNet notes that Drill is fundamentally different from most other systems (it doesn't require a metastore and aims to run on any data), and it discusses some of the recently added Hive features (from the Stinger initiative) as well as the Hive-on-Spark project.

Strata + Hadoop World is taking place in Singapore on December 1-3. The call for proposals is open until June 18th.

Cloudera is moving much of Impala development into the open, both to keep the community informed of progress and to make it easier for developers to contribute to the project. There's more information on the initiative, as well as a link to a Docker image containing the development environment, in a post on the Cloudera blog.


Apache Drill version 1.0 was released this week. The Drill team calls this release production-ready, and it addresses over 200 issues since the last release. Highlights of the 1.0 release include improvements to stability and performance, improved JDBC compatibility, and lots of improvements to documentation. See the Drill blog for more details.

Scalding 0.14.0 was released. The new version includes a new local mode, improvements to the typed API (a TypedPipeDiff and exposing the make() method to produce a store), and a fix to skewJoinWithSmaller.

Cascading 2.7, which is the last minor release planned before Cascading 3.0, was released this week. It includes several new features and fixes, such as better support for small files and capturing more details of a failure in a Trap.

The Twitter engineering blog has a post on Apache Parquet, which discusses the graduation of the project from the Apache incubator and the recent 1.7.0 release. Among the features of the release are: a new filter API for Java and Scala, a memory manager to help avoid out of memory issues, improved support for evolving schemas, and improved interoperability with Hive and Avro. There's also discussion of future work, like support for zero-copy reads in Hadoop and a vectorized read API.

Apache Hive 1.2.0 was released this week. It includes SQL enhancements (UNION DISTINCT and INSERT with a column list), performance improvements (predicate pushdown enhancements, caching of stats in HiveServer2, count distinct speedups, and more), security improvements, and improvements to usability.
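Two of the SQL additions, sketched with made-up table names: UNION without ALL now deduplicates per the SQL standard, and INSERT can name a subset of columns.

```sql
-- UNION without ALL deduplicates the combined result:
SELECT id FROM t1 UNION SELECT id FROM t2;

-- INSERT can name a subset of columns; unnamed columns are filled with NULL:
INSERT INTO TABLE users (id, name) VALUES (1, 'alice');
```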

Keystone ML is a machine learning framework from the UC Berkeley AMPLab. It is built on Apache Spark and resembles Spark's ML pipeline APIs, with some additional features such as support for processing images, text, and speech. The library is written in Scala.

Apache Accumulo 1.7.0 was released. Accumulo adheres to semantic versioning, so 1.7.0 aims to be source-compatible with other 1.x releases, although there are a number of updated requirements (Java 7, Hadoop 2.2.0+, and ZooKeeper 3.4.x+). The key features of the release are client authentication via Kerberos, data-center replication, user-initiated compaction strategies, support for distributed tracing, and performance improvements. There are a lot more updates in the full release notes.

Apache NiFi 0.1.0-incubating was released. The release contains a number of bug fixes and improvements, and it has new features to support Amazon S3, PGP/GPG encryption, and more.

Version 0.80-incubating of Apache Slider was released. This version supports Docker for application packaging, supports Accumulo, adds reconfigure support, and much more.

dplyr-spark is a new project to provide a Spark backend for dplyr. The package is in beta and can be installed from source.


Curated by Datadog



Apache Drill v1.0, Apache Kylin & More (Los Gatos) - Tuesday, May 26

Spark + Cassandra: Working Together for Good (Sacramento) - Wednesday, May 27

Fast Big Data Analytics with Spark on Tachyon in Baidu (San Jose) - Thursday, May 28


Intro to MLlib (Boulder) - Wednesday, May 27


PySpark, IPython Notebook, and SparkSQL as an Environment for Data Science (Saint Paul) - Thursday, May 28


Apache Hadoop Security, Today and Tomorrow (King of Prussia) - Wednesday, May 27


HBase Cache and Read Performance (Boston) - Thursday, May 28


Flink Meetup: Juggling Bits & Bytes, plus Flink Troubleshooting (Berlin) - Wednesday, May 27


Apache Spark Streaming with Elasticsearch in Privredna Banka Zagreb (Zagreb) - Wednesday, May 27


Spark Integration and Velocity (Istanbul) - Wednesday, May 27


Data Plumbing Blues: 3 Real-Life Examples (Tel Aviv-Yafo) - Wednesday, May 27


Introduction to Spark SQL (Bangalore) - Saturday, May 30

Designing Big Data Systems: Spark (Bangalore) - Saturday, May 30