Data Eng Weekly

Hadoop Weekly Issue #151

03 January 2016

For the first issue of 2016, there's great content about Hive, Google's Sawzall, Kafka, Amazon EMR, and more. And there were quite a number of releases—Samza, Knox, and Kylin to name a few. All in all, lots to catch up on to kick off the new year.


The Unofficial Google Data Science blog has an article describing the migration from Sawzall to Go for data processing at Google. The post describes some of the auditing and access control features of Sawzall that ended up causing problems, which motivated the transition to a new solution. Given that Google tends to be at least a few years ahead of the open-source community, there are plenty of lessons to be learned from this post.

The AcadGild blog has recently published several highly relevant posts on big data systems. The content includes a series on Apache Hive, covering file formats (text, sequence, RC, ORC, and Avro), row-level ACID transactions (introduced in Hive 0.14), and indexing (when and why to use it, the compact and bitmap index types, and how to create indexes).

Big Data & Brews has a three-part interview with Monte Zweben, CEO of Splice Machine. For each part, there's both a video and a transcript. Topics covered include HBase, Spark, and how Splice Machine's customers are using the Splice Machine Hadoop-based RDBMS.

LinkedIn is a source of inspiration for building world-class data infrastructure, both because they've built and open-sourced several important projects and because of the transparency with which they write about their data systems. This post highlights the evolution of the LinkedIn data pipeline by extracting key images/diagrams (including descriptions) from several articles and presentations.

The Databricks blog has highlighted the top 10 posts (by page views) of the many articles that they published about Spark over the past year.

Until recently, it was difficult to access data in S3 from Amazon EMR on a private subnet in a VPC. But with the recent introduction of the managed NAT gateway, this has become an easier task. Yet, there are still a few challenges (such as setting up access to the cluster through the master node) that are covered in this post.

When using Kafka with Hadoop, Camus is one of the popular ways to load data into HDFS (or another Hadoop FileSystem). This post describes how Camus extracts timestamps from messages, and it discusses the several ways that timestamps are used by Camus during processing.


Datanami has an article describing the recently unveiled TPC-DS 2.0 benchmark. There are a number of changes to make the benchmark more applicable to big data systems. These include removal of the ACID requirement, removal of the primary-key/foreign-key constraints, elimination of trickle updates, and an update to how the final score is calculated.

The O'Reilly Data Show Podcast has two new episodes about Apache Spark. The first is a discussion about the future of Spark and SparkR's APIs, and the second is a fireside chat with Ben Horowitz from Spark Summit along with a discussion about the rise of Apache Spark in China.


Version 0.7.0 of Apache Knox Gateway, the REST API Gateway for Hadoop, was released. This version improves webapp support by implementing X-Forwarded headers and CORS support. It also includes a large number of improvements (including performance improvements) and bug fixes.

Apache NiFi, the data processing and distribution system, released version 0.4.1. The new release addresses bugs and includes a few minor improvements.

Apache Chukwa, the log-based monitoring and analysis system, released version 0.7.0 this week. This release has a new dashboard design, a new Parquet file format, and new HBase and Solr support.

Oracle GoldenGate for Big Data was released this week. The system provides a mechanism to capture database transactions and store them in big data systems. The new release adds support for Apache Kafka, security enhancements, the ability to partition data in Hive tables, and much more.

Apache Samza 0.10.0 was released. Samza is a distributed stream processing framework built on Apache Kafka and Apache Hadoop YARN. The Apache blog has a recap of the release, including highlights (dynamic configuration, host-affinity, HDFS producer, ElasticSearch producer), known issues, and plans for future releases.

Apache Kylin, the open-source OLAP analytics engine for Hadoop, released version 1.2. The release adds support for Excel, Power BI, and Tableau 9.1, improves small file management on HDFS, and includes a number of bug fixes. The Kylin blog has two posts about the new features in the release.

reactive-kafka is a Reactive Streams wrapper for Apache Kafka. A new release adds support for Apache Kafka and Akka Streams 2.0.

Apache Atlas 0.6-incubating was released this week. The new release of the data governance and metadata framework resolves over 100 tickets.

jruby-kafka is a JRuby wrapper for the Kafka producer and high-level consumer APIs. Versions 1.5.0 and 2.0 were released this week.


Curated by Datadog



Kafka for DBAs + More (Palo Alto) - Tuesday, January 5


Spark and IoT (Golden Valley) - Thursday, January 7


Sparking Data in the Cloud (McLean) - Wednesday, January 6

District of Columbia

Moving from Microsoft SQL to Hive (Washington) - Thursday, January 7


Spark Meetup and LinkedIn's Pinot (Montreal) - Tuesday, January 5


Apache Flink: 3 Real-World Use Cases (Paris) - Thursday, January 7


Learn to Stream with Spark at AppsFlyer (Herzeliyya) - Monday, January 4

Surviving Black Friday & Turning Behavioural Signals Into User Profiles (Tel Aviv-Yafo) - Monday, January 4


Writing YARN Applications & Understand Partitioning in Apex (Pune) - Wednesday, January 6

Introduction to Apache Flink (Bangalore) - Saturday, January 9


Recruit Technologies Open Lab #2: Spark (Tokyo) - Monday, January 4