Data Eng Weekly

Hadoop Weekly Issue #150

20 December 2015

Lots of folks were working hard to ship new projects and releases before the end-of-year holiday: Apache NiFi, Apache Drill, Apache Hadoop, and many more had releases this week. Similarly, there are technical articles about features of several new releases (NiFi, Kafka, and Yetus). In news, there's information about several big data conferences, and another year-in-review article, this one on Flink. All in all, lots of great content for the last issue of the year (I'll be taking a break next week).


TitanDB is a distributed graph database with pluggable storage backends. AWS has built a backend for the Amazon DynamoDB service. This post describes how to use Titan and DynamoDB to perform shortest path queries and presents a performance comparison across different storage models. It also shows some of the features that are unique to the setup, such as CloudWatch metrics for performance monitoring.

Apache NiFi 0.4.0 (more about the release below) adds support for interfacing with Syslog and HBase. This tutorial shows how to configure NiFi with an HBaseClient, a ListenSyslog source, and an HBase writer to store syslog JSON data in HBase. This is a nice tutorial of a real-world use case for NiFi, including suggestions for improving performance in order to productionize the setup.

The Cloudera blog has a post on some recent improvements to the performance of Hadoop's DistCp utility. The post describes how DistCp (a distributed copy within or across HDFS clusters) works, how performance is improved with HDFS snapshots, and how a new method of computing the list of files to copy can improve setup time. If you've worked with DistCp on a non-trivial HDFS cluster, these performance improvements are likely very welcome.

The Confluent blog has a good introduction to Kafka Connect, the new framework for loading data into and out of Kafka. It shows how to use the JDBC driver to load row-level changes into Kafka without writing any custom code. From there, data is loaded into HDFS and made available via Hive. The post also discusses some of the advanced features, such as schema migration and partitioning.

The Altiscale blog has a nice introduction to Apache Yetus, which provides automation for testing patches, producing API documentation, and generating release notes for software projects. Yetus was originally part of the Hadoop project and is used by several other ecosystem projects.


The ASF's Travel Assistance Committee announced that they're accepting applications for ApacheCon North America, which takes place in Vancouver in May.

The community-selected talks for the upcoming Hadoop Summit Dublin have been announced. The conference takes place 13-14 April 2016.

The Big Data Technology Summit 2016 will be held in Warsaw, Poland on 24-25 February 2016.

Metron is a new Apache incubator project that came out of the OpenSOC security analytics project. It includes tools for full-packet capture, stream and batch analysis, and much more.

The Apache Flink blog has a post recapping the year of Flink and plans for 2016. Looking back, it highlights community growth, the first Flink conference—Flink Forward, several of the articles about Flink, and major features added throughout the year. Looking forward to next year, planned features include runtime scaling of streaming jobs, SQL queries for static data, security improvements, support for Apache Mesos, and more.

Historically, Hadoop has targeted commodity hardware with local disks. More recently, separating compute and storage has gained momentum (particularly in the cloud, but also in the data center). The BlueData blog elaborates on that position in a post that also discusses virtualized Hadoop, several relevant recent studies, and more.


Apache NiFi 0.4.0 was released this week. NiFi is a system for processing and distributing data with a web-based user interface. The new version includes support for LDAP authentication, usability improvements, and support for several new systems, including SFTP, HBase, Azure Event Hub, Couchbase, and more.

Cloudera has released a new Spark library for analyzing time series data sets. An introductory post describes some key concepts of time series data and provides a brief introduction to the new library. Full documentation and the code (Spark-TS has Python and Scala bindings) are available on GitHub.

Apache Drill 1.4.0 was released this week. It includes a number of bug fixes and improvements.

The Apache Myriad (incubating) project has released version 0.1. Myriad is an Apache Mesos framework for running Apache Hadoop YARN on Mesos. The Mesosphere blog has more details on Myriad, including a presentation that provides an introduction to the system.

IBM has released version 4.1 fix pack 2 of their distribution. This release adds support for SUSE Linux and Spark 1.5.1, along with improvements to the Big SQL and Text Analytics products.

If you've administered Hadoop, you've probably gone through the task of freeing up space on HDFS. The HdfsUtils project provides command-line tools to help simplify these tasks. hdfind is similar to the Unix find utility, and hdls is an improved version of hadoop fs -ls. HdfsUtils is written in Ruby and uses the WebHDFS REST APIs.
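
For a rough idea of what a tool built on WebHDFS does under the hood, here is a minimal Python sketch (not HdfsUtils code, which is Ruby). It builds a LISTSTATUS request URL and picks the largest files out of the JSON response. The host name, port, path, and sample response are made up for illustration, and the helper names liststatus_url and largest_files are hypothetical.

```python
import json
from urllib.parse import quote, urlencode


def liststatus_url(host, port, path):
    """Build a WebHDFS LISTSTATUS URL for an HDFS directory."""
    query = urlencode({"op": "LISTSTATUS"})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def largest_files(liststatus_json, top=3):
    """Return (name, bytes) pairs for the largest files in a LISTSTATUS response."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    # Directories report a length of 0, so only plain files are considered.
    files = [(s["pathSuffix"], s["length"]) for s in statuses if s["type"] == "FILE"]
    return sorted(files, key=lambda f: f[1], reverse=True)[:top]


# Hand-written sample in the shape of a LISTSTATUS response (assumed data,
# trimmed to the fields used here).
SAMPLE = json.dumps({
    "FileStatuses": {
        "FileStatus": [
            {"pathSuffix": "a.log", "type": "FILE", "length": 100},
            {"pathSuffix": "b.log", "type": "FILE", "length": 300},
            {"pathSuffix": "archive", "type": "DIRECTORY", "length": 0},
            {"pathSuffix": "c.log", "type": "FILE", "length": 50},
        ]
    }
})

if __name__ == "__main__":
    print(liststatus_url("namenode.example.com", 50070, "/user/logs"))
    print(largest_files(SAMPLE, top=2))
```

In a real cluster the response would come from an HTTP GET against the NameNode rather than a hard-coded string, and authentication would depend on how the cluster is configured.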

The second beta release of Cloudera's RecordService, the role-based access control system, was announced. The new version integrates with Apache Sentry (incubating) to enforce column-level privileges. A post on the Cloudera blog shows how to integrate RecordService with MapReduce and Spark, including the new column-level security.

Hortonworks has released version 1.1 of Hortonworks DataFlow, the data collection and management platform. The new release includes enhanced security, support for additional data sources/sinks, and more.

Qubole has added support for Apache Spark for the Qubole Data Service on Google Cloud Platform.

Streamliner is a new open-source tool from the MemSQL team for performing real-time ETL. It's built with Apache Spark Streaming and integrates with Apache Kafka and MemSQL. ETLs are defined/built with a UI, and Streamliner has an integration for automatically converting Thrift-encoded data to rows in a MemSQL database.

Cloudera released version 5.4.9 of Cloudera Enterprise. It includes a number of fixes across CDH, Cloudera Manager, and Cloudera Navigator.

Apache Hadoop 2.6.3, the latest bug-fix release in the 2.6.x branch, was released this week. It resolves 35 issues.


Curated by Datadog


Hadoop Ecosystem (Ankara) - Wednesday, December 23

Spark & Docker (Istanbul) - Sunday, December 27


Spark @ Hsinchu First Meetup (Hsinchu) - Wednesday, December 23


More Than Slideware: An Insider’s Perspective on Real-World Big Data Use Cases (Walsh Bay) - Monday, December 21