Data Eng Weekly

Hadoop Weekly Issue #188

25 September 2016

Lots of releases this week: CouchDB, Accumulo, Kylin, and Osso (a new OSS project from Rocana), but most notably, Apache Kudu hit version 1.0. There's a bit less technical content and general news than usual, but that's to be expected. With Strata + Hadoop World taking place this week in NYC, get ready for tons of news in the next issue.


The Cloudera blog has a post on the recently released Apache Hadoop 3.0.0-alpha1. It describes several of the features of the release, including HDFS erasure coding, v.2 of the YARN Timeline Service, and the shell script rewrite.

MapR has posted a whiteboard walkthrough on how Apache Flink handles event time for stream processing. In addition to the video, there's a transcript of the presentation.

This post is a great walkthrough of Apache Drill. It covers a bunch of topics, including: quoting reserved keywords, interpreting/fixing JSON parse errors, use of subqueries, conveniences for querying CSV, a basic overview of Drill's web interface, plugin configuration, querying an RDBMS, and analyzing a query plan.

Cloudera has published a post comparing Apache Impala and Amazon Redshift. There's an overview of key differences, but the main focus is a performance and cost comparison. As always, these results shouldn't be viewed as necessarily representative (each dataset is different). With that said, using a TPC-DS-derived workload, they show that Impala can often beat Redshift in cost and performance.

The StreamSets blog has a post arguing that Apache Kudu's support for efficient real-time access and atomic updates provides an alternative to the lambda architecture.

This post describes some of the challenges of moving a data science research project into a production data pipeline. The author argues that it's important for developers and data scientists to work together to integrate quickly.


IBM Power systems are getting support for Apache Hadoop through an IBM partnership with Hortonworks.

dataArtisans have announced the dA Platform, which is a distribution of Apache Flink with enterprise support.

Oracle and Qubole announced a partnership to bring the Qubole big data as a service offering to the Oracle Cloud Platform.

Omid is a transaction manager for Apache HBase that was recently accepted into the Apache Incubator after a proposal from Yahoo. It provides snapshot isolation guarantees and can be used in high-performance environments (supporting over 100k transactions/second).


Rocana has open sourced Osso, which is a new semi-structured event format. Built on Avro, the standard is meant to be easy, intuitive, efficient, and complementary to existing solutions.

The Google Cloud Platform blog has highlighted three integrations related to Kafka. The Google Cloud Pub/Sub connectors offer a mechanism for moving data between Pub/Sub and Kafka, the KafkaIO connector for Apache Beam allows Beam systems to consume from Kafka, and the Kafka to BigQuery connector can be used to mirror data to BigQuery.

Version 2.0 of Apache CouchDB was released this week. Highlights of the release include new clustering, a new querying language, and a rewritten admin interface.

Apache Kudu announced version 1.0 this week. The release includes support for HA Kudu Master, a rewritten Apache Spark integration, an official client library for Python, and more. To mark the occasion, the Cloudera blog has an overview of the history of the project and a look at its future.

Apache Accumulo 1.6.6 includes a data loss fix, a fix for DataNode decommission, dependency upgrades, and more.

Amazon EMR now supports security configurations to enable encryption for data at rest and in transit. The post has an example of configuring the encryption providers.

Version 1.5.4 of Apache Kylin, the OLAP engine for Hadoop, was released.

Amazon Web Services has open-sourced the Amazon EMR-DynamoDB connector.


Curated by Datadog



Apache Spark Meetup (San Francisco) - Tuesday, September 27

Azure 101: Hadoop on Cloud (Mountain View) - Wednesday, September 28


Scaling Recommenders + Content Embeddings at Facebook (Seattle) - Wednesday, September 28


Apache Nifi (Lafayette) - Monday, September 26


Hadoop Security and Governance with Apache Ranger and Apache Atlas (Manhattan) - Wednesday, September 28


Big Data & Data Science Workshop Using Apache Spark (Houston) - Monday, September 26


Diving Into Big Data Technologies: Hadoop, Hive, and Apache NiFi (Atlanta) - Thursday, September 29

District of Columbia

“Data Analytics with Hadoop” Book Release Celebration (Washington) - Monday, September 26

New York

HBaseCon East 2016 (New York) - Monday, September 26

Intro to Apache Kudu: Fast Analytics on Fast Data (New York) - Tuesday, September 27

The Stream Processor as a Database (New York) - Wednesday, September 28


Let’s Get Started with Hadoop #9 (Oslo) - Thursday, September 29


Criteo Labs Tech Talks Session 3 (Paris) - Wednesday, September 28


Introduction to Apache Flink (Amsterdam) - Thursday, September 29


Data Engineering on AWS by Thorsten Greiner (Dusseldorf) - Thursday, September 29


Hands-On Introduction to Apache Spark & Apache Zeppelin (Gdansk) - Wednesday, September 28


Practical Distributed Stream Processing with Akka Streams (Tel Aviv-Yafo) - Tuesday, September 27


Discuss Key Emerging Big Data Technologies (Bangalore) - Thursday, September 29

Introduction to Hadoop, Yarn, HDFS Students Only - Friday, September 30