Data Eng Weekly

Hadoop Weekly Issue #175

19 June 2016

Hadoop Summit is a little over a week away, and we're already seeing a number of product announcements timed to coincide with the event. On the technical side, there are great posts on Hadoop Kerberos authentication and Avro at Salsify. And in terms of releases, there were several announcements including a new open-source columnar database from Yandex.


The OpenCore blog has an article demonstrating a number of debugging tools for Hadoop's Kerberos authentication. Specifically, it shows how to use the main() method of UserGroupInformation to dump useful debug information.
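As a rough sketch of this kind of debugging session (the `hadoop CLASSNAME` invocation runs an arbitrary class on the Hadoop classpath, and `sun.security.krb5.debug` is a standard JVM Kerberos switch; consult the post for the full set of tools):

```shell
# Turn on verbose JVM Kerberos logging
export HADOOP_OPTS="-Dsun.security.krb5.debug=true"

# Run UserGroupInformation's main() method; it performs a login and
# prints the resulting user, authentication method, and group memberships
hadoop org.apache.hadoop.security.UserGroupInformation
```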

In part four of a series on YARN, the Cloudera blog looks at how to configure a Fair Scheduler queue. Specifically, the post describes settings for resource constraints, queue placement policies, and preemption.
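For reference, a minimal fair-scheduler.xml touching each of those areas might look like the following (the queue name and all values here are invented for illustration; see the post for guidance on choosing them):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- A queue with resource constraints and a scheduling weight -->
  <queue name="analytics">
    <minResources>10000 mb,5 vcores</minResources>
    <maxResources>40000 mb,20 vcores</maxResources>
    <weight>2.0</weight>
  </queue>

  <!-- Placement policy: use the queue named at submission time,
       else an existing per-user queue, else fall back to default -->
  <queuePlacementPolicy>
    <rule name="specified"/>
    <rule name="user" create="false"/>
    <rule name="default"/>
  </queuePlacementPolicy>

  <!-- Preemption: reclaim resources from queues over their fair share -->
  <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
  <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>
</allocations>
```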

Salsify has built an asynchronous microservices architecture on Apache Kafka, with Apache Avro for data serialization. Their application is written in Ruby, and they've created several new tools to make Avro easier to work with in that language. This post describes the tools and their value: avro-builder for defining records, a Postgres-based schema registry, and avromatic for generating models from Avro schemas.
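For context, Avro schemas are JSON documents; a record definition of the kind avro-builder generates looks roughly like this (the record and field names here are invented):

```json
{
  "type": "record",
  "name": "Product",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": ["null", "string"], "default": null}
  ]
}
```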

Apache Drill can infer schemas on the fly and also supports data with multiple (but compatible) schemas. This combination enables some interesting use cases, such as querying across multiple JSON files with differing schemas. A post on the MapR blog explores these features and includes several examples.
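As a sketch of the idea (the file paths and field names are hypothetical): given one JSON file with `name` and `age` fields and a second that also includes `email`, Drill infers the schema per record, and fields absent from a given file simply come back as NULL:

```sql
SELECT t.name, t.age, t.email
FROM dfs.`/data/users/*.json` t
WHERE t.age > 30;
```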

This tutorial shows how to use Apache Kafka with Druid to build a streaming analytics application, with visualization via Pivot, a web UI for Druid.

The Apache Beam (incubating) blog has a post describing some of the work that was done to make the Beam connector for Apache Flink work with Flink's batch runner. Beam is the open-source SDK, originally from Google, that exposes a backend-agnostic data pipeline API.

Cask Hydrator is a tool for building data pipelines using a drag and drop UI. This tutorial shows how to use Hydrator to export data from MySQL into HDFS.

Databricks has a post on the new SQL subquery support in the upcoming Apache Spark 2.0 release. Interestingly, the post is written as a notebook, which is a straightforward way to present code and example data.
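Spark's new subquery support follows standard SQL semantics. As a rough illustration of what scalar and IN-predicate subqueries do (using Python's built-in sqlite3 rather than Spark, with invented toy data):

```python
import sqlite3

# In-memory database with a toy sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("east", 30), ("west", 50), ("west", 70)])

# Scalar subquery: compare each row against a single aggregated value
above_avg = conn.execute("""
    SELECT region, amount FROM sales
    WHERE amount > (SELECT AVG(amount) FROM sales)
    ORDER BY amount
""").fetchall()
print(above_avg)  # rows with amount above the overall average of 40

# IN-predicate subquery: filter by a set produced from another query
east_rows = conn.execute("""
    SELECT amount FROM sales
    WHERE region IN (SELECT region FROM sales WHERE amount < 20)
    ORDER BY amount
""").fetchall()
print(east_rows)  # amounts from regions that have at least one sale under 20
```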

The Apache Kudu (incubating) blog has a post on their use of Raft in single-node clusters, which allows for dynamic scaling to multi-master clusters.


This article points out that the Apache Spark community, if not careful, could suffer the same type of fragmentation that has caused confusion in the Apache Hadoop ecosystem. For example, the latest versions of CDH and HDP support different versions of Spark.

The New Stack has an article on Concord, which is a new stream processing framework (in public beta) built on Apache Mesos. Concord is written in C++ and supports dynamic topologies (scaling up/down parts of the pipeline without downtime).

On the heels of announcing general availability of Databricks Community Edition, Databricks has announced the first in a series of tutorials on writing Apache Spark applications using Databricks.

Hadoop Summit San Jose, which takes place in a few weeks, will feature a Women in Big Data Lunch and Panel. The Hortonworks blog features an interview with the panel moderator and Hortonworks CMO Ingrid Burton.


Apache SystemML (incubating) recently released version 0.10.0. SystemML is a machine learning framework with support for multiple backends, including Apache Spark and Apache Hadoop. This release includes new Spark Matrix Block types, support for deep learning, several performance enhancements, a new KNN algorithm, and much more.

Apache Mahout, another machine learning framework, has released version 0.12.2. This release makes some improvements towards the goal of integrating with Apache Zeppelin for visualization and notebook support.

Qubole has announced that their HBase-as-a-Service offering is now generally available on AWS. The offering has a number of nice features for a long-running cluster, supports Hannibal and other monitoring tools, integrates with Apache Zeppelin, and can be configured with OpenTSDB and Apache Phoenix through node bootstrap actions.

Altiscale has announced Altiscale Insight Cloud Real-Time Edition. The system is backed by Apache HBase and Spark Streaming.

hs2client is a new C++ library for Apache Hive and Apache Impala (incubating). In addition to C++, the library has Python bindings with support for reading results into pandas DataFrames.

MapR has announced an Apache Spark 2.0 Developer Preview for their distribution.

Apache Beam has announced the 0.1.0-incubating release, the project's first since joining the Apache Incubator.

Yandex has open sourced ClickHouse, a columnar analytics database. The system is built for both horizontal and vertical scaling. It supports complex data types (e.g., arrays) and can perform approximate queries. The team has also published benchmark results comparing it to several other databases.
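To give a flavor of those features (the table and column names below are hypothetical; `uniq` and `quantile` are ClickHouse's approximate aggregate functions, and `arrayJoin` expands an array column into rows):

```sql
-- Approximate distinct count and quantile over a hypothetical events table
SELECT uniq(user_id) AS approx_users,
       quantile(0.95)(latency_ms) AS p95_latency
FROM hits;

-- Expanding an Array(String) column row-wise
SELECT arrayJoin(tags) AS tag, count()
FROM hits
GROUP BY tag;
```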


Curated by Datadog



California

Apache Bigtop & Apache Apex (San Jose) - Tuesday, June 21

Options for Ingest: Elasticsearch Ingest Node and Apache Airflow (Mountain View) - Wednesday, June 22

Open House: Big Data Processing with Apache Spark (San Francisco) - Thursday, June 23


Washington

Ted Dunning on Anomaly Detection (Redmond) - Thursday, June 23


Colorado

Big Data at Twitter Scale (Boulder) - Thursday, June 23


Texas

June Tech Talk: PySpark (Austin) - Monday, June 20

Using Splunk for Your Big Data Along with Hadoop (Carrollton) - Tuesday, June 21

Invited Speaker Series: Cody Koeninger on Fundamentals of Spark and Kafka (Austin) - Thursday, June 23


Illinois

Top 5 Mistakes When Writing Spark Applications (Chicago) - Monday, June 20

North Carolina

Apache Metron Overview and Codelab: Building the next Generation Cyber Security (Durham) - Thursday, June 23


Georgia

High Concurrency, Low Latency Reporting Within Hadoop (Roswell) - Wednesday, June 22


Virginia

Introduction to Spark In-Memory Computing (McLean) - Wednesday, June 22

Overview of Open Source Fast Data Platforms and Future Plans (Reston) - Wednesday, June 22

Interactive Data Analytics with Flink and Zeppelin (Reston) - Thursday, June 23

New York

Distributed ML in Spark (New York) - Friday, June 24


Massachusetts

Solving Analytics Problems in the Cloud w/ Spark, Presto, Hive (Boston) - Tuesday, June 21


United Kingdom

Big Data Show and Tell (Reading) - Wednesday, June 22


Germany

How Spark Can Improve Your Hadoop Cluster (Hamburg) - Wednesday, June 22


Czech Republic

Apache Flink Crash Course: Meet the Squirrel (Prague) - Thursday, June 23


India

Third Spark Meetup (Pune) - Wednesday, June 22

Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, June 25