Data Eng Weekly

Hadoop Weekly Issue #229

20 August 2017

Lots of great technical content this week, including coverage of new features in Apache Hive and the new AWS Glue service. There are also a couple of good tutorials, on Kafka Connect and on connecting Apache Superset to Druid, and a great article on how to choose between one large multi-tenant cluster and many small ones.


Apache Hive's new transactional update feature enables a whole new set of use cases. In addition to typical SQL UPDATE, Hive supports a MERGE command that enables upserts, repartitioning, masking of particular columns, and purging records. The Hortonworks blog has some examples covering each.
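To make the upsert and purge use cases concrete, here is a small sketch in Python rather than HiveQL. The `MERGE_SQL` string shows roughly what such a statement looks like in Hive; the table and column names (`customers`, `staged_changes`, `id`, `name`, `deleted`) are made up for illustration and are not taken from the Hortonworks post. `merge_rows` is a hypothetical helper that mimics the matched/not-matched semantics on in-memory dicts.

```python
# Illustrative only: table and column names are assumptions, not from the post.
MERGE_SQL = """
MERGE INTO customers AS t
USING staged_changes AS s
ON t.id = s.id
WHEN MATCHED AND s.deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name)
"""


def merge_rows(target, changes):
    """Mimic MERGE semantics on a dict keyed by id (hypothetical helper).

    target:  {id: name}
    changes: iterable of dicts with keys "id", "name", "deleted";
             assumes at most one change per id.
    """
    result = dict(target)  # leave the caller's data untouched
    for change in changes:
        key = change["id"]
        if key in result:
            if change["deleted"]:
                del result[key]               # WHEN MATCHED AND deleted -> DELETE
            else:
                result[key] = change["name"]  # WHEN MATCHED -> UPDATE
        else:
            result[key] = change["name"]      # WHEN NOT MATCHED -> INSERT
    return result
```

For example, merging an update for id 1, a delete for id 2, and a new row for id 3 into `{1: "ann", 2: "bob"}` yields `{1: "anna", 3: "cy"}`.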

A great introduction to using Apache Kafka Connect for change data capture from an RDBMS (MySQL in this case). The post has instructions and example configs for getting started with this use case.
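As a rough idea of what such a config looks like, the sketch below builds the JSON payload that Kafka Connect's REST API accepts when creating a connector. It assumes the Confluent JDBC source connector, which polls tables with queries rather than reading the MySQL binlog; the post may well use a different connector. The connector name, database URL, table, column, and topic prefix are all placeholders.

```python
import json


def jdbc_source_payload(name, connection_url, table, topic_prefix):
    """Build a create-connector payload for the Kafka Connect REST API.

    Assumes the Confluent JDBC source connector; the values passed in
    below are placeholders, not settings from the post.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": connection_url,
            "table.whitelist": table,
            # Pick up new rows by watching a strictly increasing id column.
            "mode": "incrementing",
            "incrementing.column.name": "id",
            # Each table is written to the topic <topic_prefix><table>.
            "topic.prefix": topic_prefix,
            "tasks.max": "1",
        },
    }


payload = jdbc_source_payload(
    name="mysql-orders-source",
    connection_url="jdbc:mysql://localhost:3306/shop?user=connect&password=secret",
    table="orders",
    topic_prefix="mysql-",
)
print(json.dumps(payload, indent=2))
```

The resulting JSON would be POSTed to the `/connectors` endpoint of a Connect worker. Note that incrementing mode only sees newly inserted rows; capturing updates needs a timestamp column or a log-based connector.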

This post describes eleven different inputs into the technical/business decision of running multiple clusters or a single multi-tenant cluster for your organization (whether it be Hadoop, MySQL, or something else).

Apache TinkerPop is a graph computing framework, and HGraphDB provides an HBase backend for TinkerPop. The Cloudera blog has a post that gives an overview of the HGraphDB database design and provides some examples (with code) for inserting and querying graph data using TinkerPop.

AWS Glue general availability was announced this week (more below). The first post describes how Glue can be used to transform files from JSON to Parquet as part of a data pipeline. The second shows some of the basics of Glue's Python APIs and demonstrates integration with Amazon Athena for querying.

The Qubole blog has a post on optimizing configuration parameters for cloud workloads. It describes how to run an iterative experiment to find optimal job config values. That approach is expensive, though, so they instead collect statistics and build a model from historical runs. The post has some key insights: optimize the instance size, avoid spills, and prefer fat executors for Spark.

This post describes how to set up and use Kerberos constrained delegation, which implements fine-grained delegation (to particular services). For secure Hadoop clusters, the post is a detailed walkthrough of configuring a web app to execute queries on a cluster.

This post describes how to ingest clickstream events into Apache Kafka using Divolte, process them with Druid, and visualize them with Apache Superset (incubating).


Streamlio has come out of stealth and announced their Series A financing and their real-time data processing product, which is built on Apache Pulsar (incubating), Heron, and Apache BookKeeper.


Amazon announced general availability of AWS Glue, which is a serverless ETL product. It uses PySpark under the hood: Glue generates Python code that can be customized. It also has a Hive-compatible metastore.


Curated by Datadog



Double Talks: Data Science & Democratizing Data at Airbnb (San Francisco) - Tuesday, August 22

Bay Area Apache Spark Meetup @ Pinterest (San Francisco) - Tuesday, August 22

Hive User Group Meeting @ Hortonworks (Palo Alto) - Thursday, August 24


Spark + Deep Learning + Big Data Security & Governance (Denver) - Thursday, August 24


Spark Talks: Overview of Spark Streaming (Austin) - Wednesday, August 23

Integrating Real-Time Video Data Streams with Spark and Kafka (Austin) - Thursday, August 24

New York

Graph Technologies NYC (New York) - Tuesday, August 22


August Lightning Talks! (London) - Wednesday, August 23


Doug Cutting, Creator of Hadoop (Sao Paulo) - Monday, August 21
