27 August 2017
Lots of releases this week, including a useful new project out of Pinterest—DoctorKafka. On that note, it's Kafka Summit this week in San Francisco so please send any interesting slides my way. But for this week's issue—great posts on Beam, Hive, Redshift, and more.
Apache Beam is implementing a new API, called Splittable DoFn, that should provide major benefits to users and to those building IO connectors. This post describes the motivation, the API, and the current status of implementation within the Beam codebase.
https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
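To give a feel for the idea, here is a minimal sketch in plain Python, not Beam's actual API. It assumes that an element is processed together with a "restriction" (here a range of offsets), that each offset must be claimed before it is processed, and that the unclaimed remainder can be split off while processing is under way. The names `OffsetRange`, `OffsetTracker`, `process`, and `split_after` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Half-open range [start, stop) of offsets still to be processed."""
    start: int
    stop: int


class OffsetTracker:
    """Hypothetical tracker: guards claims and lets the unclaimed tail be split off."""

    def __init__(self, restriction: OffsetRange) -> None:
        self.restriction = restriction
        self.last_claimed: Optional[int] = None

    def try_claim(self, offset: int) -> bool:
        # An offset may be processed only while it is inside the restriction.
        if offset >= self.restriction.stop:
            return False
        self.last_claimed = offset
        return True

    def try_split(self) -> Optional[OffsetRange]:
        # Halve the unclaimed remainder: keep the first half, hand back the second.
        if self.last_claimed is None:
            next_offset = self.restriction.start
        else:
            next_offset = self.last_claimed + 1
        remaining = self.restriction.stop - next_offset
        if remaining < 2:
            return None
        split_at = next_offset + remaining // 2
        residual = OffsetRange(split_at, self.restriction.stop)
        self.restriction = OffsetRange(self.restriction.start, split_at)
        return residual


def process(
    records: List[str],
    tracker: OffsetTracker,
    split_after: Optional[int] = None,
) -> Tuple[List[str], Optional[OffsetRange]]:
    """Emit records for claimed offsets; split_after simulates a split request."""
    out: List[str] = []
    residual: Optional[OffsetRange] = None
    offset = tracker.restriction.start
    while tracker.try_claim(offset):
        out.append(records[offset])
        if split_after is not None and offset == split_after:
            residual = tracker.try_split()
        offset += 1
    return out, residual


records = list("abcdefgh")
first, residual = process(records, OffsetTracker(OffsetRange(0, 8)), split_after=1)
rest, _ = process(records, OffsetTracker(residual))
```

In this sketch the first call stops early once its restriction shrinks, and the returned residual range can be handed to another worker, so every record is still processed exactly once.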
Access control in a big data setup can be challenging: it requires striking the right balance between open access and strict ownership/access control. The Databricks blog demonstrates a few common patterns for coarse-grained data access in AWS, including full separation of production/prototyping data and collaborative access in which prod data can be read for prototyping.
The Hortonworks blog has a post on how to use the Hive MERGE command to maintain slowly changing dimensions. It has examples for Type 1, Type 2, and Type 3 update strategies.
https://hortonworks.com/blog/update-hive-tables-easy-way-2/
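As a rough illustration of what the first two strategies mean, here is a small sketch in plain Python rather than Hive SQL. It assumes the usual definitions: Type 1 overwrites the old value, while Type 2 closes the current row and appends a new version so that history is kept. The `Version` record, its field names, and both function names are hypothetical and do not come from the post.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current row


def merge_type1(target: Dict[str, str], updates: Dict[str, str]) -> Dict[str, str]:
    """Type 1: overwrite matching keys and insert new ones; no history is kept."""
    merged = dict(target)
    merged.update(updates)
    return merged


def merge_type2(history: List[Version], updates: Dict[str, str], as_of: str) -> List[Version]:
    """Type 2: close the current row when the value changed, then add a new version."""
    result: List[Version] = []
    still_current = set()
    for row in history:
        is_current = row.valid_to is None
        if is_current and row.key in updates and updates[row.key] != row.value:
            result.append(replace(row, valid_to=as_of))  # matched and changed: expire it
        else:
            result.append(row)
            if is_current:
                still_current.add(row.key)
    for key, value in updates.items():
        if key not in still_current:  # brand-new key, or a key whose row was just expired
            result.append(Version(key, value, valid_from=as_of))
    return result


updates = {"c1": "Denver", "c2": "Austin"}
type1 = merge_type1({"c1": "Boston"}, updates)
type2 = merge_type2([Version("c1", "Boston", "2017-01-01")], updates, as_of="2017-08-27")
```

After the Type 2 merge, the Boston row is still present but closed as of the merge date, and the Denver and Austin rows are current. Type 3, which keeps the previous value in an extra column, is not sketched here.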
Amazon Redshift Spectrum is a mechanism for running Redshift queries over data in S3 without loading it into a Redshift cluster. Spectrum supports data stored in S3 as Apache Parquet, and it takes advantage of the predicate push-down and column filtering capabilities of Parquet. This post describes how to query data across Redshift and S3 with Spectrum and some best practices for data loading and query tuning.
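To show why predicate push-down and column filtering reduce the amount of data read, here is a toy sketch in plain Python. It is not Spectrum or Parquet code. It assumes a columnar layout split into row groups, each with a per-column maximum that stands in for the statistics a columnar file can store. `RowGroup`, `scan`, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    columns: Dict[str, List[int]]  # column name -> values for this group

    def max_of(self, column: str) -> int:
        # Stand-in for a stored statistic; a real file would not rescan the data.
        return max(self.columns[column])


def scan(
    groups: List[RowGroup],
    select: List[str],
    filter_column: str,
    greater_than: int,
) -> Tuple[List[Dict[str, int]], int]:
    """Return rows where filter_column > greater_than, plus the number of groups read."""
    rows: List[Dict[str, int]] = []
    groups_read = 0
    for group in groups:
        # Predicate push-down: skip the whole group if no row can match.
        if group.max_of(filter_column) <= greater_than:
            continue
        groups_read += 1
        for i, value in enumerate(group.columns[filter_column]):
            if value > greater_than:
                # Column filtering: only the selected columns are touched.
                rows.append({name: group.columns[name][i] for name in select})
    return rows, groups_read


groups = [
    RowGroup({"year": [2015, 2016], "sales": [10, 20], "note": [1, 2]}),
    RowGroup({"year": [2017, 2017], "sales": [30, 40], "note": [3, 4]}),
]
rows, groups_read = scan(groups, select=["sales"], filter_column="year", greater_than=2016)
```

Here the first row group is skipped entirely and the `note` column is never read, which is the effect the two optimizations are meant to achieve.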
Databricks has compiled a list of videos, blogs, podcasts, and more that cover Apache Spark's Structured Streaming.
The latest version of HUE supports running Apache Sqoop 1 to import data into HDFS and Hive via the UI. This post walks through the steps necessary to get going.
http://gethue.com/importing-data-from-traditional-databases-into-hdfshive-in-just-a-few-clicks/
Apache MADlib, which is a machine learning library for big data SQL engines, has graduated from the Apache incubator.
https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces17
Pinterest runs over 1000 Kafka brokers, and thus has faced scalability challenges when it comes to replacing failed brokers, balancing workloads, and other operational tasks. They've built DoctorKafka for automating many of these tasks and providing insights into the status of the cluster via a web UI. This week, they have open-sourced the project.
Version 2.7.0.0 of the StreamSets data collector was released. Highlights of the release include connectors for Google Cloud, change data capture for SQL Server, a JMS destination, integration with Cloudera Navigator for lineage tracking, and an Amazon S3 executor.
https://streamsets.com/blog/announcing-data-collector-v2-7-0-0/
Version 2.1.0 of Apache Kylin, the OLAP engine for Hadoop, was released. The new release adds support for RDBMS data sources and project-level query authorization, along with over 100 bug fixes and improvements.
Apache Knox 0.13.0 was released and includes a number of new features—Kafka REST API integration, Spark Thriftserver UI support, Apache Atlas Proxying, and many bug fixes/improvements.
The 2.1.0 release of Apache Beam was announced this week. There are new APIs for AmqpIO, CassandraIO, and HCatalogIO, initial support for streaming in the Python DirectRunner, and a number of fixes/improvements.
Apache HBase 1.1.12 is a bug fix release containing 10 fixes, including fixes for a few correctness issues.
Version 2.3.0 of Apache KafkaStream, which is a port of the Java kafka-streams for node.js, has been released. The project is still considered to be under active development and not ready for production use.
https://www.npmjs.com/package/kafka-streams
Curated by Datadog ( http://www.datadog.com )
Apache Kafka Summit Meetup (San Francisco) - Tuesday, August 29
https://www.meetup.com/Reactive-Systems/events/242744831/
Spark Structured Streaming: Introduction and Internals (Bellevue) - Wednesday, August 30
https://www.meetup.com/Seattle-Data-Science-and-Data-Engineering/events/241418432/
Cyber Security & Apache Metron (Vienna) - Wednesday, August 30
https://www.meetup.com/futureofdata-nova/events/242357970/
Streaming with Heron, Mesos, and Aurora (Halethorpe) - Wednesday, August 30
https://www.meetup.com/Data-Science-MD/events/241712099/
IRELAND
The SMACK Stack (Dublin) - Wednesday, August 30
https://www.meetup.com/Dublin-Apache-Kafka-Meetup-by-Confluent/events/242246982/
Hadoop 3.0: Revolution or Evolution? (Berlin) - Thursday, August 31
https://www.meetup.com/codecentric-Berlin/events/241641520/
Automated Native Spark Modelling in a Managed Hadoop-as-a-Service Environment (Zurich) - Monday, August 28
https://www.meetup.com/spark-zurich/events/242330883/
Spark v2.2 Workshop (Bucharest) - Friday, September 1
https://www.meetup.com/The-Bucharest-Agile-Software-Meetup-Group/events/242350581/
Testing Spark and Scala (Bangalore) - Saturday, September 2
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/242214905/
If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com