Hadoop Weekly Issue #230

27 August 2017

Lots of releases this week, including a useful new project out of Pinterest: DoctorKafka. On that note, Kafka Summit is this week in San Francisco, so please send any interesting slides my way. This week's issue also has great posts on Beam, Hive, Redshift, and more.

Technical

Apache Beam is implementing a new API, called Splittable DoFn, that should provide major benefits to users and to those building IO connectors. This post describes the motivation, the API, and the current status of implementation within the Beam codebase.

https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
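For readers new to the idea, here is a toy sketch in plain Python of what "splittable" means: the work for one element is described by a restriction (here, an offset range) that can be split into pieces and processed independently. It does not use the Beam SDK, and the names `OffsetRange`, `split`, and `process` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical restriction: the half-open range [start, stop) of work."""
    start: int
    stop: int

    def split(self, chunk: int) -> List["OffsetRange"]:
        """Break the range into sub-ranges of at most `chunk` offsets."""
        return [
            OffsetRange(lo, min(lo + chunk, self.stop))
            for lo in range(self.start, self.stop, chunk)
        ]


def process(element: str, restriction: OffsetRange) -> Iterator[str]:
    """Emit only the part of `element` covered by `restriction`."""
    for i in range(restriction.start, restriction.stop):
        yield element[i]


record = "abcdefghij"
pieces = OffsetRange(0, len(record)).split(chunk=4)
# Each piece could be handed to a different worker; here they run in order.
outputs = ["".join(process(record, piece)) for piece in pieces]
```

Joining `outputs` gives back the original record, which is the property that makes it safe to process the pieces separately.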

Access control in a big data setup can be challenging: it requires striking the right balance between open access and strict ownership/access control. The Databricks blog demonstrates a few common patterns for coarse-grained data access in AWS, including full separation of production/prototyping data and collaborative access in which prod data can be read for prototyping.

https://databricks.com/blog/2017/08/23/best-practices-for-coarse-grained-data-security-in-databricks.html

The Hortonworks blog has a post on how to use the Hive MERGE command to maintain slowly changing dimensions. It has examples for Type 1, Type 2, and Type 3 update strategies.

https://hortonworks.com/blog/update-hive-tables-easy-way-2/
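As a rough illustration of how the Type 1 and Type 2 strategies differ, the sketch below applies both to in-memory Python rows rather than Hive tables. The column names (`key`, `city`, `valid_from`, `valid_to`) and the helper functions are hypothetical and are not taken from the linked post. Type 3, which keeps the previous value in an extra column, is left out for brevity.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def merge_type1(dim: List[Row], updates: List[Row]) -> List[Row]:
    """Type 1: overwrite the attribute in place, keeping no history."""
    by_key = {row["key"]: dict(row) for row in dim}
    for upd in updates:
        by_key[upd["key"]] = {"key": upd["key"], "city": upd["city"]}
    return list(by_key.values())


def merge_type2(dim: List[Row], updates: List[Row], today: str) -> List[Row]:
    """Type 2: close the current row and append a new versioned row."""
    result = [dict(row) for row in dim]
    for upd in updates:
        for row in result:
            is_current = row["key"] == upd["key"] and row["valid_to"] is None
            if is_current and row["city"] != upd["city"]:
                row["valid_to"] = today  # expire the old version
        has_current = any(
            r["key"] == upd["key"] and r["valid_to"] is None for r in result
        )
        if not has_current:
            result.append({"key": upd["key"], "city": upd["city"],
                           "valid_from": today, "valid_to": None})
    return result


updates = [{"key": "c1", "city": "Austin"}]
type1 = merge_type1([{"key": "c1", "city": "Boston"}], updates)
type2 = merge_type2(
    [{"key": "c1", "city": "Boston", "valid_from": "2017-01-01", "valid_to": None}],
    updates,
    today="2017-08-27",
)
```

After the merge, `type1` holds a single overwritten row, while `type2` holds the expired Boston row plus a new current Austin row.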

Amazon Redshift Spectrum is a mechanism for running Redshift queries over data in S3 without loading it into a Redshift cluster. Spectrum supports data stored in S3 as Apache Parquet, and it takes advantage of the predicate push-down and column filtering capabilities of Parquet. This post describes how to query data across Redshift and S3 with Spectrum and some best practices for data loading and query tuning.

https://aws.amazon.com/blogs/big-data/from-data-lake-to-data-warehouse-enhancing-customer-360-with-amazon-redshift-spectrum/
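To give a feel for why predicate push-down and column filtering help, here is a toy Python model of a columnar scan. It is only a sketch: it does not use Parquet or Spectrum, and the data layout, the `scan` function, and the column names are made up for illustration.

```python
from typing import Dict, List, Tuple

# A toy columnar layout: each row group stores its columns separately.
# Real columnar formats keep min/max statistics per row group; this sketch
# simply computes the max on the fly.
RowGroup = Dict[str, List[int]]

row_groups: List[RowGroup] = [
    {"year": [2015, 2015, 2016], "sales": [10, 20, 30], "cost": [1, 2, 3]},
    {"year": [2017, 2017, 2017], "sales": [40, 50, 60], "cost": [4, 5, 6]},
]


def scan(groups: List[RowGroup], column: str, filter_column: str,
         min_value: int) -> Tuple[List[int], int]:
    """Return `column` values where `filter_column` >= `min_value`,
    plus the number of row groups that actually had to be read."""
    values: List[int] = []
    groups_read = 0
    for group in groups:
        # Predicate push-down: skip the group if its max rules out any match.
        if max(group[filter_column]) < min_value:
            continue
        groups_read += 1
        # Column filtering: touch only the two columns the query needs.
        for f, v in zip(group[filter_column], group[column]):
            if f >= min_value:
                values.append(v)
    return values, groups_read


sales_2017, groups_read = scan(row_groups, "sales", "year", 2017)
```

Here only one of the two row groups is read, and the `cost` column is never touched.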

Databricks has compiled a list of videos, blogs, podcasts, and more that cover Apache Spark's Structured Streaming.

https://databricks.com/blog/2017/08/24/anthology-of-technical-assets-on-apache-sparks-structured-streaming.html

The latest version of HUE supports running Apache Sqoop 1 to import data into HDFS and Hive via the UI. This post walks through the steps necessary to get going.

http://gethue.com/importing-data-from-traditional-databases-into-hdfshive-in-just-a-few-clicks/

News

Apache MADlib, which is a machine learning library for big data SQL engines, has graduated from the Apache Incubator.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces17

Releases

Pinterest runs over 1000 Kafka brokers, and thus has faced scalability challenges when it comes to replacing failed brokers, balancing workloads, and other operational tasks. They've built DoctorKafka for automating many of these tasks and providing insights into the status of the cluster via a web UI. This week, they have open-sourced the project.

https://medium.com/@Pinterest_Engineering/open-sourcing-doctorkafka-kafka-cluster-healing-and-workload-balancing-e51ad25b6b17

Version 2.7.0.0 of the StreamSets Data Collector was released. Highlights of the release include connectors for Google Cloud, change data capture for SQL Server, a JMS destination, integration with Cloudera Navigator for lineage tracking, and an Amazon S3 executor.

https://streamsets.com/blog/announcing-data-collector-v2-7-0-0/

Version 2.1.0 of Apache Kylin, the OLAP engine for Hadoop, was released. The new release adds support for RDBMS data sources and project-level query authorization, and includes over 100 bug fixes and improvements.

https://lists.apache.org/thread.html/021f340551b7816dbf7f0a2a604a90aae1e043dd7ebb9ff86e174faf@%3Cannounce.apache.org%3E

Apache Knox 0.13.0 was released and includes a number of new features—Kafka REST API integration, Spark Thriftserver UI support, Apache Atlas Proxying, and many bug fixes/improvements.

https://lists.apache.org/thread.html/53a7b2ca0f7251625bf55584fa6c787dd034ca8466cc5b0aa03416c5@%3Cannounce.apache.org%3E

The 2.1.0 release of Apache Beam was announced this week. There are new APIs for AmqpIO, CassandraIO, and HCatalogIO, initial support for streaming in the Python DirectRunner, and a number of fixes/improvements.

https://lists.apache.org/thread.html/baccc53b0a207cc04cce256f3b194b887983d73a2ca1881c051a2f3b@%3Cuser.beam.apache.org%3E

Apache HBase 1.1.12 is a bug fix release containing 10 fixes, including a few correctness issues.

https://lists.apache.org/thread.html/e8d1ccaeb521ff2782aa73f4a4d985e3892c01b067292744b826e636@%3Cannounce.apache.org%3E

Version 2.3.0 of kafka-streams, a Node.js port of the Java Kafka Streams library, has been released. The project is still considered to be under active development and not ready for production use.

https://www.npmjs.com/package/kafka-streams

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Kafka Summit Meetup (San Francisco) - Tuesday, August 29
https://www.meetup.com/Reactive-Systems/events/242744831/

Washington

Spark Structured Streaming: Introduction and Internals (Bellevue) - Wednesday, August 30
https://www.meetup.com/Seattle-Data-Science-and-Data-Engineering/events/241418432/

Virginia

Cyber Security & Apache Metron (Vienna) - Wednesday, August 30
https://www.meetup.com/futureofdata-nova/events/242357970/

Maryland

Streaming with Heron, Mesos, and Aurora (Halethorpe) - Wednesday, August 30
https://www.meetup.com/Data-Science-MD/events/241712099/

IRELAND

The SMACK Stack (Dublin) - Wednesday, August 30
https://www.meetup.com/Dublin-Apache-Kafka-Meetup-by-Confluent/events/242246982/

GERMANY

Hadoop 3.0: Revolution or Evolution? (Berlin) - Thursday, August 31
https://www.meetup.com/codecentric-Berlin/events/241641520/

SWITZERLAND

Automated Native Spark Modelling in a Managed Hadoop-as-a-Service Environment (Zurich) - Monday, August 28
https://www.meetup.com/spark-zurich/events/242330883/

ROMANIA

Spark v2.2 Workshop (Bucharest) - Friday, September 1
https://www.meetup.com/The-Bucharest-Agile-Software-Meetup-Group/events/242350581/

INDIA

Testing Spark and Scala (Bangalore) - Saturday, September 2
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/242214905/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com
