Data Eng Weekly


Data Eng Weekly Issue #276

05 August 2018

Lots of stream processing this week, and a couple of posts on moving data out of Kafka for batch processing. There's also some breadth to the coverage—Apache Atlas, Apache Hadoop 3.1, and several releases.

Sponsor

Built by narwhals, Dremio is the Data-as-a-Service platform that simplifies data engineering and data analytics. Accelerate your queries up to 1,000x. Provide your BI and data science users a self-service experience. Open source.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

A full end-to-end tutorial for Kafka Streams, including Gradle for builds, unit testing, and a Ruby client that produces test data.

http://ericlondon.com/2018/07/26/kafka-streams-java-application-to-aggregate-messages-using-a-session-window.html
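
For readers new to session windows, here is a minimal plain-Python sketch of the idea. It is not the Kafka Streams API and not code from the tutorial: events for a key stay in the same session as long as each one arrives within an inactivity gap of the previous one. The function name `sessionize`, the (key, timestamp) tuple layout, and the 5-second gap are all assumptions made for illustration.

```python
from collections import defaultdict


def sessionize(events, gap_ms):
    """Group (key, timestamp_ms) events into per-key sessions.

    A new session starts when an event arrives more than gap_ms after the
    previous event for the same key. Returns a dict mapping each key to a
    list of (start_ms, end_ms, count) tuples.
    """
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()
        result = []
        start = end = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - end <= gap_ms:
                # Still inside the inactivity gap: extend the session.
                end = ts
                count += 1
            else:
                # Gap exceeded: close the session and open a new one.
                result.append((start, end, count))
                start = end = ts
                count = 1
        result.append((start, end, count))
        sessions[key] = result
    return sessions


events = [("alice", 0), ("alice", 3000), ("bob", 1000), ("alice", 20000)]
print(sessionize(events, gap_ms=5000))
```

A real stream processor does this incrementally and has to merge sessions when late events arrive; the batch version above only shows the grouping rule.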

The Hive metastore supports tables whose partitions have differing serialization formats, which can be quite common as data is converted from a write-friendly format to a read-optimized one. This post walks through Spark internals and execution plans to determine how to get multi-format tables working in Spark.

https://blog.godatadriven.com/multiformat-spark-partition

An interesting look at bol.com's data ingestion journey in building a system to mirror data from Kafka to HDFS. They share some good lessons learned from implementing a Flink-based pipeline, and they describe the ultimate solution.

https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/

Databricks Delta provides some optimizations for working with data that go beyond core Apache Spark. This post describes, at a high level, how two of them (data skipping and Z-order clustering) work.

https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html
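
The two ideas can be sketched in a few lines of plain Python. This is a toy illustration, not how Databricks Delta implements them: data skipping keeps per-file min/max statistics and ignores files whose range cannot contain the filter value, and Z-ordering interleaves the bits of several column values so that rows close in every dimension sort near each other. The names `z_value`, `files_to_scan`, and the `file_stats` layout are assumptions.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return z


def files_to_scan(file_stats, column, value):
    """Return the files whose [min, max] range for `column` could contain `value`."""
    return [
        name
        for name, stats in file_stats.items()
        if stats[column][0] <= value <= stats[column][1]
    ]


# Hypothetical per-file statistics: column name -> (min, max).
file_stats = {
    "part-0": {"user_id": (0, 99)},
    "part-1": {"user_id": (100, 199)},
    "part-2": {"user_id": (200, 299)},
}

print(files_to_scan(file_stats, "user_id", 150))
print(z_value(3, 5))
```

Skipping only pays off when the min/max ranges are narrow, which is why sorting rows by a Z-order code across the filtered columns before writing files helps.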

This is a good overview of the core data structures in Apache Atlas for storing governance data and how to record that data via the Atlas REST API.

https://medium.com/hashmapinc/apache-atlas-using-the-v2-rest-api-6f9be1c256ae

Apache Hadoop 3.1 adds first-class support for GPUs. This post describes how it detects GPUs, implements isolation, and more.

https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/

One of the main new features in Apache Pulsar 2.1 (more about that release below) is tiered storage. With it, data can be automatically or manually migrated to S3 or other blob storage. This post includes more about the new feature.

https://streaml.io/blog/tiered-storage-in-apache-pulsar/

This article has a great introduction to the Apache Flink streaming APIs and basic SQL primitives to work with streaming data. The accompanying code is available on GitHub.

https://www.infoworld.com/article/3293426/big-data/how-to-build-stateful-streaming-applications-with-apache-flink.html

Banzai Cloud has open-sourced a Logging Operator that uses Fluentd (and a Fluent Bit daemon) to implement centralized logging for Kubernetes.

https://banzaicloud.com/blog/k8s-logging-operator/

GO-JEK has written about their data pipeline for taking data from Kafka and putting it in cold storage. They've designed an at-least-once delivery system that supports loading new Protobuf definitions without a service restart.

https://blog.gojekengineering.com/sakaar-taking-kafka-data-to-cloud-storage-at-go-jek-7839da20b5f3
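
As a reminder of what at-least-once means for a sink like this, here is a small plain-Python simulation. It is a sketch under stated assumptions, not GO-JEK's implementation: the sink writes a batch to storage first and advances the committed offset only afterwards, so a crash between the two steps causes a re-delivery rather than a loss. The function `run_sink` and the `fail_on_batch` parameter are hypothetical.

```python
def run_sink(log, committed_offset, storage, batch_size, fail_on_batch=None):
    """Copy records from `log` to `storage`, committing only after a successful write.

    Returns the new committed offset. If the simulated crash happens after the
    write but before the commit, the offset is not advanced, so the next run
    re-delivers that batch (duplicates are possible, loss is not).
    """
    offset = committed_offset
    batch_no = 0
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        storage.extend(batch)              # step 1: write to cold storage
        if batch_no == fail_on_batch:      # simulated crash before the commit
            return offset
        offset += len(batch)               # step 2: commit the offset
        batch_no += 1
    return offset


log = ["a", "b", "c", "d"]
storage = []
offset = run_sink(log, 0, storage, batch_size=2, fail_on_batch=1)  # crashes on 2nd batch
offset = run_sink(log, offset, storage, batch_size=2)              # restart and resume
print(offset, storage)
```

Every record ends up in storage, and the batch that was in flight during the crash appears twice, which is why downstream consumers of such a pipeline usually deduplicate or write idempotently.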

This post is a great resource for understanding how Kafka brokers and clients discover the cluster topology, especially when using private networks and/or IPs. It covers the broker configuration settings, describes several common challenges, and provides examples for Kafka running in Docker and in AWS.

https://rmoff.net/2018/08/02/kafka-listeners-explained/

Just over a month ago, Dremio (disclosure: Dremio sponsors Data Eng Weekly) announced the Gandiva initiative to bring LLVM code generation speedups to Apache Arrow. The Gandiva project is now live, and its README has a great overview of the basics of the code generation and optimizations that it implements so far.

https://github.com/dremio/gandiva/blob/c9d9f135ba3f5b00662fbcf048f9729b4edcc302/README.md

Apache Pulsar supports both synchronous and asynchronous replication across data centers. This post describes the tradeoffs of both, and the steps to enable replication.

https://medium.com/@pckeyan/apache-pulsar-geo-replication-ad4f0ca3224b

The dataArtisans blog has an overview of Broadcast State in Apache Flink. It's useful for efficiently joining a stream of high-volume data with a low-volume source, as is illustrated with the example in their post.

https://data-artisans.com/blog/a-practical-guide-to-broadcast-state-in-apache-flink
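
The pattern can be sketched in plain Python. This is not the Flink API and not the example from the post: the low-volume stream (rules) is sent to every parallel task, so each task holds a full copy, while the high-volume stream (events) is partitioned by key and checked against the local copy. The `Task` class, the `run` helper, the tuple layout, and the threshold rule are all assumptions.

```python
class Task:
    """One parallel instance of the operator (hypothetical, for illustration)."""

    def __init__(self):
        self.rules = {}      # broadcast state: identical on every task
        self.matches = []

    def on_rule(self, name, threshold):
        self.rules[name] = threshold

    def on_event(self, user, amount):
        # Evaluate the event against every rule known to this task.
        for name, threshold in self.rules.items():
            if amount >= threshold:
                self.matches.append((user, name))


def run(tasks, stream):
    for kind, *payload in stream:
        if kind == "rule":
            # Low-volume input: broadcast to all tasks.
            for task in tasks:
                task.on_rule(*payload)
        else:
            # High-volume input: route to one task by a deterministic key hash.
            user = payload[0]
            tasks[sum(map(ord, user)) % len(tasks)].on_event(*payload)


stream = [
    ("event", "ann", 500),     # arrives before any rule exists, so no match
    ("rule", "large", 100),
    ("event", "ann", 500),
    ("event", "bob", 50),
    ("event", "bob", 150),
]
tasks = [Task(), Task()]
run(tasks, stream)
print(sorted(m for t in tasks for m in t.matches))
```

Because the rules are replicated rather than the events, the join needs no shuffle of the high-volume side beyond its normal keying.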

Sponsor

Unravel is using machine learning to simplify Kafka operations and get the best performance, predictability, and reliability from Kafka-based applications. See how.

http://bit.ly/unravel-simplify-kafka-ops

News

The Data Protection Officer role might not be a new one, but it's under the spotlight thanks to GDPR. This post has more about what kinds of orgs need a DPO and what is expected of the role.

https://www.datanami.com/2018/08/01/now-hiring-data-protection-officer/

Confluent announced that Distributed Masonry, the group behind the Onyx Platform and Pyrostore, is joining their team.

https://www.confluent.io/blog/welcoming-the-distributed-masonry-team-to-confluent/

This post has a good overview of the talks at the recent Apache Hadoop meetup in Bangalore, which covered Ozone (the blob store API for HDFS), Myntra's data ingestion platform, scaling Hadoop at LinkedIn, Presto performance optimizations, and migrating to Hadoop 3.

https://hortonworks.com/blog/apache-hadoop-meetup-july-2018-bangalore-chapter/

Two moderate CVEs in Apache Kafka authentication were disclosed. The first allows clients to impersonate other users and the second allows clients to interfere with replication. Both can be mitigated by upgrading to the latest version on the various branches.

https://lists.apache.org/thread.html/843af59d4820692838fe5b02ddf006f81cc991efb7da5a3ef0b2e6e8@%3Cannounce.apache.org%3E
https://lists.apache.org/thread.html/0d991f38106fa32f6ed5cbda385c3f008cb76664c3c796e869b7d453@%3Cannounce.apache.org%3E

This article has a good set of recommendations for how to incorporate data ethics into the product and hiring processes.

https://www.oreilly.com/ideas/datas-day-of-reckoning

Jobs

Check out the jobs from the Data Eng Weekly board:

Or, submit your own job at https://jobs.dataengweekly.com/

Releases

Spring Cloud Data Flow 1.6 is released. It includes improvements to the Cloud Foundry and Kubernetes deployments, UX improvements to the dashboard, and more.

https://content.pivotal.io/blog/spring-cloud-data-flow-1-6-is-ga-pcf-scheduler-integration-kubernetes-1-10-and-spring-boot-2-0-compatibility-ci-cd-improvements-and-more

Apache Kafka 2.0.0 is released with a bunch of changes including 40 KIPs. Major changes include improvements to ACLs, OAuth2 bearer tokens for authenticating Kafka brokers, improvements to the replication protocol, several Kafka client improvements, and more.

https://lists.apache.org/thread.html/2d988566b7cd68f94ed3e8b92cae93d9338dbeabe1578de6bc8f278c@%3Cannounce.apache.org%3E

Faust is a Python stream processing library that's based on Kafka Streams. It requires Python 3.6 or later, as it's built on async/await. It supports several different types of aggregation/windowing and includes local-state support via RocksDB. Version 1.0.25 was just released.

https://github.com/robinhood/faust/releases/tag/1.0.25

Apache Flink 1.5.2 is out. It resolves several bugs and includes some improvements.

https://flink.apache.org/news/2018/07/31/release-1.5.2.html

Confluent Platform 5.0 was released. Based on Apache Kafka 2.0, it includes several features like LDAP integration, disaster recovery, support for user-defined functions in KSQL, and a MQTT Proxy.

https://www.confluent.io/blog/introducing-confluent-platform-5-0/

Apache Knox, which is a REST API gateway for Hadoop clusters, released version 1.1.0. There are a number of usability improvements across a total of 175 tickets.

https://lists.apache.org/thread.html/90b581bdda46fe04300983715d41204aecf592fe23740fa986f01f24@%3Cannounce.apache.org%3E

Version 2.1.0-incubating of the Apache Pulsar streaming platform is now available. The release has several new features like tiered storage, stateful functions, a Go client, and support for Avro and Protobuf schemas.

http://pulsar.apache.org/release-notes/#2.1.0-incubating

There were several dot releases with bug fixes to Apache Cassandra—2.2.13, 3.0.17, and 3.11.3.

http://mail-archives.apache.org/mod_mbox/cassandra-dev/201808.mbox/%3C626e17a4-eb80-515a-df96-0a90090c9fb2%40pbandjelly.org%3E
http://mail-archives.apache.org/mod_mbox/cassandra-dev/201808.mbox/%3C9e246487-b641-52b8-e851-5117b5cf0fb0%40pbandjelly.org%3E
http://mail-archives.apache.org/mod_mbox/cassandra-dev/201808.mbox/%3C330abb8d-0672-9865-6339-b4f35b48e910%40pbandjelly.org%3E

HugeGraph 0.7.4 was released. HugeGraph is a large-scale OLTP graph database. It's built to be compliant with Apache TinkerPop and supports several backends like RocksDB, Cassandra, and MySQL.

https://github.com/hugegraph/hugegraph/releases/tag/v0.7.4

Events

Curated by Datadog ( http://www.datadog.com )

California

Big Data Systems for Operational Analytics (San Francisco) - Wednesday, August 8
https://www.meetup.com/SF-Big-Analytics/events/252678379/

How Uber Uses Open Source Big Data Technologies (Palo Alto) - Wednesday, August 8
https://www.meetup.com/UberEvents/events/252916656/

Event-Driven Clojure Apps on Kafka and the Confluent Platform (Santa Monica) - Wednesday, August 8
https://www.meetup.com/Los-Angeles-Clojure-Users-Group/events/253215847/

Colorado

Gokhan Oner: Fast and Easy Big Data Stream Processing (Boulder) - Tuesday, August 7
https://www.meetup.com/BoulderJavaUsersGroup/events/236144884/

Denver - Wednesday, August 8
https://www.meetup.com/DenverJavaUsersGroup/events/246706183/

Data Engineering: Introduction to AWS Glue for ETL (Denver) - Wednesday, August 8
https://www.meetup.com/Denver-All-Things-Data/events/252683432/

AUSTRALIA

Using dplyr for ETL, Because SQL Is So Last Decade (Sydney) - Wednesday, August 8
https://www.meetup.com/R-Users-Sydney/events/253044564/