Data Eng Weekly


Data Eng Weekly Issue #293

16 December 2018

Posts this week covering the circuit breaker pattern and distributed transactions for microservices, a deep dive on secure configuration in Apache Kafka, Trivago's move from Apache Hive to PySpark, a new open source library for JW Player to denormalize CDC stream data, and more. Several news articles, including the first of many year in review posts, and a smattering of releases round out the issue.

Technical

A look at some solutions for implementing multi-service consistency in a microservices architecture. The article describes how one might use both two-phase commit and the saga pattern, and the trade-offs of each solution.

https://developers.redhat.com/blog/2018/10/01/patterns-for-distributed-transactions-within-a-microservices-architecture/
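
To make the saga idea concrete, here's a minimal sketch in plain Python (not tied to any framework from the article): each local transaction is paired with a compensating action, and a failure triggers the compensations for the completed steps in reverse order.

```python
class Saga:
    """Toy saga coordinator: run each local transaction in order; if one
    fails, run the compensations for the already-completed steps in
    reverse order (best-effort rollback)."""

    def __init__(self):
        self.steps = []  # (action, compensation) pairs

    def add_step(self, action, compensation):
        self.steps.append((action, compensation))
        return self

    def run(self):
        done = []
        for action, compensation in self.steps:
            try:
                action()
            except Exception:
                for comp in reversed(done):
                    comp()  # undo the steps that already committed
                return False
            done.append(compensation)
        return True
```

A real implementation also has to handle compensations that themselves fail and make every step idempotent, which is where much of the trade-off against two-phase commit lives.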

The Datadog blog has a post on ActiveMQ covering its architecture (including the Java Message Service/JMS API and its message persistence model), its JMX, JVM, and host-level metrics, and how to monitor those metrics with Datadog. There are example alerts to detect out-of-memory and out-of-disk conditions, runaway queues, and more.

https://www.datadoghq.com/blog/activemq-architecture-and-metrics/

Apache Spark 2.4 has a new API for reading image data into a data frame. The Databricks blog has an overview of the API, a description of the schema for image data, and some example code for building a deep learning pipeline that consumes images.

https://databricks.com/blog/2018/12/10/introducing-built-in-image-data-source-in-apache-spark-2-4.html

GraphIt is a new graph processing library and DSL that shows some great performance improvements over other frameworks. It's a research project out of MIT and Adobe Research, and the code is on GitHub.

https://www.datanami.com/2018/12/10/graphit-promises-big-speedup-in-graph-processing/

Trivago's data science team migrated ML pipelines from Apache Hive to PySpark. They cite several reasons for the change, including Hive's SQL dialect, the challenge of implementing unit tests (they were instead testing on a replica of a production cluster), slow time to production, and poor developer tools. With Spark, they use R for analysis and PySpark for production. Not everything was smooth sailing, though—obstacles include problems with UDFs, optimizing joins, repartitioning data, and more.

https://medium.com/@trivagotech/teardown-rebuild-migrating-from-hive-to-pyspark-324176a7ce5
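
On the unit-testing point: one way teams get there with PySpark is to keep business rules in plain Python functions and wrap them as UDFs only at the edge. A hypothetical sketch (the rule and names are illustrative, not Trivago's code):

```python
def label_item(click_count, threshold=10):
    # Hypothetical business rule: classify an item by its click count.
    # Kept as a plain function so it can be unit-tested with no cluster.
    return "popular" if click_count >= threshold else "long_tail"

# In the PySpark job itself this would be registered as a UDF, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   label_udf = udf(label_item, StringType())
# while the test suite exercises label_item directly:
assert label_item(42) == "popular"
assert label_item(3) == "long_tail"
```

This separation is exactly what's hard to get with logic embedded in Hive SQL, where testing tends to require a live (or replica) cluster.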

The StreamSets blog has a post walking through how to configure StreamSets as both a reader and writer for Apache Pulsar.

https://streamsets.com/blog/getting-started-apache-pulsar-streamsets-data-collector/

A thorough tour of encryption in transit, authentication, and authorization in Apache Kafka. It rounds out the Kafka security topics by covering ACLs and authentication from Kafka to ZooKeeper.

https://medium.com/@rinu.gour123/apache-kafka-security-need-and-components-of-kafka-52b417d3ca77
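
For flavor, a broker-side configuration enabling TLS, SASL/SCRAM authentication, and ACL-based authorization looks roughly like the fragment below (paths, passwords, and the super-user principal are placeholders; see the post for the full set of options):

```properties
listeners=SASL_SSL://0.0.0.0:9093
security.inter.broker.protocol=SASL_SSL
sasl.enabled.mechanisms=SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512

# Encryption in transit
ssl.keystore.location=/var/private/ssl/kafka.broker.keystore.jks
ssl.keystore.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.broker.truststore.jks
ssl.truststore.password=changeit

# Authorization via ACLs; deny by default except for super users
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
allow.everyone.if.no.acl.found=false
super.users=User:admin
```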

The circuit breaker pattern in distributed systems can improve user experience and mitigate cascading failures by failing requests quickly when a downstream system is timing out. This post has a great overview of circuit breakers and how they're implemented in the Istio service mesh for Kubernetes and in the Hystrix Java library.

https://www.exoscale.com/syslog/istio-vs-hystrix-circuit-breaker/
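
The core mechanism is small enough to sketch. Here's a toy Python version (illustrative only; Hystrix and Istio add rolling metrics windows, thread isolation, and much more):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, rejects calls immediately while open, and lets one probe
    call through after `reset_timeout` seconds (a crude half-open state)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # timeout elapsed: allow a probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The key property is that while the circuit is open, callers get an immediate error instead of tying up threads waiting on a timing-out downstream service.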

This tutorial shows how to build an HTTP data ingestion framework on the AWS stack. Amazon API Gateway, AWS Lambda, and Amazon Kinesis Data Firehose are used to ingest data and land it in S3 as Parquet files. From there, Athena and S3 Select can be used to query the data.

https://medium.com/@geneng/real-time-data-collection-pipeline-at-scale-7cf1f6976da9
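
As a sketch of the Lambda piece (the stream name and payload shape are made up for illustration, and the record-building logic is split out so it can be tested without AWS):

```python
import json

def build_firehose_record(event_body):
    """Parse the incoming API Gateway request body and shape it for
    Kinesis Data Firehose, which expects a blob of bytes per record;
    a trailing newline keeps the landed S3 objects line-delimited."""
    payload = json.loads(event_body)
    return {"Data": (json.dumps(payload) + "\n").encode("utf-8")}

def handler(event, context):
    # boto3 ships with the Lambda runtime; imported lazily so the pure
    # record-building logic above stays unit-testable anywhere.
    import boto3
    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName="ingest-stream",  # hypothetical stream name
        Record=build_firehose_record(event["body"]),
    )
    return {"statusCode": 200}
```

API Gateway invokes the handler with the HTTP body, and Firehose handles batching, buffering, and (with a conversion configured) writing Parquet to S3.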

JW Player has a post on Southpaw, their tool for denormalizing records in Kafka using a streaming left outer join. It was built for their change data capture use case (for which they're using Debezium). The system architecture is based on the New York Times' Monolog, and Southpaw is a purpose-built alternative to streaming frameworks like Flink or Kafka Streams. Southpaw is an open-source project on GitHub.

https://medium.com/jw-player-engineering/southpaw-176aea5f4583
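
The underlying idea of a streaming left outer join can be sketched in a few lines of Python (a toy, in-memory stand-in; Southpaw itself keeps persistent state and consumes from Kafka):

```python
class DenormalizingJoin:
    """Toy streaming left outer join between a parent topic and a child
    topic, keyed by the parent's primary key. Each incoming record
    updates local state and re-emits the denormalized parent view."""

    def __init__(self):
        self.parents = {}   # parent_id -> parent record
        self.children = {}  # parent_id -> {child_id: child record}

    def on_parent(self, parent_id, record):
        self.parents[parent_id] = record
        return self._emit(parent_id)

    def on_child(self, parent_id, child_id, record):
        self.children.setdefault(parent_id, {})[child_id] = record
        # Left outer join: the parent drives output, so a child arriving
        # before its parent is buffered but emits nothing yet.
        return self._emit(parent_id) if parent_id in self.parents else None

    def _emit(self, parent_id):
        joined = dict(self.parents[parent_id])
        joined["children"] = list(self.children.get(parent_id, {}).values())
        return joined
```

Because any side of the join can change at any time, the state for both sides has to be retained indefinitely, which is a big part of why a purpose-built tool can beat a general streaming framework here.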

A look at how one organization has moved to Amazon Redshift Spectrum for their data warehouse. Their other tools interact with Amazon S3, so this shift in architecture has eliminated some consistency problems that arose when S3 and Redshift fell out of sync. The article has some implementation details covering AWS Glue, data formats, and more.

https://medium.com/@hoiy/using-redshift-spectrum-as-our-primary-query-engine-26c768afeb5d

The Confluent blog describes two patterns for deploying KSQL applications—via a set of static queries in a headless mode and exposing a REST API for interactive queries. KSQL compiles to Kafka Streams under the hood, so it's possible to scale out the number of instances in either case.

https://www.confluent.io/blog/deep-dive-ksql-deployment-options

A good introduction to MapR-DB, covering its JSON API and CLI, accessing data from Apache Spark, and querying data with SQL via Apache Drill.

https://medium.com/@anicolaspp/interacting-with-mapr-db-58c4f482efa1

The AWS Blog has a post describing all of the security controls and best practices for Amazon EMR.

https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/

It can be challenging to decide when to push operations into SQL vs. implementing them in your application code. This post suggests some situations where you might want to do more with your SQL queries.

https://geshan.com.np/blog/2018/12/you-can-do-it-in-sql/
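
A small example of the trade-off using Python's built-in sqlite3 (the table and data are invented for illustration): the grouping, summing, and sorting all happen in the database rather than in application code, so only one row per customer crosses the wire.

```python
import sqlite3

# In-memory table of orders; the aggregation is pushed into SQL instead
# of fetching every row and summing in application code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)

# One round trip, one aggregated row per customer, sorted by the database:
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 12.5), ('bob', 5.0)]
```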

Jobs

The end of the year finds lots of folks looking for a change. Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

etcd, the distributed key value store, has been accepted into the Cloud Native Computing Foundation (CNCF).

https://www.redhat.com/en/blog/red-hat-contributes-etcd-cornerstone-kubernetes-cloud-native-computing-foundation

This article offers a good discussion of the tension between growing adoption through permissive open source licenses and capturing the value of the software. There are several examples of how permissive licenses are playing out and how some companies are licensing their software to compete in a SaaS world—the notion of Commercial Open Source Software.

https://www.cbronline.com/opinion/cloud-native-open-source

A commentary on the rise of Kubernetes and how recent features for persistence are making it a contender in the big data space (although tools like YARN are still useful for queuing jobs).

https://www.zdnet.com/article/the-rise-of-kubernetes-epitomizes-the-move-from-big-data-to-flexible-data/

In the first of what will likely be many year in review articles, Datanami looks at the themes and trends in big data. It covers things like GDPR, the big funding rounds that companies are pulling in, and the Hortonworks + Cloudera merger.

https://www.datanami.com/2018/12/13/2018-a-big-data-year-in-review/

Apache Griffin, the tool for defining and measuring data quality metrics, has been promoted to a top-level Apache Software Foundation project.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces43

Confluent announced this week that they're relicensing several components in their stack under the Confluent Community License. From their post, which has much more on the reasoning behind the decision as well as several FAQs: "This new license allows you to freely download, modify, and redistribute the code (very much like Apache 2.0 does), but it does not allow you to provide the software as a SaaS offering."

https://www.confluent.io/blog/license-changes-confluent-platform

Releases

Version 0.14.0 of Apache Gobblin (incubating) was released. Gobblin is a data integration framework for moving data between systems that originated at LinkedIn. The release includes lots of new features for the Gobblin-as-a-Service component.

https://lists.apache.org/thread.html/dfc3193a11a1f43384d7fb17631d0880543ee716d0033f3c2ab705b2@%3Cannounce.apache.org%3E

Hortonworks DataFlow 3.3 was released with support for Apache Kafka 2.0 and Kafka Streams. The post has more details about the features of the release.

https://hortonworks.com/blog/whats-new-in-hortonworks-dataflow-hdf-3-3/

Version 2.2 of the Lenses Platform for Apache Kafka and Kubernetes has been released. Major features include a new data security and policy feature, improvements to the SQL engine that supports both data at rest and data in Kafka, and more. The Landoop blog has additional details on the release.

https://www.landoop.com/blog/2018/12/lenses-22-release/

The Streamlio Community Edition is now available as a Kubernetes application on Google Cloud Platform.

https://streaml.io/about/newsreleases/streamlio-brings-cloud-native-fast-data-processing-powered-by-apache-pulsar-to-google-cloud-platform

Apache Hivemall (incubating), the machine learning library implemented as UDFs, released version 0.5.2-incubating.

https://lists.apache.org/thread.html/af4bfa6e44a89092f239a292686eb878936c0fb65652164227f7e777@%3Cannounce.apache.org%3E

The DataStax Kafka Connector is a new application for streaming data from Apache Kafka to DataStax Enterprise clusters. It has a number of features, which are described in the blog post.

https://www.datastax.com/2018/12/introducing-the-datastax-apache-kafka-connector

Loki is a new open source project that aims to be "Like Prometheus, but for logs."

https://github.com/grafana/loki

Sparklens is a tool for analyzing the performance of Spark jobs. Qubole has created a new website, Sparklens Report, to visualize the JSON reports it generates.

https://www.qubole.com/blog/introducing-sparklens-report/

Apache Beam 2.9.0 is released. It includes a number of dependency upgrades and fixes for the Flink and Spark runners.

https://beam.apache.org/blog/2018/12/13/beam-2.9.0.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Washington

Kubernetes Seattle: K8s at Salesforce, a Deep Dive (Bellevue) - Wednesday, December 19
https://www.meetup.com/Seattle-Kubernetes-Meetup/events/256954051/

North Carolina

Getting Started with Apache Spark (Raleigh) - Tuesday, December 18
https://www.meetup.com/tripass/events/256846805/

New York

Automated Testing in the Modern Data Warehouse (New York) - Tuesday, December 18
https://www.meetup.com/Analytics-Data-Science-by-Dataiku-NY/events/256572835/

POLAND

Real-Time Stream Processing with Apache Flink (Krakow) - Tuesday, December 18
https://www.meetup.com/NOVOTech-Tour/events/256775875/

ISRAEL

Big Data Meetup (Tel Aviv-Yafo) - Tuesday, December 18
https://www.meetup.com/TaboolaIL/events/256906849/

SOUTH KOREA

Flink Seoul First Meetup with Data Artisans (Seoul) - Tuesday, December 18
https://www.meetup.com/Seoul-Apache-Flink-Meetup/events/255575590/