16 December 2018
Posts this week covering the circuit breaker pattern and distributed transactions for microservices, a deep dive on secure configuration in Apache Kafka, Trivago's move from Apache Hive to PySpark, a new open source library from JW Player for denormalizing CDC stream data, and more. Several news articles, including the first of many year in review posts, and a smattering of releases round out the issue.
A look at some solutions for implementing multi-service consistency in a microservices architecture. The article describes how one might use both two-phase commit and the saga pattern, and the trade-offs of each solution.
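To make the trade-off concrete, here's a minimal Python sketch of the saga pattern's core mechanic: run each step, and if one fails, run the compensating actions for the steps that already succeeded, in reverse order. The step and compensation names are hypothetical placeholders, not from the article.

```python
# Minimal saga sketch: execute steps in order; on failure, undo the
# completed steps by running their compensations in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables.

    Returns True if every action succeeded, False if the saga was
    rolled back via compensations.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # A step failed: compensate completed steps, newest first.
            for comp in reversed(completed):
                comp()
            return False
    return True
```

Unlike two-phase commit, nothing here holds locks across services; the cost is that compensations must be written for every step and the system is only eventually consistent.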
The Datadog blog has a post on ActiveMQ covering its architecture (including the Java Message Service/JMS API and its message persistence model), monitoring ActiveMQ via its JMX, JVM, and host-level metrics, and monitoring metrics with Datadog. There are example alerts to detect out of memory, out of disk, runaway queues, and more.
https://www.datadoghq.com/blog/activemq-architecture-and-metrics/
Apache Spark 2.4 has a new API for reading image data into a data frame. The Databricks blog has an overview of the API, a description of the schema for image data, and some example code for building a deep learning pipeline that consumes images.
GraphIt is a new graph processing library and DSL that shows some great performance improvements over other frameworks. It's a research project out of MIT and Adobe Research, and the code is on GitHub.
https://www.datanami.com/2018/12/10/graphit-promises-big-speedup-in-graph-processing/
Trivago's data science team migrated ML pipelines from Apache Hive to PySpark. They cite several reasons for the change, including Hive's SQL dialect, the challenge of implementing unit tests (they were instead testing on a replica of a production cluster), slow time to production, and poor developer tools. With Spark, they use R for analysis and PySpark for production. Not everything was smooth sailing, though—obstacles include problems with UDFs, optimizing joins, repartitioning data, and more.
https://medium.com/@trivagotech/teardown-rebuild-migrating-from-hive-to-pyspark-324176a7ce5
The StreamSets blog has a post walking through how to configure StreamSets as both a reader and writer for Apache Pulsar.
https://streamsets.com/blog/getting-started-apache-pulsar-streamsets-data-collector/
A thorough tour of encryption in transit, authentication, and authorization in Apache Kafka. It rounds out the Kafka security topics by covering ACLs and authentication from Kafka to ZooKeeper.
https://medium.com/@rinu.gour123/apache-kafka-security-need-and-components-of-kafka-52b417d3ca77
The circuit breaker pattern in distributed systems can improve user experience and mitigate cascading failures by failing requests quickly when a downstream system is timing out. This post has a great overview of circuit breakers and how they're implemented in the Istio service mesh for Kubernetes and in the Hystrix Java library.
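The state machine behind the pattern is small enough to sketch. The toy Python class below (an illustration of the concept, not Istio's or Hystrix's implementation) trips open after a run of consecutive failures, fails fast while open, and lets a trial call through after a reset timeout (the "half-open" state).

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: trips open after `max_failures` consecutive
    failures, fails fast until `reset_timeout` seconds pass, then lets
    one trial call through (half-open)."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # success closes the circuit
        return result
```

The key user-experience win is the fail-fast path: while the circuit is open, callers get an immediate error instead of waiting on a timeout for a service that is already struggling.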
https://www.exoscale.com/syslog/istio-vs-hystrix-circuit-breaker/
This tutorial shows how to build an HTTP data ingestion framework on the AWS stack. Amazon API Gateway, AWS Lambda, and Amazon Kinesis Data Firehose are used to ingest data and land it in S3 as Parquet files. From there, Athena and S3 Select can be used to query the data.
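The glue between API Gateway and Firehose is a small Lambda function. Here's a hypothetical sketch of that handler (not the tutorial's code): it shapes the API Gateway proxy event into a newline-delimited JSON record for Firehose. The actual delivery call is injected so the record-shaping logic stands alone; in production it would be boto3's `firehose` client `put_record`, and the stream name here is made up.

```python
import base64
import json

def build_firehose_record(event):
    """Turn an API Gateway proxy event body into a Firehose record payload."""
    body = event["body"]
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    payload = json.loads(body)
    # Newline-delimited JSON so Firehose can concatenate records
    # into S3 objects cleanly.
    return (json.dumps(payload) + "\n").encode("utf-8")

def handler(event, context, put_record=None):
    data = build_firehose_record(event)
    if put_record is not None:  # e.g. boto3 firehose client's put_record
        put_record(DeliveryStreamName="ingest-stream", Record={"Data": data})
    return {"statusCode": 200, "body": json.dumps({"bytes": len(data)})}
```

Firehose then handles buffering and the Parquet conversion before writing to S3, so the Lambda stays a thin, stateless shim.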
https://medium.com/@geneng/real-time-data-collection-pipeline-at-scale-7cf1f6976da9
JW Player has a post on Southpaw, their tool for denormalizing records in Kafka using a streaming left outer join. It was built for their change data capture use case (for which they're using Debezium). The system architecture is based on the New York Times' Monolog, and Southpaw is a purpose-built alternative to streaming frameworks like Flink or Kafka Streams. Southpaw is an open-source project on GitHub.
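The core idea of denormalizing via a streaming left outer join can be sketched in plain Python: keep the latest state of each parent and child record, and re-emit the joined parent document whenever either side changes. This is an illustration of the concept only, not Southpaw's implementation, and the record shapes are made up.

```python
# Toy streaming left outer join for denormalization: parents are the
# "left" side, so a parent emits even when it has no children yet,
# while an orphaned child emits nothing until its parent arrives.

class DenormalizingJoin:
    def __init__(self):
        self.parents = {}   # parent_key -> latest parent record
        self.children = {}  # parent_key -> {child_key: latest child record}

    def _emit(self, parent_key):
        parent = self.parents.get(parent_key)
        if parent is None:
            return None  # no parent yet: left outer join emits nothing
        joined = dict(parent)
        joined["children"] = sorted(
            self.children.get(parent_key, {}).values(),
            key=lambda c: c["id"])
        return joined

    def on_parent(self, record):
        self.parents[record["id"]] = record
        return self._emit(record["id"])

    def on_child(self, record):
        self.children.setdefault(record["parent_id"], {})[record["id"]] = record
        return self._emit(record["parent_id"])
```

A real system like Southpaw also has to persist this state and handle deletes, which is where most of the engineering effort goes.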
https://medium.com/jw-player-engineering/southpaw-176aea5f4583
A look at how one organization has moved to Amazon Redshift Spectrum for their data warehouse. Their other tools interact with Amazon S3, so this shift in architecture has eliminated some consistency problems that arose when S3 and Redshift fell out of sync. The article has some implementation details covering AWS Glue, data formats, and more.
https://medium.com/@hoiy/using-redshift-spectrum-as-our-primary-query-engine-26c768afeb5d
The Confluent blog describes two patterns for deploying KSQL applications—via a set of static queries in a headless mode and exposing a REST API for interactive queries. KSQL compiles to Kafka Streams under the hood, so it's possible to scale out the number of instances in either case.
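For the headless mode, the queries are baked into a file that every KSQL server instance runs at startup. The fragment below is a hedged sketch of what such a queries file might look like; the stream, topic, and column names are made up, and the exact startup invocation depends on the KSQL version.

```sql
-- queries.sql: hypothetical statements for a headless KSQL deployment,
-- passed to the server at startup (e.g. via a --queries-file argument)
-- rather than submitted interactively over the REST API.
CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id;
```

Because the statements compile down to Kafka Streams topologies, adding more server instances with the same queries file scales out processing the same way adding Streams application instances would.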
https://www.confluent.io/blog/deep-dive-ksql-deployment-options
A good introduction to MapR-DB, including its JSON API, CLI, accessing data from Apache Spark, and using Apache Drill to query data using SQL.
https://medium.com/@anicolaspp/interacting-with-mapr-db-58c4f482efa1
The AWS Blog has a post describing all of the security controls and best practices for Amazon EMR.
https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/
It can be challenging to decide when to push operations into SQL vs. implementing them in your application code. This post suggests some situations where you might want to do more with your SQL queries.
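As a small illustration of the kind of push-down the post advocates (this example is mine, not the author's), a conditional aggregate can be computed in one SQL round trip instead of fetching every row and looping in application code:

```python
import sqlite3

# Pushing work into SQL: compute revenue and a refund count in a single
# query, rather than pulling all rows back and aggregating in Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "paid", 10.0),
    (2, "refunded", 5.0),
    (3, "paid", 7.5),
])

row = conn.execute("""
    SELECT SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS revenue,
           COUNT(CASE WHEN status = 'refunded' THEN 1 END) AS refunds
    FROM orders
""").fetchone()
revenue, refunds = row
```

Beyond saving a round trip, the database can use its indexes and only ships two numbers over the wire instead of the whole table.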
https://geshan.com.np/blog/2018/12/you-can-do-it-in-sql/
The end of the year finds lots of folks looking for a change. Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/
etcd, the distributed key value store, has been accepted into the Cloud Native Computing Foundation (CNCF).
This article offers a good discussion of the tension between growing adoption through permissive open source licenses and capturing the value of the software. There are several examples of how permissive licenses are playing out and how some companies are licensing their software to compete in a SaaS world—the notion of Commercial Open Source Software.
https://www.cbronline.com/opinion/cloud-native-open-source
A commentary on the rise of Kubernetes and how recent features for persistence are making it a contender in the big data space (although tools like YARN are still useful for queuing jobs).
In the first of what's likely many year in review articles, Datanami looks at the themes and trends in big data. It covers things like GDPR, the big funding rounds that companies are pulling in, and the Hortonworks + Cloudera merger.
https://www.datanami.com/2018/12/13/2018-a-big-data-year-in-review/
Apache Griffin, the tool for defining and measuring data quality metrics, has been promoted to a top-level Apache Software Foundation project.
https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces43
Confluent announced this week that they're relicensing several components in their stack under the Confluent Community License. From their post, which has much more on the reasoning behind the decision as well as several FAQs: "This new license allows you to freely download, modify, and redistribute the code (very much like Apache 2.0 does), but it does not allow you to provide the software as a SaaS offering."
https://www.confluent.io/blog/license-changes-confluent-platform
Version 0.14.0 of Apache Gobblin (incubating) was released. Gobblin is a data integration framework for moving data between systems that originated at LinkedIn. The release includes lots of new features for the Gobblin-as-a-Service component.
Hortonworks DataFlow 3.3 was released with support for Apache Kafka 2.0 and Kafka Streams. The post has more details about the features of the release.
https://hortonworks.com/blog/whats-new-in-hortonworks-dataflow-hdf-3-3/
Version 2.2 of the Lenses Platform for Apache Kafka and Kubernetes has been released. Major features include a new data security and policy feature, improvements to the SQL engine that supports both data at rest and data in Kafka, and more. The Landoop blog has additional details on the release.
https://www.landoop.com/blog/2018/12/lenses-22-release/
The Streamlio Community Edition is now available as a Kubernetes application on Google Cloud Platform.
Apache Hivemall (incubating), the machine learning library implemented as UDFs, released version 0.5.2-incubating.
The DataStax Kafka Connector is a new application for streaming data from Apache Kafka to DataStax Enterprise clusters. It has a number of features, which are described in the blog post.
https://www.datastax.com/2018/12/introducing-the-datastax-apache-kafka-connector
Loki is a new open source project that aims to be "Like Prometheus, but for logs."
https://github.com/grafana/loki
Sparklens is a tool for analyzing the performance of Spark jobs. Qubole has created a new website, Sparklens Report, to visualize the JSON reports it generates.
https://www.qubole.com/blog/introducing-sparklens-report/
Apache Beam 2.9.0 is released. It includes a number of dependency upgrades and fixes for the Flink and Spark runners.
https://beam.apache.org/blog/2018/12/13/beam-2.9.0.html
Curated by Datadog ( http://www.datadog.com )
Kubernetes Seattle: K8s at Salesforce, a Deep Dive (Bellevue) - Wednesday, December 19
https://www.meetup.com/Seattle-Kubernetes-Meetup/events/256954051/
Getting Started with Apache Spark (Raleigh) - Tuesday, December 18
https://www.meetup.com/tripass/events/256846805/
Automated Testing in the Modern Data Warehouse (New York) - Tuesday, December 18
https://www.meetup.com/Analytics-Data-Science-by-Dataiku-NY/events/256572835/
Real-Time Stream Processing with Apache Flink (Krakow) - Tuesday, December 18
https://www.meetup.com/NOVOTech-Tour/events/256775875/
Big Data Meetup (Tel Aviv-Yafo) - Tuesday, December 18
https://www.meetup.com/TaboolaIL/events/256906849/
SOUTH KOREA
Flink Seoul First Meetup with Data Artisans (Seoul) - Tuesday, December 18
https://www.meetup.com/Seoul-Apache-Flink-Meetup/events/255575590/