Data Eng Weekly


Data Eng Weekly Issue #299

27 January 2019

This week's issue covers several technologies that appear less often in the newsletter. Topics include Logstash on Kubernetes, the Alpakka stream processing framework, experiences with Bigtable, and SQL on Apache Beam. There are also several new releases (including a couple of newly open-sourced projects) this week, and a couple of good posts on general distributed systems, including one on the two-phase commit protocol.

Technical

This article shows how to run PySpark in a Docker container with a Jupyter notebook for querying data, and includes a tutorial on analyzing and visualizing open government data.

https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867
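For a flavor of the workflow the article describes, here's a minimal sketch: launch a PySpark-enabled notebook container and run a short aggregation from the notebook. The image name, dataset, and column names below are illustrative assumptions, not taken from the article.

    # docker run -p 8888:8888 jupyter/pyspark-notebook   (then open the notebook URL)
    from pyspark.sql import SparkSession, functions as F

    # Start a local Spark session inside the notebook container.
    spark = SparkSession.builder.appName("open-data-demo").getOrCreate()

    # Hypothetical open-data CSV; any file with a header row works.
    df = spark.read.csv("data/city_salaries.csv", header=True, inferSchema=True)

    # Average salary by department, highest first.
    (df.groupBy("department")
       .agg(F.avg("salary").alias("avg_salary"))
       .orderBy(F.desc("avg_salary"))
       .show(10))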

This post walks through deploying Logstash with Filebeat to ingest logs from pods running on Kubernetes.

https://towardsdatascience.com/the-basics-of-deploying-logstash-pipelines-to-kubernetes-94a470ad34d9

Alpakka is a streaming platform built atop Akka and Reactive Streams. This post describes why one company chose Alpakka over alternative stream processing frameworks. They cite the ability to make async calls to external services and to parallelize steps in the pipeline as key advantages of Alpakka over Kafka Streams and Spark Streaming.

https://medium.com/@unmeshvjoshi/choosing-a-stream-processing-framework-spark-streaming-or-kafka-streams-or-alpakka-kafka-fa0422229b25

An overview of enriching data in an Apache Beam pipeline with data from an external RDBMS. It looks at some of the key primitives and a few different solutions (with their tradeoffs).

https://medium.com/@tnymltn/enrichment-pipeline-patterns-using-apache-beam-4b9b81e5d9f3
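To illustrate one of the patterns the post covers, here's a minimal sketch in the Beam Python SDK that enriches events from a lookup dataset via a side input. The customer/segment data is invented; a real pipeline would read the lookup data from the RDBMS rather than inlining it.

    import apache_beam as beam

    def enrich(event, customers):
        # customers is a dict side input keyed by customer_id.
        info = customers.get(event["customer_id"], {})
        return {**event, "segment": info.get("segment", "unknown")}

    with beam.Pipeline() as p:
        # Inlined here; in practice this would be a read from the RDBMS.
        customers = p | "Customers" >> beam.Create([
            ("c1", {"segment": "enterprise"}),
            ("c2", {"segment": "smb"}),
        ])
        events = p | "Events" >> beam.Create([
            {"customer_id": "c1", "amount": 100},
            {"customer_id": "c3", "amount": 7},
        ])
        enriched = events | "Enrich" >> beam.Map(
            enrich, customers=beam.pvalue.AsDict(customers))
        enriched | "Print" >> beam.Map(print)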

With the caveat that you should always benchmark your own workloads, this post has a performance and cost analysis of the Intel and AMD instance types for big data systems on AWS. They found that the Intel (r5) instances are faster than the AMD (r5a) instances across a few different benchmarks.

https://www.qubole.com/blog/intel-vs-amd-comparing-instance-types-big-data-workloads/

A good introduction to Apache Kafka that helps to emphasize key features and several things Kafka explicitly doesn't do. If you're a new user of the project, this post has a lot of good context.

https://medium.freecodecamp.org/what-to-consider-for-painless-apache-kafka-integration-df559e828876
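If you've never touched Kafka, the core produce/consume loop looks something like this minimal kafka-python sketch (the broker address and topic name are placeholders):

    from kafka import KafkaProducer, KafkaConsumer

    # Produce a few messages to a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        producer.send("events", key=b"%d" % i, value=b"payload-%d" % i)
    producer.flush()

    # Consume them back. Kafka retains messages for re-reading until the
    # retention limit, and tracks each consumer group's position separately.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="demo-group",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)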

The Ravelin team shares their experiences from a year and a half of using Google Bigtable. They migrated a large PostgreSQL database to Bigtable, and discuss latency, replication, and more now that they're running on it.

https://syslog.ravelin.com/the-joy-and-pain-of-using-google-bigtable-4210604c75be

StreamSets has implemented encryption and decryption using the AWS Encryption SDK with keys from AWS Key Management Service or other places (like Vault). This tutorial shows how to implement streaming encryption and decryption when sending to/reading from Amazon Kinesis.

https://streamsets.com/blog/encrypt-decrypt-data-dataflow-pipelines/
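For a sense of what the AWS Encryption SDK does underneath that processor, here's a minimal envelope encrypt/decrypt sketch using the SDK's Python 1.x API (the KMS key ARN and record payload are placeholders):

    import aws_encryption_sdk

    # Placeholder KMS CMK ARN; raw or Vault-sourced keys are also possible.
    key_arn = "arn:aws:kms:us-east-1:123456789012:key/example"
    provider = aws_encryption_sdk.KMSMasterKeyProvider(key_ids=[key_arn])

    # Envelope-encrypt a record before sending it to Kinesis.
    ciphertext, _header = aws_encryption_sdk.encrypt(
        source=b'{"user": "abc", "card": "4111..."}',
        key_provider=provider,
    )

    # On the consuming side, decrypt with the same key provider.
    plaintext, _header = aws_encryption_sdk.decrypt(
        source=ciphertext,
        key_provider=provider,
    )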

THRON shares seven tips that they've learned working with Apache Spark in the cloud. They have recommendations on data skew, memory settings, storing data in Amazon S3 (both copying data in and choosing file formats), and more.

https://medium.com/thron-tech/optimising-spark-rdd-pipelines-679b41362a8a
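On the data skew point, a common mitigation (not necessarily THRON's exact approach) is to salt hot keys before a join so they spread across partitions. A minimal PySpark sketch with synthetic data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("salting-demo").getOrCreate()

    # A skewed fact table: ~90% of rows share one hot key (synthetic data).
    facts = spark.range(1000000).withColumn(
        "key", F.when(F.rand() < 0.9, F.lit("hot")).otherwise(F.lit("cold")))
    dims = spark.createDataFrame([("hot", 1), ("cold", 2)], ["key", "dim"])

    N = 8  # number of salt buckets to spread the hot key across
    salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
    salted_dims = dims.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(N)])))

    # Joining on (key, salt) spreads the hot key over N partitions.
    joined = salted_facts.join(salted_dims, on=["key", "salt"])
    print(joined.count())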

Woot.com writes about how they've implemented their data platform using AWS Glue, Amazon Kinesis Data Firehose, Amazon Athena, and other serverless technologies.

https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws/

WePay writes about how they've implemented a high availability solution for MySQL using GitHub's Orchestrator, HAProxy, Consul, and pt-heartbeat. In Kubernetes, they're using HAProxy as a sidecar, and consul-template provides dynamic HAProxy configuration changes. The post has a detailed overview of how both planned and unplanned failovers are handled, and how well the solution has worked thus far, keeping downtime during failover to under a minute.

https://wecode.wepay.com/posts/highly-available-mysql-clusters-at-wepay

A good take on why Apache Kafka and Kafka Connect, when used at the center of an architecture, provide an escape hatch. For example, if you make the wrong bet on a storage backend, you can send data from Kafka to a new backend to migrate away.

https://riccomini.name/kafka-escape-hatch
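Mechanically, the escape hatch is just a replay: point a fresh consumer group at the topic, read from the earliest retained offset, and load into the replacement backend. A minimal kafka-python sketch (the topic, broker, and write_to_new_backend function are all hypothetical):

    from kafka import KafkaConsumer

    def write_to_new_backend(key, value):
        # Placeholder for the replacement storage system's write path.
        print("loading", key, value)

    # A fresh group_id has no committed offsets, so with
    # auto_offset_reset="earliest" the topic replays from the beginning
    # (subject to retention).
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="backfill-new-store",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,
    )
    for msg in consumer:
        write_to_new_backend(msg.key, msg.value)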

An overview of using Helm charts to deploy the Confluent Platform on Amazon EKS. The post includes commands to open a shell on various pods for exploring and debugging.

https://medium.com/devopslinks/install-confluent-kafka-platform-onto-amazon-eks-using-helm-chart-33c36a3b112

A look at the new features in Apache Airflow 1.10.2, including improvements to support for sensors, theming of the RBAC UI, and more.

https://blog.godatadriven.com/airflow-1_10_2-release
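Among the sensor improvements is the new "reschedule" mode, which frees the worker slot between pokes instead of holding it for the sensor's whole wait. A minimal sketch (the file path and DAG details are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.sensors.file_sensor import FileSensor

    dag = DAG(
        "sensor_demo",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/events.csv",  # placeholder path
        mode="reschedule",  # new in 1.10.2: release the slot between pokes
        poke_interval=300,  # check every five minutes
        dag=dag,
    )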

Apache Beam now supports SQL via the Java SDK; this article explores the feature by running pipelines on the Google Cloud Dataflow backend.

https://medium.com/weareservian/exploring-beam-sql-on-google-cloud-platform-b6c77f9b4af4

Qubole writes about an optimization they've implemented for Apache Spark when writing data to Amazon S3. The so-called "direct writes" avoid a final copy step when writing to an Apache Hive table, which was up to 40x faster in some of their experiments.

https://www.qubole.com/blog/direct-writes-to-increase-spark-performance/

This post has a great introduction to two-phase commit, which is commonly used for transactions in distributed systems. The author argues that its complexity and problems are due to a basic assumption that needs to be rethought, and he then describes a potential replacement that simplifies implementation.

http://dbmsmusings.blogspot.com/2019/01/its-time-to-move-on-from-two-phase.html
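For readers new to the protocol, the classic flow is easy to sketch: the coordinator asks every participant to prepare (vote), and commits only on a unanimous yes. A toy Python sketch, with the protocol's well-known blocking problem noted inline:

    class Participant:
        def __init__(self, name):
            self.name = name
            self.staged = None

        def prepare(self, value):
            # Phase 1: durably stage the change and vote. A real participant
            # writes its vote to a log before replying.
            self.staged = value
            return True  # vote "yes" (a failure here would be a "no")

        def commit(self):
            print(self.name, "committed", self.staged)

        def abort(self):
            self.staged = None
            print(self.name, "aborted")

    def two_phase_commit(participants, value):
        # Phase 1 (voting): every participant must vote yes.
        votes = [p.prepare(value) for p in participants]
        # Phase 2 (decision): commit only on unanimous yes, else abort.
        # If the coordinator crashes between the phases, participants that
        # voted yes are stuck holding locks -- the blocking problem the
        # post argues should be designed away.
        if all(votes):
            for p in participants:
                p.commit()
            return "committed"
        for p in participants:
            p.abort()
        return "aborted"

    print(two_phase_commit([Participant("db1"), Participant("db2")], {"x": 1}))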

This in-progress series details the implementation of a distributed resource allocation library. The first two parts are live, covering the basic requirements, the design (including failure scenarios), and the invariants.

https://jack-vanlightly.com/blog/2019/1/25/building-a-simple-distributed-system-the-what

News

Big Data Tech Warsaw 2019 has announced some of the speakers for the upcoming conference, which takes place on February 27th.

https://medium.com/getindata-blog/netflix-booking-com-zalando-and-20-other-data-driven-companies-at-big-data-tech-warsaw-2019-2097bcb2f384

Datanami has coverage of a new threat to Apache Hadoop, Redis, and ActiveMQ clusters. Attackers are brute-forcing weak passwords of exposed services.

https://www.datanami.com/2019/01/24/more-hadoop-yarn-threats-surface/

Microsoft has acquired Citus Data, the makers of a Postgres extension for running the database on multiple nodes.

https://blogs.microsoft.com/blog/2019/01/24/microsoft-acquires-citus-data-re-affirming-its-commitment-to-open-source-and-accelerating-azure-postgresql-performance-and-scale/

Releases

Apache Hadoop 3.2.0 was released. Highlights of the release include support for Azure Data Lake Storage Gen2, enhanced S3 performance, Hadoop Submarine for deep learning, and a storage policy satisfier to move blocks amongst storage types.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces45

StreamSets and Snowflake announced a new partnership and a new StreamSets for Snowflake integration.

https://streamsets.com/blog/snowflake-streamsets-dataops-accelerating-cloud-adoption/

PingCAP has open sourced DM, its data migration platform. The 1.0.0-alpha version focuses on migrating data (including incremental migration) from MySQL/MariaDB to TiDB.

https://github.com/pingcap/dm

Version 1.2.1 of Stream Reactor, which provides Kafka Connect connectors, has been released. The new release adds a Hive source and sink (including automatic creation of Hive tables, schema migration, and Parquet and ORC file formats). There are also a number of fixes and new features across the more than 25 other connectors.

https://www.landoop.com/blog/2019/01/stream-reactor-kafka-connectors-05/

Version 3.1 of Dremio, the Data-as-a-Service platform, has been released. This release adds multitenant workload controls, performance improvements via the Gandiva initiative (an Apache Arrow integration), and more.

https://www.dremio.com/announcing-dremio-3-1/

Frafka is a Kafka implementation of Frizzle, a message bus for parallel processing using goroutines. It's a new open source project with a v0.0.1 release.

https://github.com/qntfy/frafka

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Hadoop Contributors Meetup (Mountain View) - Wednesday, January 30
https://www.meetup.com/Hadoop-Contributors/events/257793743/

Ohio

Cleveland Big Data Meetup (Cleveland) - Monday, January 28
https://www.meetup.com/Cleveland-Hadoop/events/255966056/

Florida

Self-Tuning Data Systems (Jacksonville) - Tuesday, January 29
https://www.meetup.com/jaxbigdata/events/257779529/

New York

Druid NYC Meetup @ Datadog (New York) - Tuesday, January 29
https://www.meetup.com/Apache-Druid-NYC/events/257291460/

Massachusetts

January Presentation Night @ WB Games Boston (Needham) - Thursday, January 31
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/257173292/

SWEDEN

Accelerating Data to Gain Insights! (Stockholm) - Wednesday, January 30
https://www.meetup.com/Pentaho-Stockholm-Meetup/events/257099029/

LATVIA

Evolution Gaming: Event Sourcing Using the Kafka-Journal (Riga) - Wednesday, January 30
https://www.meetup.com/Riga-Scala-Community/events/257926307/

SPAIN

Speeding Up PySpark with Arrow, by Ruben Berenguel (Barcelona) - Thursday, January 31
https://www.meetup.com/Spark-Barcelona/events/258139339/

AUSTRALIA

January Data Engineering Meetup @ Atlassian (Sydney) - Thursday, January 31
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/256905159/

Data Engineering Melbourne Meetup (Melbourne) - Thursday, January 31
https://www.meetup.com/Data-Engineering-Melbourne/events/257724136/