Canary across multi-cluster with Anthos Service Mesh Part 3/4

Authored by Nayana Madhav, Rahul Prajapati, Tushar Bhattacharya and Naveen S.R

In our previous articles, we explored the reasons behind our strategic shift towards GKE Autopilot, followed by how to scale efficiently and design for spare capacity by defining balloon pods.


In this edition, we will look at an essential launch strategy: the ability to canary across clusters using ASM.

To configure Anthos Service Mesh on your GKE clusters, ensure the following prerequisites are met:

  • Install Anthos Service Mesh version 1.11 or later using asmcli install, and have asmcli, istioctl, and the necessary samples downloaded.
  • Ensure connectivity between all pods across the clusters before configuring Anthos Service Mesh.
  • If joining clusters from different projects, register them to the same fleet host project and set up a Shared VPC configuration.
  • For private clusters, it is recommended to have a single subnet in the same VPC, or to ensure that the control planes can reach each other via the clusters' private IPs.

These prerequisites ensure the successful configuration of Anthos Service Mesh on your GKE clusters.
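As a reference, here is a minimal sketch of the asmcli-based install. The project ID, cluster name, and location are placeholders, and the exact flag set may vary with your asmcli version:

```shell
# Download asmcli (1.11 era) and make it executable.
curl https://storage.googleapis.com/csm-artifacts/asm/asmcli_1.11 > asmcli
chmod +x asmcli

# Install ASM, registering the cluster to the fleet; repeat per cluster,
# using the same fleet host project for both.
./asmcli install \
  --project_id PROJECT_ID \
  --cluster_name CLUSTER_NAME \
  --cluster_location CLUSTER_LOCATION \
  --fleet_id PROJECT_ID \
  --output_dir ./asm-output \
  --enable_all
```

The --output_dir is worth keeping: it is where asmcli leaves the istioctl binary and samples referenced later.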

Set up a multi-cluster mesh on GKE with ASM



  • The entry point for incoming traffic is the HTTPS load balancer, which is deployed only in the primary GKE Autopilot cluster.
  • After traffic enters the GKE Autopilot cluster, it is split across both clusters according to the weight ratios defined in the virtual services.
The following stages are fully automated with CI:

  • Rollout Canary: configure the multi-cluster mesh
  • Deploy Workloads: automatically deploy all workloads to the secondary cluster
  • Rollback Canary: roll back the setup to the initial single GKE Autopilot cluster configuration
  • Split-traffic-platform: split traffic across both clusters based on the canary weights

Define required Variables:

Create runtime variables to be used as weights in the virtual service configuration for the services that are directly accessible through the ingress. Here, we created them for one service only.

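A sketch of those variables, assuming the service is called zodiac-engine (the name we use throughout these examples); before the canary, all traffic stays on the primary cluster:

```shell
# Runtime weight variables for one ingress-facing service.
# Repeat per service that is reachable through the ingress.
export ZODIAC_ENGINE_AUTOPILOT_TRAFFIC=100   # % kept on the primary (Autopilot) cluster
export ZODIAC_ENGINE_STANDARD_TRAFFIC=0      # % sent to the secondary (Standard) cluster
```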

Stage 1: Deploy-Workloads

Before configuring the multi-cluster mesh as part of the on-demand GKE cluster and ASM setup, all services need to be deployed efficiently on the secondary GKE cluster. Here is a demo for the services in the Zodiac namespace.

  • The envsubst command substitutes environment variables in files, which updates the container images with the latest tags. This ensures that whenever we deploy the services, they use the latest image.
  • The following shell script fetches the image tag and updates it in the service manifests, including service.yaml, hpa.yaml, virtual-service.yaml, and destination-rule.yaml.

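A sketch of what such a script can look like. The image repository path, manifest locations, and the CTX_STANDARD context variable are illustrative assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fetch the most recent image tag for the service.
export IMAGE_TAG="$(gcloud container images list-tags \
  gcr.io/PROJECT_ID/zodiac-engine \
  --sort-by=~timestamp --limit=1 --format='value(tags[0])')"

# Substitute the tag (and any other variables) into the manifests,
# then apply everything to the secondary (GKE Standard) cluster.
for f in deployment.yaml service.yaml hpa.yaml virtual-service.yaml destination-rule.yaml; do
  envsubst < "manifests/zodiac/${f}" | kubectl apply --context="${CTX_STANDARD}" -f -
done
```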

Stage 2: Rollout Canary

Get GKE cluster context:

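A sketch of fetching both contexts; the cluster names, region, and project are placeholders, and GKE derives the kubeconfig context name as gke_PROJECT_LOCATION_CLUSTER:

```shell
gcloud container clusters get-credentials AUTOPILOT_CLUSTER --region REGION --project PROJECT_ID
gcloud container clusters get-credentials STANDARD_CLUSTER --region REGION --project PROJECT_ID

# Context names referenced by the kubectl/istioctl commands that follow.
export CTX_AUTOPILOT="gke_PROJECT_ID_REGION_AUTOPILOT_CLUSTER"
export CTX_STANDARD="gke_PROJECT_ID_REGION_STANDARD_CLUSTER"
```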

Create a Firewall Rule:

To enable cross-cluster traffic in Anthos Service Mesh, you may need to create a firewall rule in the following cases:

  • Different subnets: If your clusters in the mesh are using different subnets, you need to set up firewall rules explicitly to allow cross-subnet traffic.
  • If your Pods open ports other than the default ports 443 and 15002, you need to create firewall rules to allow traffic on those ports.
  • GKE automatically adds firewall rules for intra-subnet communication within each cluster. However, for cross-subnet traffic or non-standard ports, you must configure the firewall rules manually.

By following these instructions and setting up the appropriate firewall rules, you can ensure proper cross-cluster communication in Anthos Service Mesh.

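A sketch of such a rule; the rule name, network, pod CIDR ranges, and node tags are placeholders you would substitute for your environment:

```shell
# Allow pod-to-pod traffic between the two clusters' pod CIDR ranges.
# Add any non-default ports your pods open to the --allow list.
gcloud compute firewall-rules create asm-multicluster-pods \
  --network NETWORK_NAME \
  --direction INGRESS \
  --allow tcp,udp,icmp \
  --source-ranges "POD_CIDR_AUTOPILOT,POD_CIDR_STANDARD" \
  --target-tags "NODE_TAG_AUTOPILOT,NODE_TAG_STANDARD"
```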

Keeping Traffic in Cluster:

In some cases the default cross-cluster load balancing behavior is not desirable. To keep traffic “cluster-local” (i.e. traffic sent from cluster-a only reaches destinations in cluster-a), mark hostnames or wildcards as clusterLocal using MeshConfig.serviceSettings.

Since the same services are deployed on both clusters, when traffic first enters a cluster it should call only the services within that same cluster.

For example, you can enforce cluster-local traffic for an individual service, all services in a particular namespace, or globally for all services in the mesh.

NOTE: This is applicable to both Autopilot and Standard clusters.

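In plain Istio this is expressed through the IstioOperator's meshConfig. The Zodiac namespace below is our example; a wildcard host keeps every service in that namespace cluster-local:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    serviceSettings:
      - settings:
          clusterLocal: true
        hosts:
          - "*.zodiac.svc.cluster.local"
```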


Many documentation pages reference the IstioOperator CRD, where you create a custom resource and let Istio translate it for you. Unfortunately, with a managed control plane, this operator is hidden and not usable.

The page covering how to configure optional features on managed Anthos Service Mesh has very few configuration examples and aligns only loosely with the Istio reference, so some translation of the configuration schema is required.

  • Using the migration tool, we can convert the IstioOperator CRD into a ConfigMap for the managed control plane.


The migration tool is available as part of the asmcli script. You must download the script to use this tool.

  1. Run the migration tool:


configmap-clusterlocal.yaml


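A sketch of what the converted configmap-clusterlocal.yaml can look like. Note the ConfigMap name depends on your managed ASM release channel; istio-asm-managed is shown here as an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-asm-managed
  namespace: istio-system
data:
  mesh: |
    serviceSettings:
      - settings:
          clusterLocal: true
        hosts:
          - "*.zodiac.svc.cluster.local"
```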

After converting, apply the ConfigMap in both GKE clusters:

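Assuming the contexts captured earlier, applying it to both clusters looks like:

```shell
kubectl apply --context="${CTX_AUTOPILOT}" -f configmap-clusterlocal.yaml
kubectl apply --context="${CTX_STANDARD}" -f configmap-clusterlocal.yaml
```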

Configure Endpoint Discovery:

Configure remote secrets to allow each cluster's Anthos Service Mesh control plane to access the other cluster's API server. The commands depend on your Anthos Service Mesh type (either in-cluster or managed); here we are using managed ASM.

  • First, we need to download the istioctl executable.
  • Then, discover the control plane private IPs for both GKE clusters.
  • Finally, inject each cluster's secret into the other GKE cluster.

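A sketch of those three steps for private clusters, using the istioctl bundled in the asmcli output directory. Cluster names, region, and the secret names are placeholders:

```shell
# Discover the control plane private IPs for both clusters.
PRIV_IP_AUTOPILOT="$(gcloud container clusters describe AUTOPILOT_CLUSTER --region REGION \
  --format='value(privateClusterConfig.privateEndpoint)')"
PRIV_IP_STANDARD="$(gcloud container clusters describe STANDARD_CLUSTER --region REGION \
  --format='value(privateClusterConfig.privateEndpoint)')"

# Inject each cluster's secret into the other, pointing at the private endpoint.
istioctl x create-remote-secret --context="${CTX_AUTOPILOT}" \
  --name=autopilot --server="https://${PRIV_IP_AUTOPILOT}" \
  | kubectl apply --context="${CTX_STANDARD}" -f -

istioctl x create-remote-secret --context="${CTX_STANDARD}" \
  --name=standard --server="https://${PRIV_IP_STANDARD}" \
  | kubectl apply --context="${CTX_AUTOPILOT}" -f -
```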

The following is the complete flow of setting up the multi-cluster canary.

Stage 3: Split Traffic across both GKE Cluster

The following changes are required to achieve the desired traffic split:

  • To differentiate between the pods of the two clusters, add a suffix at the end of the deployment names. By doing so, each deployment has a unique name across both GKE clusters.
  • To redirect traffic to the respective workload based on weightage, you need to add labels to all the deployments that match the labels specified in the destination rule subsets. This can be done by modifying the deployment manifests.

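A sketch of the resulting DestinationRule and VirtualService template for the zodiac-engine service. The subset names, the cluster label, and the ${...} weight placeholders (rendered later by envsubst) are our conventions for this example, not fixed by ASM:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: zodiac-engine
  namespace: zodiac
spec:
  host: zodiac-engine.zodiac.svc.cluster.local
  subsets:
    - name: autopilot
      labels:
        cluster: autopilot   # matches the label on the primary deployment
    - name: standard
      labels:
        cluster: standard    # matches the label on the suffixed deployment
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: zodiac-engine
  namespace: zodiac
spec:
  hosts:
    - zodiac-engine.zodiac.svc.cluster.local
  http:
    - route:
        - destination:
            host: zodiac-engine.zodiac.svc.cluster.local
            subset: autopilot
          weight: ${ZODIAC_ENGINE_AUTOPILOT_TRAFFIC}
        - destination:
            host: zodiac-engine.zodiac.svc.cluster.local
            subset: standard
          weight: ${ZODIAC_ENGINE_STANDARD_TRAFFIC}
```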

  • First, it installs the gettext-base package, which provides the envsubst command for substituting environment variables in files.
  • The script sets the environment variables ZODIAC_ENGINE_AUTOPILOT_TRAFFIC and ZODIAC_ENGINE_STANDARD_TRAFFIC to determine the traffic split percentages. If we choose to allocate 70% of the traffic to the GKE Autopilot cluster, we update the value of ZODIAC_ENGINE_AUTOPILOT_TRAFFIC to 70. When we trigger the pipeline, the remaining 30% is automatically assigned to the ZODIAC_ENGINE_STANDARD_TRAFFIC variable, and that traffic is redirected to the secondary cluster, the GKE Standard cluster.
  • Finally, it applies the destination rule in the secondary cluster and updates the virtual service and destination rule in the primary GKE Autopilot cluster.

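The weight computation itself can be sketched as below (pure shell; the subsequent envsubst and kubectl apply steps are omitted here):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Desired share for the primary (Autopilot) cluster; the pipeline passes this in.
export ZODIAC_ENGINE_AUTOPILOT_TRAFFIC="${ZODIAC_ENGINE_AUTOPILOT_TRAFFIC:-70}"

# The remainder is assigned automatically to the secondary (Standard) cluster.
export ZODIAC_ENGINE_STANDARD_TRAFFIC="$((100 - ZODIAC_ENGINE_AUTOPILOT_TRAFFIC))"

echo "autopilot=${ZODIAC_ENGINE_AUTOPILOT_TRAFFIC} standard=${ZODIAC_ENGINE_STANDARD_TRAFFIC}"
```

The pipeline then renders these two variables into the VirtualService template with envsubst and applies the result to the primary cluster.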

Anthos Service Mesh Dashboard

Before the canary: 100% of traffic was pointed to the primary GKE Autopilot cluster


Splitting 95% at GKE Autopilot and 5% at GKE Standard


Splitting 75% at GKE Autopilot and 25% at GKE Standard


The following are some examples of how downstream service traffic stays pinned to the same cluster, while traffic is split equally for inter-service communications.


Stage 4: Rollback Canary Setup

Point 100% Incoming traffic to primary GKE Autopilot Cluster

  • Before destroying the setup mesh and cluster, it is crucial to ensure that 100% of the traffic is forwarded to the primary cluster only. This can be achieved by modifying the traffic splitting configuration or destination rules to direct all traffic to the primary cluster. By doing so, you can avoid any disruptions or potential data loss during the process of destroying the secondary cluster and dismantling the multi-cluster mesh setup.

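Assuming the same variables and template as before, forcing 100% of traffic back to the primary is a one-line change, sketched as:

```shell
export ZODIAC_ENGINE_AUTOPILOT_TRAFFIC=100
export ZODIAC_ENGINE_STANDARD_TRAFFIC=0
envsubst < manifests/zodiac/virtual-service.yaml \
  | kubectl apply --context="${CTX_AUTOPILOT}" -f -
```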

Cleaning up the Endpoint Discovery

This is a critical step when rolling back a canary setup. It ensures that cross-cluster communication between the canary and primary deployments no longer occurs. This step is necessary to revert to the previous stable state and prevent any unintended interactions between the secondary cluster and the primary GKE Autopilot cluster. By cleaning up the endpoint discovery, you effectively isolate the clusters and restore the normal communication patterns within each cluster.

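The remote secrets created by istioctl carry the istio/multiCluster label, so cleaning them up can be sketched as:

```shell
kubectl delete secret -n istio-system -l istio/multiCluster=true --context="${CTX_AUTOPILOT}"
kubectl delete secret -n istio-system -l istio/multiCluster=true --context="${CTX_STANDARD}"
```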

Delete Firewall Rule

To revert cross-cluster communication, you need to delete the firewall rule that was created earlier. This rule was responsible for allowing traffic between the clusters; by deleting it, you effectively disable cross-cluster communication.

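Assuming the rule name used in the earlier sketch, the deletion is:

```shell
gcloud compute firewall-rules delete asm-multicluster-pods --quiet
```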

Unregister Secondary GKE Cluster from Fleet

  • To delete Anthos Service Mesh from the cluster, you need to unregister the GKE fleet membership. This ensures that if you later create a cluster with the same name, it won't conflict with the existing fleet membership.
  • By unregistering the fleet membership, you remove the cluster's association with the fleet, allowing you to safely delete the cluster. This ensures a clean removal and avoids any potential conflicts or inconsistencies in the fleet configuration.

Following these steps will ensure that Anthos Service Mesh is fully removed and the cluster is unregistered from the fleet. Then, you can safely proceed with deleting the secondary GKE Standard cluster.

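A sketch of the teardown; membership, cluster, region, and project names are placeholders, and depending on your gcloud version the command group may be container hub rather than container fleet:

```shell
# Remove the secondary cluster from the fleet...
gcloud container fleet memberships unregister STANDARD_CLUSTER \
  --gke-cluster=REGION/STANDARD_CLUSTER --project PROJECT_ID

# ...then delete the cluster itself.
gcloud container clusters delete STANDARD_CLUSTER \
  --region REGION --project PROJECT_ID --quiet
```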

Conclusion

  • On-demand canary setup with multiple clusters allows for fine-grained traffic management and control.
  • Traffic can be split across multiple Kubernetes clusters to reduce risk and improve reliability.
  • Controlled rollouts can be achieved by directing traffic to specific clusters for testing and validation.
  • This setup enables gathering feedback, identifying issues, and making improvements before deploying changes to the entire user base.
  • Multiple clusters provide increased resilience and fault tolerance.
  • If issues arise in one cluster, traffic can be redirected to other clusters seamlessly.
  • Scaling resources on-demand ensures optimal performance and accommodates varying workloads.
  • Efficient resource allocation and handling increased traffic are possible without compromising reliability or user experience.
  • Overall, the on-demand canary setup with multiple clusters enhances traffic management capabilities and supports risk reduction, improved reliability, and controlled rollouts for the Zodiac platform.

The fourth and final part of this series, on the key facepalm moments in our journey, is captured here.



