Canary across multi-cluster with Anthos Service Mesh Part 3/4

Authored by Nayana Madhav, Rahul Prajapati, Tushar Bhattacharya and Naveen S.R

In our previous articles, we explored the reasons behind our strategic shift towards GKE Autopilot, followed by how to scale efficiently and design for spare capacity by defining balloon pods.


In this edition, we will look at an essential launch strategy: the ability to canary across clusters using ASM.

To configure Anthos Service Mesh on your GKE clusters, ensure the following prerequisites are met:

  • Install Anthos Service Mesh version 1.11 or later using asmcli install, and have asmcli, istioctl, and the necessary samples downloaded.
  • Ensure connectivity between all pods across the clusters before configuring Anthos Service Mesh.
  • If joining clusters from different projects, register them to the same fleet host project and set up a Shared VPC configuration.
  • For private clusters, it is recommended to have a single subnet in the same VPC, or to ensure that the control planes can reach each other via the clusters' private IPs.

These prerequisites ensure the successful configuration of Anthos Service Mesh on your GKE clusters.
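As a reference, here is a minimal sketch of the asmcli-based install. The project ID, cluster name, and location are placeholders, and the exact flag set may vary with your asmcli version:

```shell
# Download asmcli (1.11 era) and make it executable.
curl https://storage.googleapis.com/csm-artifacts/asm/asmcli_1.11 > asmcli
chmod +x asmcli

# Install ASM, registering the cluster to the fleet; repeat per cluster,
# using the same fleet host project for both.
./asmcli install \
  --project_id PROJECT_ID \
  --cluster_name CLUSTER_NAME \
  --cluster_location CLUSTER_LOCATION \
  --fleet_id PROJECT_ID \
  --output_dir ./asm-output \
  --enable_all
```

The --output_dir is worth keeping: it is where asmcli leaves the istioctl binary and samples referenced later.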

Set up a multi-cluster mesh on GKE with ASM



  • The entry point for incoming traffic is the HTTPS load balancer, which is deployed only in the primary GKE Autopilot cluster.
  • After traffic enters the GKE Autopilot cluster, it is split across both clusters according to the weight ratios defined in the virtual services.
The following stages are fully automated with CI:

  • Rollout Canary: configure the multi-cluster mesh
  • Deploy Workloads: automatically deploy all workloads to the secondary cluster
  • Rollback Canary: roll back the setup to the initial single GKE Autopilot cluster configuration
  • Split-traffic-platform: split traffic across both clusters based on the canary weights

Define required Variables:

Create runtime variables to be used as weights in the virtual service configuration for the services that are directly accessible through the ingress. Here, we created them for one service only.

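A sketch of those variables, assuming the service is called zodiac-engine (the name we use throughout these examples); before the canary, all traffic stays on the primary cluster:

```shell
# Runtime weight variables for one ingress-facing service.
# Repeat per service that is reachable through the ingress.
export ZODIAC_ENGINE_AUTOPILOT_TRAFFIC=100   # % kept on the primary (Autopilot) cluster
export ZODIAC_ENGINE_STANDARD_TRAFFIC=0      # % sent to the secondary (Standard) cluster
```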

Stage 1: Deploy-Workloads

Before configuring the multi-cluster mesh as part of the on-demand GKE cluster and ASM setup, all services need to be deployed efficiently on the secondary GKE cluster. Here is a demo for the services in the Zodiac namespace.

  • The envsubst command substitutes environment variables in files, which updates the container images with the latest tags. This ensures that whenever we deploy the services, they use the latest image.
  • The following shell script fetches the image tag and updates it in the service manifests, including service.yaml, hpa.yaml, virtual-service.yaml, and destination-rule.yaml.

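A sketch of what such a script can look like. The image repository path, manifest locations, and the CTX_STANDARD context variable are illustrative assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fetch the most recent image tag for the service.
export IMAGE_TAG="$(gcloud container images list-tags \
  gcr.io/PROJECT_ID/zodiac-engine \
  --sort-by=~timestamp --limit=1 --format='value(tags[0])')"

# Substitute the tag (and any other variables) into the manifests,
# then apply everything to the secondary (GKE Standard) cluster.
for f in deployment.yaml service.yaml hpa.yaml virtual-service.yaml destination-rule.yaml; do
  envsubst < "manifests/zodiac/${f}" | kubectl apply --context="${CTX_STANDARD}" -f -
done
```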

Stage 2: Rollout Canary

Get GKE cluster context:

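A sketch of fetching both contexts; the cluster names, region, and project are placeholders, and GKE derives the kubeconfig context name as gke_PROJECT_LOCATION_CLUSTER:

```shell
gcloud container clusters get-credentials AUTOPILOT_CLUSTER --region REGION --project PROJECT_ID
gcloud container clusters get-credentials STANDARD_CLUSTER --region REGION --project PROJECT_ID

# Context names referenced by the kubectl/istioctl commands that follow.
export CTX_AUTOPILOT="gke_PROJECT_ID_REGION_AUTOPILOT_CLUSTER"
export CTX_STANDARD="gke_PROJECT_ID_REGION_STANDARD_CLUSTER"
```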

Create a Firewall Rule:

To enable cross-cluster traffic in Anthos Service Mesh, you may need to create a firewall rule in the following cases:

  • Different subnets: If your clusters in the mesh are using different subnets, you need to set up firewall rules explicitly to allow cross-subnet traffic.
  • If your Pods open ports other than the default ports 443 and 15002, you need to create firewall rules to allow traffic on those ports.
  • GKE automatically adds firewall rules for intra-subnet communication within each cluster. However, for cross-subnet traffic or non-standard ports, you must configure the firewall rules manually.

By following these instructions and setting up the appropriate firewall rules, you can ensure proper cross-cluster communication in Anthos Service Mesh.

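A sketch of such a rule; the rule name, network, pod CIDR ranges, and node tags are placeholders you would substitute for your environment:

```shell
# Allow pod-to-pod traffic between the two clusters' pod CIDR ranges.
# Add any non-default ports your pods open to the --allow list.
gcloud compute firewall-rules create asm-multicluster-pods \
  --network NETWORK_NAME \
  --direction INGRESS \
  --allow tcp,udp,icmp \
  --source-ranges "POD_CIDR_AUTOPILOT,POD_CIDR_STANDARD" \
  --target-tags "NODE_TAG_AUTOPILOT,NODE_TAG_STANDARD"
```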

Keeping Traffic in Cluster:

In some cases the default cross-cluster load balancing behavior is not desirable. To keep traffic “cluster-local” (i.e. traffic sent from cluster-a only reaches destinations in cluster-a), mark hostnames or wildcards as clusterLocal using MeshConfig.serviceSettings.

Since the same services are deployed on both clusters, when traffic first enters a cluster it should call only the services within that same cluster.

For example, you can enforce cluster-local traffic for an individual service, all services in a particular namespace, or globally for all services in the mesh.

NOTE: This is applicable to both Autopilot and Standard clusters.

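In plain Istio this is expressed through the IstioOperator's meshConfig. The Zodiac namespace below is our example; a wildcard host keeps every service in that namespace cluster-local:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    serviceSettings:
      - settings:
          clusterLocal: true
        hosts:
          - "*.zodiac.svc.cluster.local"
```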


Many documentation pages reference the IstioOperator CRD, where you create a custom resource and let Istio translate it for you. Unfortunately, with a managed control plane, this operator is hidden and not usable.

The page covering how to configure optional features on managed Anthos Service Mesh has very few configuration examples and aligns only loosely with the Istio reference, so some translation of the configuration schema is required.

  • Using the migration tool, we can convert the IstioOperator CRD into a ConfigMap for the managed control plane.


The migration tool is available as part of the asmcli script. You must download the script to use this tool.

  1. Run the migration tool:


configmap-clusterlocal.yaml


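A sketch of what the converted configmap-clusterlocal.yaml can look like. Note the ConfigMap name depends on your managed ASM release channel; istio-asm-managed is shown here as an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-asm-managed
  namespace: istio-system
data:
  mesh: |
    serviceSettings:
      - settings:
          clusterLocal: true
        hosts:
          - "*.zodiac.svc.cluster.local"
```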

After converting, apply the ConfigMap in both GKE clusters:

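Assuming the contexts captured earlier, applying it to both clusters looks like:

```shell
kubectl apply --context="${CTX_AUTOPILOT}" -f configmap-clusterlocal.yaml
kubectl apply --context="${CTX_STANDARD}" -f configmap-clusterlocal.yaml
```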

Configure Endpoint Discovery:

Configure remote secrets to allow each cluster's Anthos Service Mesh control plane to access the other cluster's API server. The commands depend on your Anthos Service Mesh type (either in-cluster or managed); here we are using managed ASM.

  • First, we need to download the istioctl executable.
  • Then, discover the control plane private IPs for both GKE clusters.
  • Finally, inject each cluster's secret into the other GKE cluster.

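A sketch of those three steps for private clusters, using the istioctl bundled in the asmcli output directory. Cluster names, region, and the secret names are placeholders:

```shell
# Discover the control plane private IPs for both clusters.
PRIV_IP_AUTOPILOT="$(gcloud container clusters describe AUTOPILOT_CLUSTER --region REGION \
  --format='value(privateClusterConfig.privateEndpoint)')"
PRIV_IP_STANDARD="$(gcloud container clusters describe STANDARD_CLUSTER --region REGION \
  --format='value(privateClusterConfig.privateEndpoint)')"

# Inject each cluster's secret into the other, pointing at the private endpoint.
istioctl x create-remote-secret --context="${CTX_AUTOPILOT}" \
  --name=autopilot --server="https://${PRIV_IP_AUTOPILOT}" \
  | kubectl apply --context="${CTX_STANDARD}" -f -

istioctl x create-remote-secret --context="${CTX_STANDARD}" \
  --name=standard --server="https://${PRIV_IP_STANDARD}" \
  | kubectl apply --context="${CTX_AUTOPILOT}" -f -
```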

The following is the complete flow of setting up the multi-cluster canary.

Stage 3: Split Traffic across both GKE Cluster

The following changes are required to achieve the desired traffic split:

  • To differentiate between the pods of the two clusters, add a suffix at the end of the deployment names. By doing so, each deployment has a unique name across both GKE clusters.
  • To redirect traffic to the respective workload based on weightage, you need to add labels to all the deployments that match the labels specified in the destination rule subsets. This can be done by modifying the deployment manifests.

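A sketch of the resulting DestinationRule and VirtualService template for the zodiac-engine service. The subset names, the cluster label, and the ${...} weight placeholders (rendered later by envsubst) are our conventions for this example, not fixed by ASM:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: zodiac-engine
  namespace: zodiac
spec:
  host: zodiac-engine.zodiac.svc.cluster.local
  subsets:
    - name: autopilot
      labels:
        cluster: autopilot   # matches the label on the primary deployment
    - name: standard
      labels:
        cluster: standard    # matches the label on the suffixed deployment
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: zodiac-engine
  namespace: zodiac
spec:
  hosts:
    - zodiac-engine.zodiac.svc.cluster.local
  http:
    - route:
        - destination:
            host: zodiac-engine.zodiac.svc.cluster.local
            subset: autopilot
          weight: ${ZODIAC_ENGINE_AUTOPILOT_TRAFFIC}
        - destination:
            host: zodiac-engine.zodiac.svc.cluster.local
            subset: standard
          weight: ${ZODIAC_ENGINE_STANDARD_TRAFFIC}
```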

  • First, it installs the gettext-base package, which provides the envsubst command for substituting environment variables in files.
  • The script sets the environment variables ZODIAC_ENGINE_AUTOPILOT_TRAFFIC and ZODIAC_ENGINE_STANDARD_TRAFFIC to determine the traffic split percentages. If we choose to allocate 70% of the traffic to the GKE Autopilot cluster, we update the value of ZODIAC_ENGINE_AUTOPILOT_TRAFFIC to 70. When we trigger the pipeline, the remaining 30% is automatically assigned to the ZODIAC_ENGINE_STANDARD_TRAFFIC variable, and that traffic is redirected to the secondary cluster, the GKE Standard cluster.
  • Finally, it applies the destination rule in the secondary cluster and updates the virtual service and destination rule in the primary GKE Autopilot cluster.

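The weight computation itself can be sketched as below (pure shell; the subsequent envsubst and kubectl apply steps are omitted here):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Desired share for the primary (Autopilot) cluster; the pipeline passes this in.
export ZODIAC_ENGINE_AUTOPILOT_TRAFFIC="${ZODIAC_ENGINE_AUTOPILOT_TRAFFIC:-70}"

# The remainder is assigned automatically to the secondary (Standard) cluster.
export ZODIAC_ENGINE_STANDARD_TRAFFIC="$((100 - ZODIAC_ENGINE_AUTOPILOT_TRAFFIC))"

echo "autopilot=${ZODIAC_ENGINE_AUTOPILOT_TRAFFIC} standard=${ZODIAC_ENGINE_STANDARD_TRAFFIC}"
```

The pipeline then renders these two variables into the VirtualService template with envsubst and applies the result to the primary cluster.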

Anthos Service Mesh Dashboard

Before the canary: 100% of traffic was pointed to the primary GKE Autopilot cluster


Splitting 95% at GKE Autopilot and 5% at GKE Standard


Splitting 75% at GKE Autopilot and 25% at GKE Standard


The following are some examples of how downstream service traffic stays pinned to the same cluster, while traffic is split equally for inter-service communications.


Stage 4: Rollback Canary Setup

Point 100% Incoming traffic to primary GKE Autopilot Cluster

  • Before destroying the setup mesh and cluster, it is crucial to ensure that 100% of the traffic is forwarded to the primary cluster only. This can be achieved by modifying the traffic splitting configuration or destination rules to direct all traffic to the primary cluster. By doing so, you can avoid any disruptions or potential data loss during the process of destroying the secondary cluster and dismantling the multi-cluster mesh setup.

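Assuming the same variables and template as before, forcing 100% of traffic back to the primary is a one-line change, sketched as:

```shell
export ZODIAC_ENGINE_AUTOPILOT_TRAFFIC=100
export ZODIAC_ENGINE_STANDARD_TRAFFIC=0
envsubst < manifests/zodiac/virtual-service.yaml \
  | kubectl apply --context="${CTX_AUTOPILOT}" -f -
```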

Cleaning up the Endpoint Discovery

This is a critical step when rolling back a canary setup. It ensures that cross-cluster communication between the canary and primary deployments no longer occurs. This step is necessary to revert to the previous stable state and prevent any unintended interactions between the secondary cluster and the primary GKE Autopilot cluster. By cleaning up the endpoint discovery, you effectively isolate the clusters and restore the normal communication patterns within each cluster.

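The remote secrets created by istioctl carry the istio/multiCluster label, so cleaning them up can be sketched as:

```shell
kubectl delete secret -n istio-system -l istio/multiCluster=true --context="${CTX_AUTOPILOT}"
kubectl delete secret -n istio-system -l istio/multiCluster=true --context="${CTX_STANDARD}"
```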

Delete Firewall Rule

To revert cross-cluster communication, you need to delete the firewall rule that was created earlier. This rule was responsible for allowing traffic between the clusters; by deleting it, you effectively disable cross-cluster communication.

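Assuming the rule name used in the earlier sketch, the deletion is:

```shell
gcloud compute firewall-rules delete asm-multicluster-pods --quiet
```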

Unregister Secondary GKE Cluster from Fleet

  • To delete Anthos Service Mesh from the cluster, you need to unregister the GKE fleet membership. This ensures that if you later create a cluster with the same name, it won't conflict with the existing fleet membership.
  • By unregistering the fleet membership, you remove the cluster's association with the fleet, allowing you to safely delete the cluster. This ensures a clean removal and avoids any potential conflicts or inconsistencies in the fleet configuration.

Following these steps will ensure that Anthos Service Mesh is fully removed and the cluster is unregistered from the fleet. Then, you can safely proceed with deleting the secondary GKE Standard cluster.

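A sketch of the teardown; membership, cluster, region, and project names are placeholders, and depending on your gcloud version the command group may be container hub rather than container fleet:

```shell
# Remove the secondary cluster from the fleet...
gcloud container fleet memberships unregister STANDARD_CLUSTER \
  --gke-cluster=REGION/STANDARD_CLUSTER --project PROJECT_ID

# ...then delete the cluster itself.
gcloud container clusters delete STANDARD_CLUSTER \
  --region REGION --project PROJECT_ID --quiet
```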

Conclusion

  • On-demand canary setup with multiple clusters allows for fine-grained traffic management and control.
  • Traffic can be split across multiple Kubernetes clusters to reduce risk and improve reliability.
  • Controlled rollouts can be achieved by directing traffic to specific clusters for testing and validation.
  • This setup enables gathering feedback, identifying issues, and making improvements before deploying changes to the entire user base.
  • Multiple clusters provide increased resilience and fault tolerance.
  • If issues arise in one cluster, traffic can be redirected to other clusters seamlessly.
  • Scaling resources on-demand ensures optimal performance and accommodates varying workloads.
  • Efficient resource allocation and handling increased traffic are possible without compromising reliability or user experience.
  • Overall, the on-demand canary setup with multiple clusters enhances traffic management capabilities and supports risk reduction, improved reliability, and controlled rollouts for the Zodiac platform.

The fourth and final part of this series, on the key facepalm moments in our journey, is captured here.



