Kubernetes on Azure: Optimise them all


It's Tech Wednesday again, and today, let me put on my cloud engineer hat and talk about Azure and Kubernetes.

Data solutions are growing by the day: Data Mesh, Data Hub, Data Lake, Data and Lore, well, you name it, and with K8s it's a marriage made in heaven. Combine containerised data pipelines with microservices and distributed resources and you get a lot of flexibility and options, but (there is always a catch) you also gain a problem tied to "resource efficiency". And that is exactly the topic for today: how to get the most out of my Azure K8s deployment without robbing a bank or selling a kidney to pay Mr Microsoft's bill.

Resource Requests and Limits

Kubernetes allocates CPU and memory to pods based on "requests" (a.k.a. guaranteed resources) and "limits" (maximum resources). For data workloads (especially heavy ones), these two factors directly impact both cost and performance. Let me make this clear.

  • Requests: How much of a resource a pod "expects" to use (and is guaranteed at scheduling time)
  • Limits: The maximum amount of a resource the pod can "actually" use.

So, what can we do with this? Simple... or is it? Let me show you an example:
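Below is a minimal sketch of what such a template could look like (the pod name and container image are placeholders, and the limit values are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: data-pipeline-worker                            # illustrative name
    spec:
      containers:
        - name: worker
          image: myregistry.azurecr.io/pipeline-worker:1.0  # placeholder image
          resources:
            requests:
              cpu: "1"        # 1 vCPU guaranteed at scheduling time
              memory: "2Gi"   # 2 GiB guaranteed
            limits:
              cpu: "2"        # illustrative cap; pick what your budget allows
              memory: "4Gi"   # illustrative cap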


So, in this YAML template, we set a request of 1 vCPU and 2 GiB of memory, which ensures our workload can be scheduled. We also set a maximum amount of vCPU and memory; that way, if our workload needs extra resources, it does not run unchecked and start consuming more than we can pay for.

As an additional tip, use Azure Monitor or the kubectl top pods command to monitor actual resource usage, and adjust requests and limits based on that information. This will help you avoid overprovisioning resources and draining your budget.
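A quick example (the namespace is hypothetical; kubectl top relies on the metrics server, which AKS deploys by default):

    # Current CPU/memory usage per pod in a namespace
    kubectl top pods -n data-pipelines

    # Break the usage down per container inside each pod
    kubectl top pods -n data-pipelines --containers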

Node Pools and Node Sizing in AKS

AKS supports multiple node pools, which means you can create pools with different VM sizes and assign specific workloads to the nodes that best match their resource needs. In other words, we can have a node pool strategy in place that may look like this:

  1. Small Nodes for lightweight services (e.g. API gateways)
  2. Large Nodes for data processing pods that require heavy CPU or Memory
  3. Spot Nodes for fault-tolerant batch jobs so we can save costs

This is an example of creating a Large Node Pool on AKS.
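A sketch of the command, assuming an existing cluster called myAKSCluster in resource group myResourceGroup (names, node count and VM size are illustrative):

    az aks nodepool add `
      --resource-group myResourceGroup `
      --cluster-name myAKSCluster `
      --name largepool `
      --node-count 3 `
      --node-vm-size Standard_E8s_v5 `
      --labels workload=dataprocessing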


Once the node pool is added, we can use a nodeSelector to schedule our resource-intensive pods onto those nodes.
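A minimal sketch, assuming we attached the label workload=dataprocessing to the pool as above (AKS also labels every node with agentpool=<pool name>, which works just as well):

    apiVersion: v1
    kind: Pod
    metadata:
      name: spark-executor                                  # illustrative name
    spec:
      nodeSelector:
        workload: dataprocessing   # matches the label on the large node pool
      containers:
        - name: executor
          image: myregistry.azurecr.io/spark-executor:1.0   # placeholder image
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"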


This way, data-heavy workloads will not be starved of resources on smaller nodes. If we go for spot nodes for those batch jobs, we can run them without having to worry about continuity or sequencing, which can save a lot of money. But don't worry; we will cover this in a minute.

Autoscale Smart, not Hard

Two friends can help us with autoscaling: Horizontal Pod Autoscaler and Vertical Pod Autoscaler (HPA and VPA, respectively). What they do is simple:

  • HPA: scales pods horizontally (adds or removes replicas) based on CPU or memory metrics
  • VPA: adjusts our requests/limits (yes, the ones from the first part of this article) based on observed usage.

Now, let's return to our deployment and learn how to enable them; the manifest for the Horizontal Pod Autoscaler should look like this.
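A minimal sketch using the autoscaling/v2 API, targeting a hypothetical Deployment called data-api and scaling on average CPU utilisation:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: data-api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: data-api          # hypothetical deployment name
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # scale out when average CPU passes 70%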

And the Vertical Pod Autoscaler manifest should look like this one.
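A minimal sketch, assuming the VPA components are enabled on the cluster (AKS offers VPA as an add-on):

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: data-api-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: data-api          # hypothetical deployment name
      updatePolicy:
        updateMode: "Auto"      # let VPA apply its recommended requests/limits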


Why both? As we said above, HPA will handle fluctuating workloads, while with VPA, we can fine-tune our resource requests over time.

SPOT those cost-reduction opportunities

We mentioned this when we were talking about multiple node pools and node strategy, and we will look at it in detail here. If you have fault-tolerant batch processing workloads (and if you are doing data analytics or modelling, you will have many), you can take advantage of Azure Spot VMs, which let you use idle Azure capacity at a "really" low price compared to regular pricing. The catch is that your spot instances are evicted when that capacity stops being idle because other customers are provisioning dedicated resources. Don't worry: if those jobs are fault-tolerant, you can stop the task and resume it when capacity becomes available again. It is an easy trick, and this is how we do it on K8s.

First, we will add a Spot node pool to our cluster, so let's go to PowerShell.
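A sketch of the command (cluster, pool and VM size are illustrative; --spot-max-price -1 means we pay up to the regular on-demand price and are never evicted for price reasons):

    az aks nodepool add `
      --resource-group myResourceGroup `
      --cluster-name myAKSCluster `
      --name spotpool `
      --priority Spot `
      --eviction-policy Delete `
      --spot-max-price -1 `
      --node-vm-size Standard_D4s_v5 `
      --node-count 3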


Let's use a nodeSelector and tolerations to finish the job.
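A minimal sketch of a batch Job pinned to the spot pool; AKS automatically taints spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule and labels them with the same key, so we tolerate the taint and select the label (the job name and image are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nightly-batch                                   # illustrative name
    spec:
      template:
        spec:
          nodeSelector:
            kubernetes.azure.com/scalesetpriority: spot
          tolerations:
            - key: "kubernetes.azure.com/scalesetpriority"
              operator: "Equal"
              value: "spot"
              effect: "NoSchedule"
          containers:
            - name: batch-runner
              image: myregistry.azurecr.io/batch-runner:1.0   # placeholder image
          restartPolicy: Never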


Now we have our batch jobs running on super-cheap spot nodes, along with the eternal gratitude of the CFO, even though he never knew our names or that we existed at all.

Introducing "The Blob", now in your local cluster

Now, this is important. If you are going for a heavy data workload, that means high read/write operations all the time, so please trust me on this: you always want to use the right storage class if you are looking to optimise your efficiency.

High-IOPS workloads (like databases) should go to Premium Managed Disks (a.k.a. premium storage), while those large chunks of data you don't access every day should go to Blob Storage (Azure's version of object storage, comparable to S3).
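For the high-IOPS path, AKS ships with built-in storage classes; here is a minimal sketch of a PersistentVolumeClaim using the built-in managed-csi-premium class (claim name and size are illustrative):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: db-data
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: managed-csi-premium   # built-in AKS class backed by Premium SSD
      resources:
        requests:
          storage: 256Gi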

How do we do the Blob part? Well, let's go to our manifests again.
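A minimal sketch, assuming the Azure Blob CSI driver is enabled on the cluster (it is an optional AKS capability); the storage class provisions a blob container mounted over NFS 3.0, and the names, SKU and sizes are illustrative:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: blob-nfs
    provisioner: blob.csi.azure.com
    parameters:
      protocol: nfs             # mount the blob container over NFS 3.0
      skuName: Premium_LRS      # illustrative; standard SKUs also work
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: dataset-store
    spec:
      accessModes:
        - ReadWriteMany         # blob volumes can be shared across pods
      storageClassName: blob-nfs
      resources:
        requests:
          storage: 1Ti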


And why NFS-based Blob storage? Well, because we assume we are dealing with large datasets, and NFS access improves access times for those really huge data files.

Finally, here are two pieces of advice. First, don't forget to put the WHERE in the DELETE statement (a classic). Second, optimisation is not fire-and-forget; you must constantly be prepared to fine-tune your deployment to squeeze the maximum out of it at every moment. For that, just remember this: "It's all about monitoring", and Azure has some free tools to help you do it: Azure Monitor to chase those cluster-wide metrics, Azure Cost Management to check how well (or not) you are doing at keeping costs at bay, and, if you want real-time insights, the Kubernetes Dashboard that you can enable in Azure Stack Hub.


And that's all, folks; see you next Wednesday, when we will have more sessions on how small tricks can lead to big savings on your journey to master cloud engineering concepts.

Stay curious, always take an opportunity to learn, and see you in our next edition.



