Optimize Kubernetes Microservices
Saral Saxena
11K+ Followers | LinkedIn Top Voice | Associate Director | 14+ Years in Java, Microservices, Kafka, Spring Boot, Cloud Technologies (AWS, GCP) | Agile, K8s, DevOps & CI/CD Expert
In this article, we’ll explore how to stabilize your Kubernetes microservices with the correct resource settings, avoiding issues like resource over-committing and node crashes.
Prerequisite knowledge: a basic understanding of how Kubernetes schedules pods and allocates resources (covered briefly below).
How did everything start?
I was on call one day when my phone rang, with the support team on the other end.
“Is there a problem?” I asked.
“Yes, we see that the services are not available, and we need your help.”
And so began my journey to understand why our system was suffering from unavailable pods.
OK, I took a deep breath and dove into the incident.
So what do I need to do first?
First, I checked our Grafana dashboard to identify which service was unavailable.
So I shouted out loud in the developers’ room: “Is anyone deploying something today?”
There was a silence in the room, indicating that no one had been deploying anything.
So what was happening?
To further investigate the issue, I accessed the dashboard that monitors the CPU and memory usage of the affected service.
Upon checking the CPU and memory usage of the pods, I was surprised to find that there were no spikes or abnormalities.
To narrow down the root cause, I next checked the dashboard that monitors the load on the nodes in the cluster.
Upon checking the node load, I noticed a spike indicating that the node was unable to provide the resources the pods were requesting, causing it to crash. This suggested that the issue was related to resource constraints on the node.
To effectively troubleshoot issues with unavailable pods in a Kubernetes cluster, it can be helpful to have a basic understanding of how Kubernetes handles resource allocation.
In Kubernetes, you can define the resources that a pod requires in the deployment file using the resources block. This block consists of two fields: requests and limits.
The requests field specifies the minimum amount of resources that a pod needs to function properly. When scheduling a pod, Kubernetes will only place it on a node that has enough unallocated resources to fulfill the requests defined in the deployment file.
The limits field specifies the maximum amount of resources that a pod is allowed to use. If a pod attempts to exceed its limits, it may be terminated (for memory) or throttled (for CPU) to prevent it from consuming too many resources.
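To make this concrete, here is a minimal sketch of a deployment manifest with a resources block (the service name, image, and values are illustrative placeholders, not the ones from the incident):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # hypothetical service name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example/app:latest   # placeholder image
          resources:
            requests:                 # minimum guaranteed; used by the scheduler
              cpu: "500m"             # 500 millicores = half a CPU core
              memory: "1000Mi"
            limits:                   # hard ceiling for this container
              cpu: "1000m"            # exceeding this gets the container throttled
              memory: "2000Mi"        # exceeding this gets the container OOM-killed
```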
What is the problem?
By defining low resource requests and high resource limits, the pods were allowed to consume far more resources than the scheduler had accounted for, over-committing the node and ultimately making the service unavailable.
Why did the Node crash?
Imagine that you have a Kubernetes cluster with 1 node; the node has 4000m of CPU (4 cores) and 8000Mi of memory.
Now, imagine that you have a deployment that consists of 4 pods, each with the following resource limits and requests defined in the deployment file:
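The original snippet is not preserved here, but reconstructing it from the numbers used in the calculation below (800m/1500Mi requests and a 2000m CPU limit; the memory limit is an assumed illustrative value), each pod’s resources block would look roughly like this:

```yaml
resources:
  requests:
    cpu: "800m"        # 800m x 4 pods = 3200m requested on the node
    memory: "1500Mi"   # 1500Mi x 4 pods = 6000Mi requested on the node
  limits:
    cpu: "2000m"       # any single pod may burst up to 2000m
    memory: "3000Mi"   # assumed value; the article only specifies the CPU limit
```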
Now, let’s say that all 4 pods are scheduled onto the same node.
This means that the node would have to provide 3200m of CPU (800m × 4 pods) and 6000Mi of memory (1500Mi × 4 pods) to cover the pods’ requests.
However, the node only has a capacity of 4000m of CPU and 8000Mi of memory. If even one of the pods bursts to its full 2000m CPU limit while the other three use their 800m requests, the pods together need 4400m (3 × 800m + 2000m), which is more than the node can provide. The node becomes overloaded and unable to provide the resources the pods need to run effectively, which may result in an outage for the service the pods provide.
In this case, Kubernetes will attempt to reschedule the pods onto other nodes in the cluster. However, if all of the nodes in the cluster are already at capacity and there are no available resources to accommodate the pods, it may take some time for Kubernetes to create a new node and reschedule the pods onto it.
In some cases, it may take hours for Kubernetes to create a new node and reschedule the pods, which can lead to extended outages for the service.
How does Kubernetes handle over-commitment?
Kubernetes is designed to schedule pods onto nodes based on the resource limits and requests defined in the deployment file. If a pod has a resource request that is higher than the capacity of the node, Kubernetes will not schedule the pod onto the node.
However, Kubernetes does not actively monitor the resource usage of pods and nodes or automatically adjust resource limits and requests to avoid over-commitment. Instead, it is up to the administrator to set appropriate resource limits and requests based on the needs of the pods and the capacity of the nodes.
If a pod becomes unavailable due to resource over-commitment, Kubernetes will not automatically remove the pod from the node. Instead, it is up to the administrator to identify the cause of the issue and take steps to resolve it, such as adjusting the resource limits and requests for the pod or adding more nodes to the cluster to increase capacity.
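The article does not cover it, but one guardrail Kubernetes does give administrators is the LimitRange object, which applies default requests and limits to any container in a namespace that omits them (the values below are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: production        # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container declares no requests
        cpu: "800m"
        memory: "1500Mi"
      default:                 # applied when a container declares no limits
        cpu: "1000m"
        memory: "2000Mi"
```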
How can we fix it?
To avoid this issue, it is important to carefully consider the resource needs of your pods and to set requests and limits that do not exceed the capacity of the nodes in your cluster.
To address the issue described above, there are two possible solutions (both sketched after this list):
1. Decrease the resource limits of the pods, for example to 1000m of CPU and 2000Mi of memory per pod. These limits would be within the capacity of the node, as the 4 pods together would then require at most 4000m of CPU and 8000Mi of memory.
2. Increase the resource requests of the pods to be equal to the limits values. With requests raised to match the 2000m CPU limit, the 4 pods now request 8000m of CPU in total, which no single 4000m node can satisfy. From this calculation, we see that this solution will require more nodes for our system. How many nodes? nodes = ceil(total requested CPU / (node CPU - 1000m)) = ceil(8000m / 3000m) = 3.
Why is it important to subtract 1000m?
By subtracting 1000m, you are accounting for the resources needed by the node’s own processes (the kubelet, the operating system, and other system daemons), which helps ensure that there are sufficient resources left for the pods to function correctly.
So our system will be required to scale up to 3 nodes.
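Here is a minimal sketch of what the two options could look like in each pod’s resources block (the memory values for option 2 are assumptions, since the article only fixes the CPU numbers):

```yaml
# Option 1: decrease the limits so all 4 pods fit one node
resources:
  requests:
    cpu: "800m"
    memory: "1500Mi"
  limits:
    cpu: "1000m"       # 1000m x 4 pods = 4000m = the node's CPU capacity
    memory: "2000Mi"   # 2000Mi x 4 pods = 8000Mi = the node's memory capacity
---
# Option 2: increase the requests to equal the limits
resources:
  requests:
    cpu: "2000m"       # 2000m x 4 pods = 8000m -> requires 3 nodes (see the calculation above)
    memory: "3000Mi"   # assumed, matching the assumed memory limit
  limits:
    cpu: "2000m"
    memory: "3000Mi"
```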
With either of these solutions, the pods will be able to run successfully on the nodes without causing them to crash.
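A side note on the 1000m we subtracted: that headroom is what the kubelet calls reserved resources. The article doesn’t go into it, but on clusters where you control the kubelet configuration, the reservation can be made explicit rather than estimated (the values here are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:          # set aside for Kubernetes system daemons (kubelet, container runtime)
  cpu: "500m"
  memory: "500Mi"
systemReserved:        # set aside for OS system daemons (systemd, sshd, ...)
  cpu: "500m"
  memory: "500Mi"
```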
There are a few ways to estimate appropriate values for resource requests and limits; the most practical is to monitor the actual CPU and memory usage of your pods over time (with tools like Grafana and Prometheus, discussed in the summary below) and size the requests and limits from that observed usage.
Before making any changes to the resource requests and limits for your pods:
1. It is important to consider the potential impact on the number of nodes in your cluster. As I mentioned, increasing resource requests can lead to the need for more nodes in the cluster to accommodate the additional resource demands.
To calculate the number of nodes you will need after making changes to the resource requests and limits, you can use the formula described above.
2. Once you have calculated the number of nodes needed, it is important to check your IP range to ensure that there is sufficient space for new nodes to scale up as needed. You may also need to update your node group to accommodate the additional nodes.
By carefully considering the resource needs of your pods and making appropriate changes to the resource requests and limits, you can help to ensure that your cluster can meet the resource demands of your workloads and avoid issues like resource over-committing and node crashes.
Summary:
Monitoring the resource usage of your pods and nodes is also critical to identifying any potential issues early on and making adjustments as needed. Tools like Grafana and Prometheus can be incredibly helpful in this regard, as they allow you to view real-time and historical data on resource usage and identify any trends or anomalies that may be causing problems.
Overall, I believe that setting appropriate resource limits and requests is an essential aspect of effectively managing and operating a Kubernetes cluster in production. By taking the time to understand the resource needs of your microservices and setting appropriate limits and requests, you can help to ensure that your applications are stable, reliable, and performant.