Kubernetes Scheduler in Detail: Important Aspects. Part 1
Kirill Kazakov
Senior Cloud DevOps Engineer | Kubernetes Certified (CKA) | AWS Certified Solutions Architect | Author | Speaker
This piece was originally published on Habr.
Recently, on one of the YouTube channels, I took a detailed look at how the Kubernetes Scheduler works. While preparing the material, I came across many new and interesting facts that I would like to share with you. In this article, we will look at what exactly happens “under the hood” of the Kubernetes Scheduler and which aspects are important for understanding how it works.
I plan to go from simple to complex, so bear with me. If you are already familiar with the basic concepts, feel free to skip the introduction and go straight to the key details.
If you ask an ordinary developer how they would implement the k8s scheduler, the answer will most likely look something like this:
while True:
    pods = get_all_pods()
    for pod in pods:
        if pod.node is None:  # the pod has not been assigned to a node yet
            assign_node(pod)
But this article wouldn't exist if everything were that simple.
What is Kubernetes Scheduler and what tasks does it solve?
The Kubernetes Scheduler is responsible for distributing Pods to worker nodes in a cluster. The main task of the scheduler is to optimize the placement of pods, taking into account the available resources on the nodes, the requirements of each pod, and various other factors.
If you ask me to describe the functions of the Kubernetes Scheduler in a nutshell, I would highlight two key tasks: finding the most suitable node for each pod, and binding the pod to that node.
Where is Kubernetes Scheduler located in the Kubernetes architecture?
If you try to represent the sequence of actions that occur when a pod is created, you get the scheme shown in the image below.
How does Kubernetes Scheduler work in a basic sense?
All of these steps are called "extension points" (also known as plugins), and they allow you to extend the scheduler's functionality. They are implemented by the Scheduler Framework, which we will discuss in Part 2 of this article. For example, you can add new filters or ranking algorithms to meet the specific requirements of your application. In fact, there are many more plugins, and we'll come back to them in the next part.
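To give a feel for how these extension points surface in configuration, here is a rough example of a KubeSchedulerConfiguration that changes the weight of one score plugin. Treat it as a sketch: the exact apiVersion depends on your Kubernetes version (v1 in recent releases, v1beta3 in older ones), and the weight value here is arbitrary.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          # give more influence to nodes that already have the container image
          - name: ImageLocality
            weight: 2

Such a file is passed to kube-scheduler via its --config flag.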
The process repeats: The scheduler continues to monitor the cluster for the next pod that needs to be placed.
A simplified view of the scheduling process is shown in the figure below. We'll cover this process in more detail in Part 2, but for now, let's take a look at how the scheduler works in a basic sense.
In the k8s documentation, this process has the same structure but is shown in a more general form: an event handler is displayed instead of an informer. Informers use event handlers to trigger specific actions when a change is detected in the cluster. For example, if a new pod is created and needs to be scheduled, the informer's event handler activates the scheduling algorithm for that particular pod.
Informer: The Kubernetes scheduler actively uses a mechanism called an "informer" to monitor the state of the cluster. An informer is a set of controllers that continuously watch certain resources (stored in etcd and accessed through the API server). When changes are detected, the information is updated in the scheduler's internal cache. This cache helps optimize resource consumption and provides up-to-date data about nodes, pods, and other cluster objects.
Schedule Pipeline: The scheduling process in Kubernetes starts with new pods being added to a queue. This is done by the informer component. Pods are then taken from this queue and passed through the so-called "Schedule Pipeline": a chain of steps and checks, after which the pod is finally placed on a suitable node.
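As a rough illustration of this flow (not the real client-go API; all names here are made up), an informer's event handler drops unscheduled pods into a queue, and the scheduling loop consumes them one by one:

import queue

scheduling_queue = queue.Queue()

def on_pod_added(pod):
    # Event handler registered on the pod informer: only pods that
    # have no node assigned yet need to go through the scheduler.
    if pod.get("nodeName") is None:
        scheduling_queue.put(pod)

def scheduling_loop(schedule_pipeline):
    while True:
        pod = scheduling_queue.get()   # blocks until the informer adds a pod
        schedule_pipeline(pod)         # filter, score, reserve, bind, ...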
The Schedule Pipeline is divided into three threads.
The Schedule Pipeline also uses a cache to store data about pods.
Important aspect: pods go through the main scheduling thread strictly one at a time, synchronously. This restriction was introduced to avoid a situation where several pods try to claim the same resources on a node. All other threads can run asynchronously.
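Here is a hedged Python sketch of that idea (the function names are invented for illustration): the scheduling cycle runs serially in one thread, while binding is handed off to a separate thread so the next pod can be processed immediately.

import threading

def run_scheduler(scheduling_queue, schedule_cycle, bind_cycle):
    while True:
        pod = scheduling_queue.get()

        # The scheduling cycle (filter, score, reserve, permit) is serial:
        # exactly one pod goes through it at a time, so two pods can never
        # reserve the same resources on a node at the same moment.
        node = schedule_cycle(pod)
        if node is None:
            scheduling_queue.put(pod)   # no suitable node yet; retry later
            continue

        # Binding talks to the API server and is comparatively slow,
        # so it runs asynchronously in its own thread.
        threading.Thread(target=bind_cycle, args=(pod, node)).start()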
Let's move on to practice and see all of this for ourselves.
1. Create a new pod
To see the scheduler in action, let's create a new pod using the kubectl apply command. We will create the pod through a Deployment.
It is important to note that the scheduler works only with pods; the state of the Deployment and ReplicaSet is handled by their controllers.
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
2. The controller creates pods
In fact, we will create a Deployment, which in turn will create a ReplicaSet, which in turn will create the pods.
The controllers responsible for the state of Deployments and ReplicaSets see the corresponding new objects and start their work.
The Deployment controller will see the Deployment above and create a ReplicaSet object similar to this one:
ReplicaSet:
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: ReplicaSet
  metadata:
    annotations:
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "1"
    labels:
      app: nginx
    name: nginx-deployment-85996f8dbd
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: nginx-deployment
      uid: b8a1b12e-94fc-4472-a14d-7b3e2681e119
    resourceVersion: "127556139"
    uid: 8140214d-204d-47c4-9538-aff317507dd2
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: nginx
        pod-template-hash: 85996f8dbd
    template:
      metadata:
        labels:
          app: nginx
          pod-template-hash: 85996f8dbd
      spec:
        containers:
        - image: nginx:1.14.2
          imagePullPolicy: IfNotPresent
          name: nginx
          ports:
          - containerPort: 80
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    availableReplicas: 3
    fullyLabeledReplicas: 3
    observedGeneration: 1
    readyReplicas: 3
    replicas: 3
kind: List
As a result of the work of the ReplicaSet controller, 3 pods will be created. They get the Pending status because the scheduler has not assigned them to nodes yet. These pods are added to the scheduler's queue.
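If you want to catch this state yourself, you can list the pods that are still waiting for a node (the output will depend on how quickly your scheduler works):

kubectl get pods --field-selector=status.phase=Pending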
3. The scheduler steps in
This is how it looks in our diagram.
Each pod in the scheduler queue is fetched in turn and:
Scheduling phase
I may repeat myself a little, but here is a bit more detail about the pipeline itself (a small code sketch follows the list below).
Filter - we filter out the nodes that cannot run the pod.
Score - we rank the remaining nodes. If there is more than one candidate, we need to somehow choose the most suitable node rather than pick one at random. This is where various plugins come into play. For example, the ImageLocality plugin prefers a node that already has the container image we want to run, which saves the time needed to pull the image from the container registry.
Reserve - we reserve resources on the node for the pod, so that the node we chose is not taken by another pod in the next scheduling cycle.
Un-Reserve - if something goes wrong at any stage, we call this method to free the reserved resources and send the pod back to the scheduler queue.
Permit - we check that the pod may be run on the node. If all the previous steps were successful, we perform a final check. For example, if we have an affinity rule that says the pod must run on a node with a specific label, we verify that the chosen node matches this rule. If everything is fine, we return approve; otherwise, deny.
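Putting these steps together, here is a hedged Python sketch of the scheduling phase. The node and pod structures, the helper functions, and the label check in the Permit step are all invented for illustration and do not mirror the real Scheduler Framework API.

def fits_resources(pod, node):
    # Filter check: does the node have enough free CPU and memory?
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def score_node(pod, node):
    # Score: prefer nodes that already have the pod's image pulled
    # (the idea behind the ImageLocality plugin).
    return 100 if pod["image"] in node["images"] else 0

def schedule_one(pod, nodes):
    # Filter: drop the nodes that cannot run the pod at all.
    feasible = [n for n in nodes if fits_resources(pod, n)]
    if not feasible:
        return None  # the pod stays Pending and will be retried later

    # Score: rank the remaining nodes and pick the best one, not a random one.
    best = max(feasible, key=lambda n: score_node(pod, n))

    # Reserve: mark the resources as taken so the next pod in the queue
    # cannot grab the same capacity on this node.
    best["free_cpu"] -= pod["cpu"]
    best["free_mem"] -= pod["mem"]

    # Permit: the last check before binding, e.g. an affinity rule.
    if pod.get("required_label") and pod["required_label"] not in best["labels"]:
        # Un-Reserve: roll back the reservation; the pod goes back to the queue.
        best["free_cpu"] += pod["cpu"]
        best["free_mem"] += pod["mem"]
        return None

    return best

nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4, "images": ["nginx:1.14.2"], "labels": []},
    {"name": "node-b", "free_cpu": 4, "free_mem": 8, "images": [], "labels": []},
]
pod = {"name": "nginx-1", "cpu": 1, "mem": 1, "image": "nginx:1.14.2"}
print(schedule_one(pod, nodes)["name"])  # node-a: the image is already there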
Binding phase
In this phase, we perform additional steps before the pod is finally bound to the node, the binding itself, and the necessary steps after binding. For more details about this phase, see Part 2.
It is important to note that this phase runs asynchronously.
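A hedged sketch of what this asynchronous part might look like; the api object and its methods are hypothetical stand-ins for calls to the API server:

def bind_cycle(pod, node, api):
    # Runs in its own thread, one per pod, while the scheduling thread
    # is already processing the next pod in the queue.

    # Pre-bind: prepare everything the pod needs on the node first,
    # for example waiting for a persistent volume to be attached.
    if not api.prepare(pod, node):
        api.unreserve(pod, node)   # roll back and send the pod back to the queue
        return

    # Bind: ask the API server to set pod.spec.nodeName to the chosen node.
    # From this point on, kubelet on that node notices the pod and starts it.
    api.bind(pod, node)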
Kubelet: launching a container on the node
As soon as the pod has been bound to the node, kubelet sees this change and starts launching the container on that node. Once again, kubelet is not part of the scheduler.
The pod is running on the most suitable node, and we can see this in the output of the kubectl get pods command. This means that the scheduler has done its job.
kubectl get pods -o wide
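You can also look at the scheduling decision itself: the Events section at the end of the kubectl describe output normally contains a Scheduled event from default-scheduler saying which node the pod was assigned to (replace the placeholder with a real pod name):

kubectl describe pod <pod-name>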
This is what the Schedule Pipeline looks like in a simplified form, and we will look at it in detail in Part 2.
In the next part, we'll dig deeper and learn more about the scheduler's internals. In particular, we will:
There will be a link to Part 2 here.