Kubernetes Scheduler in Detail: Important Aspects. Part 1
What is Kubernetes Scheduler and what tasks does it solve?

This article of mine was first published on Habr.

Recently, on one of the YouTube channels, I took a detailed look at how the Kubernetes Scheduler works. While preparing the material, I came across a lot of new and interesting facts that I would like to share with you. In this article, we will look at what exactly happens “under the hood” of the Kubernetes Scheduler and which aspects are important for understanding how it works.

I plan to go from simple to complex, so please bear with me. If you are already familiar with the basic concepts, feel free to skip the introduction and go straight to the key details.

If you ask an ordinary developer how they would implement the k8s scheduler, the answer will most likely be something in this style:

while True:
    pods = get_all_pods()
    for pod in pods:
        if pod.node is None:
            assign_node(pod)
# But this article wouldn't exist if everything were that simple.

What is Kubernetes Scheduler and what tasks does it solve?

The Kubernetes Scheduler is responsible for distributing Pods to worker nodes in a cluster. The main task of the scheduler is to optimize the placement of pods, taking into account the available resources on the nodes, the requirements of each pod, and various other factors.

If you asked me to describe the functions of the Kubernetes Scheduler in a nutshell, I would highlight two key tasks:

  1. Selecting a suitable node for the pod. In this process, the scheduler analyzes the cluster to make sure the pod can run effectively on the selected node.
  2. Binding the pod to the selected node (a sketch of the object involved is shown right after this list).
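
Under the hood, the second task comes down to creating a Binding object for the pod through the API server (the pods/binding subresource). Below is a minimal sketch of what the scheduler effectively submits; the pod and node names are placeholders, not taken from a real cluster:

apiVersion: v1
kind: Binding
metadata:
  name: nginx-deployment-85996f8dbd-abcde   # the pod being scheduled (placeholder)
  namespace: default
target:
  apiVersion: v1
  kind: Node
  name: worker-node-1                       # the node chosen by the scheduler (placeholder)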

Where is Kubernetes Scheduler located in the Kubernetes architecture?

If you try to represent the sequence of actions that occur when a pod is created, you get the following diagram:

Kubernetes Components

The image below shows the sequence of actions that occur when a pod is created.

Pod creation flow

  1. A Pod is created by the controller responsible for the state of the Deployment and ReplicaSet, or directly through the API (for example, via kubectl apply).
  2. The scheduler picks up the new pod.
  3. Kubelet (which is not part of the scheduler) creates and runs the Pod's containers on the worker node.
  4. Kubelet cleans up data about the Pod after it is deleted.

How does the Kubernetes Scheduler work at a basic level?

  1. Information Collection: The scheduler constantly monitors the cluster status, collecting data about available nodes, their resources (CPU, memory, and so on), current pod placement (Pods), and their requirements.
  2. Candidate identification: As soon as a new pod appears that needs to be placed, the scheduler initiates the process of selecting a suitable worker node. The first step is to build a list of all available nodes that meet the basic requirements of the pod, such as the processor architecture, the amount of available memory, and so on.
  3. Filtering: Nodes that do not meet the additional requirements and constraints specified in the pod specification are removed from this list. These can include, for example, affinity/anti-affinity rules, taints, and tolerations (see the example pod spec right after this list).
  4. Ranking: After the filtering stage, the scheduler ranks the remaining nodes to select the most suitable one for hosting the pod.
  5. Binding a pod to a node: At this stage, the pod is assigned to the selected worker node, and the corresponding record is added to etcd. After that, kubelet detects the new assignment and initiates the process of creating and launching the containers.
  6. Status Update: After a pod is successfully placed, its location information is updated in etcd (the single source of truth) so that other system components can access it.
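
To make the filtering requirements more concrete, here is a hedged example of a pod spec carrying the kinds of constraints the scheduler evaluates. The label key disktype and the taint key dedicated are made up for illustration; only the YAML structure itself is standard Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: constrained-pod              # hypothetical pod name
spec:
  containers:
    - name: app
      image: nginx:1.14.2
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement checked during filtering
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype        # hypothetical node label
                operator: In
                values: ["ssd"]
  tolerations:
    - key: "dedicated"               # hypothetical taint key
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"           # allows scheduling onto nodes tainted dedicated=batch:NoSchedule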

All these steps are built around “extension points” (also known as plugins) that allow you to extend the scheduler's functionality. They are implemented by the Scheduler Framework, which we will discuss in Part 2 of the article. For example, you can add new filters or ranking algorithms to meet the specific requirements of your application (a sketch of such a configuration is shown below). In fact, there are many more plugins, and we will come back to them in the next part.
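
As an illustration only: which plugins run at each extension point can be tuned through a KubeSchedulerConfiguration file passed to kube-scheduler with the --config flag. The profile below is a minimal sketch, and the particular plugins chosen here are just examples:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: ImageLocality                      # prefer nodes that already have the image
            weight: 2
        disabled:
          - name: NodeResourcesBalancedAllocation    # example of switching a default score plugin off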

The process repeats: The scheduler continues to monitor the cluster for the next pod that needs to be placed.

A simplified view of the scheduling process is shown in the figure below. We will cover this process in more detail in Part 2, but for now, let's look at how the scheduler works at a basic level.

Scheduler algorithm. The diagram is deliberately simplified so as not to clutter it. We will discuss this process in detail in Part 2.

In the k8s documentation, this process has the same structure but is shown in a more general form: an event handler is shown instead of an informer. Informers use event handlers to trigger specific actions when a change is detected in the cluster. For example, if a new pod is created and needs to be scheduled, the informer's event handler activates the scheduling algorithm for that particular pod.

Informer: The Kubernetes scheduler actively uses a mechanism called the “Informer” to monitor the cluster state. The Informer is a set of controllers that continuously watch certain resources through the API server. When changes are detected, the information is updated in the scheduler's internal cache. This cache reduces load on the API server and provides up-to-date data about nodes, pods, and other cluster objects.

Schedule Pipeline: The scheduling process in Kubernetes starts with new pods being added to a queue. This is done by the Informer component. Pods are then taken from this queue and passed through the so-called “Schedule Pipeline”, a chain of steps and checks after which the pod is finally placed on a suitable node.

The Schedule Pipeline is divided into three threads.

  1. Main thread: As you can see in the image, the main thread performs the filtering, scoring, and pre-reserving steps.
    • Filter - unsuitable nodes are filtered out here.
    • Score - this plugin ranks the remaining nodes, i.e. selects the most suitable node for the pod among all those left.
    • Reserve - resources for the pod are pre-reserved on the node here, so that other pods cannot take them (preventing a race condition). This plugin also implements the UnReserve method.
    • UnReserve - a method of the Reserve plugin that releases the resources on the node that were previously reserved for the pod. It is called if the pod has not been bound to the node within a certain time (timeout), or if the Permit plugin has assigned the deny status to the current pod, so that other pods can use those resources.
  2. Permit thread: This phase is used to prevent the pod from hanging in an undefined state. A permit plugin can do one of three things:
    • approve - all previous plugins have confirmed that the pod can run on the node, so the final decision for the pod is approve.
    • deny - one of the previous plugins did not return a positive result, so the final decision for the pod is deny.
    • wait - if a permit plugin returns “wait”, the pod remains in the permit phase until it receives the approve or deny status. If a timeout occurs, “wait” becomes “deny”, the pod returns to the scheduling queue, and the UnReserve method of the Reserve phase is triggered.
  3. Bind thread: This part is responsible for recording that the pod has been bound to the node.
    • PreBind - the steps that need to be performed before binding the pod to the node, for example creating network storage and attaching it to the node (see the StorageClass sketch after this list).
    • Bind - the pod is bound to the node here.
    • PostBind - the very last step, performed after the pod has been bound to the node. It can be used both for cleanup and for additional actions.
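
A common real-world example of the “prepare storage before binding” behavior mentioned in PreBind is delayed volume binding: with a StorageClass whose volumeBindingMode is WaitForFirstConsumer, the volume is provisioned and bound only after the scheduler has picked a node for the pod. A minimal sketch (the class name and provisioner here are just examples):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-local                          # hypothetical class name
provisioner: kubernetes.io/no-provisioner   # example: pre-provisioned local volumes
volumeBindingMode: WaitForFirstConsumer     # bind the volume only once a node has been chosen for the pod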

The Schedule Pipeline also uses Cache to store data about pods.

Important aspects:

  1. In the Main and Permit threads, pods are scheduled exclusively sequentially, one after the other. This means that the scheduler cannot schedule multiple pods simultaneously in the Main thread and Permit thread.
  2. The UnReserve method of the Reserve plugin stands out: it can be called from the Main thread, the Permit thread, or the Bind thread.

This restriction was introduced in order to avoid a situation where several pods are trying to occupy the same resources on the node. All other threads can be executed asynchronously.

Let's move on to practice and get a feel for all of this ourselves

1. Create a new pod

To give the scheduler some work to do, let's create a new pod using the kubectl apply command. We will create the pod through a Deployment.

It is important to note that the scheduler works only with pods; the state of the Deployment and ReplicaSet is handled by controllers.

kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml        

nginx-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80        
Creating a deployment
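
A side note: every pod spec carries a schedulerName field (you can see it in the ReplicaSet dump below, where it defaults to default-scheduler). If you run an additional, custom scheduler, you can point a pod at it explicitly; the scheduler name below is purely hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod           # hypothetical pod
spec:
  schedulerName: my-custom-scheduler   # this pod is ignored by the default scheduler and handled by the named one
  containers:
    - name: nginx
      image: nginx:1.14.2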

2. The controller creates pods

In fact, we will create a Deployment, which in turn will create a ReplicaSet, which in turn will create the pods.

The controller that is responsible for the state of Deployment and ReplicaSet sees the corresponding new objects and starts its work.

The controller will see the Deployment above and create a ReplicaSet object similar to this one:

ReplicaSet:

apiVersion: v1
items:
  - apiVersion: apps/v1
    kind: ReplicaSet
    metadata:
      annotations:
        deployment.kubernetes.io/desired-replicas: "3"
        deployment.kubernetes.io/max-replicas: "4"
        deployment.kubernetes.io/revision: "1"
      labels:
        app: nginx
      name: nginx-deployment-85996f8dbd
      namespace: default
      ownerReferences:
        - apiVersion: apps/v1
          blockOwnerDeletion: true
          controller: true
          kind: Deployment
          name: nginx-deployment
          uid: b8a1b12e-94fc-4472-a14d-7b3e2681e119
      resourceVersion: "127556139"
      uid: 8140214d-204d-47c4-9538-aff317507dd2
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
          pod-template-hash: 85996f8dbd
      template:
        metadata:
          labels:
            app: nginx
            pod-template-hash: 85996f8dbd
        spec:
          containers:
            - image: nginx:1.14.2
              imagePullPolicy: IfNotPresent
              name: nginx
              ports:
                - containerPort: 80
                  protocol: TCP
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    status:
      availableReplicas: 3
      fullyLabeledReplicas: 3
      observedGeneration: 1
      readyReplicas: 3
      replicas: 3
kind: List        

As a result of the work of the controller responsible for the ReplicaSet, three pods will be created. They get the Pending status because the scheduler has not yet assigned them to nodes. These pods are added to the scheduler's queue (see the sketch of such a pod below).
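
At this point, the relevant part of each pod object looks roughly like the fragment below: spec.nodeName is not set yet and the phase is Pending. Once the scheduler binds the pod, nodeName is filled in and kubelet takes over. The pod name is a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-deployment-85996f8dbd-abcde   # placeholder name
spec:
  # nodeName is absent until the scheduler binds the pod to a node
  containers:
    - name: nginx
      image: nginx:1.14.2
status:
  phase: Pending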

3. The scheduler steps in

This is how this step can be shown in our diagram.

Each pod in the scheduler queue is fetched in turn and:

  1. Passes through the Schedule Pipeline, where the most suitable node is selected
  2. Is bound to the selected node

Scheduling phase

I may repeat myself a little, but here's a little more detail about the pipeline itself.


Filter - we filter out unsuitable nodes.

  • For example, if we want to place a pod on a node that has a GPU, we can immediately discard all nodes without a GPU.

  • Next, we remove nodes that don't have enough resources to run the pod. For example, if a pod requires 2 CPUs, but the node has only 1 CPU, then this node is not suitable.

  • And so on. There can be quite a few filtering iterations; we will come back to this in Part 2. (A sketch of a pod with such resource requirements is shown right after this list.)
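
For illustration, here is a hedged sketch of a pod that only nodes with two free CPUs and an available GPU would pass filtering for. The nvidia.com/gpu extended resource is just an example; the exact resource name depends on the device plugin installed in your cluster, and the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                  # hypothetical name
spec:
  containers:
    - name: cuda-app
      image: nginx:1.14.2        # placeholder image
      resources:
        requests:
          cpu: "2"               # nodes with less than 2 CPUs available are filtered out
        limits:
          nvidia.com/gpu: 1      # nodes without a free GPU are filtered out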

Score - we rank the remaining nodes. If there is more than one node left, we need to somehow choose the most suitable one rather than just picking at random. This is where various plugins come into play. For example, the ImageLocality plugin prefers a node that already has the container image we want to run, which saves the time of pulling the image from the container registry.

Reserve - we reserve resources on the node for the pod. So that the resources of our chosen node are not taken away by another pod in a parallel thread, we reserve this node.

Un-Reserve - if something goes wrong at any later stage, this method is called to free up the reserved resources on the node and return the pod to the scheduling queue.

Permit - the final check before binding. If all the previous steps have completed successfully, the plugins at this stage give the pod the approve status; otherwise deny (or wait, as described above). In practice this extension point is used, for example, by co-scheduling plugins that hold a pod until the whole group of related pods can be placed.

Binding phase

In this phase, we perform the additional steps needed before the pod is finally bound to the node, the binding of the pod itself to the node, and the necessary steps after binding. For more details about this phase, see Part 2.

It is important to note that this thread runs asynchronously.

Kubelet - launching the containers on the node

As soon as the pod has been bound to the node, kubelet notices the change and starts launching the pod's containers on that node. Once again, kubelet is not part of the scheduler.

The pod is running on the most suitable node, and we can see this in the output of the kubectl get pods command. This means that the scheduler has done its job.

kubectl get pods -o wide        

This is what the Schedule Pipeline looks like in a simplified form, and we will look at it in detail in Part 2.

In the next part, we'll dig deeper and learn more about the scheduler's internals. In particular, we will:

  • Analyze the Scheduler Framework
  • Learn how to extend the scheduler's functionality
  • Pull back the curtain on the scheduler queue
  • Look at examples of plugins

There will be a link to the 2nd part.
