Exploiting ChatGPT for Troubleshooting Kubernetes Problems

As a provider of Kubernetes-specific observability solutions, Opscruise is always exploring new and innovative ways to troubleshoot problems in Kubernetes. With the recent excitement over the release of ChatGPT, a large-scale language model trained by OpenAI, and its promise to answer questions on a broad range of topics, we decided to put it to the test. In our case, we wanted to see whether it could be used to help identify the root cause of failures in Kubernetes.

If you run Kubernetes in a cloud-native environment, you know firsthand how difficult it can be to manually troubleshoot failures in a Kubernetes cluster (see our eBook on Kubernetes). With so many moving parts and complex interactions between components such as nodes, pods, the scheduler, and controllers, fault isolation is time-consuming and requires in-depth technical knowledge, making it difficult even for experienced Ops/SRE teams to diagnose and resolve Kubernetes failures effectively.

Because ChatGPT is trained on a very large corpus of information, we expected it would have enough acquired knowledge of Kubernetes to help automate and speed up the troubleshooting process, and perhaps make it more effective. At Opscruise, we manage hundreds to thousands of Kubernetes pods for our customers, and Opscruise anomaly detectors collect event logs when alerts occur. By providing Kubernetes event logs as input, we expect ChatGPT to quickly analyze the logs and help determine the root cause of a failure. Used appropriately, we believe ChatGPT can reduce the burden of manually troubleshooting problems in a Kubernetes cluster.

In this blog post, we walk through example Kubernetes problem scenarios, use ChatGPT to analyze the corresponding event logs, and review the resolution steps it proposes. We are excited to share our findings, so without further ado, let's dive into using ChatGPT for Kubernetes troubleshooting.

Scenario: Application Crashing After Deployment

When we deploy an application on Kubernetes, it is common to observe an application crash. In this example, a deployment's application crashed one second after starting. The CrashLoopBackOff status tells us that Kubernetes is trying to launch the pod, but one or more of its containers is crashing. To create this test scenario, we deploy an application that crashes, then give the observed output to ChatGPT in a chat for troubleshooting and resolution, as described below.

A new deployment is created with an application that crashes after 1 second:

$ kubectl run crasher --image=rosskukulinski/crashing-app        

If we then observe the status of our Pods, we see the CrashLoopBackOff error that indicates that Kubernetes is trying to launch this Pod, but one or more of the containers is crashing or getting killed.

$ kubectl get pods

NAME                       READY     STATUS             RESTARTS   AGE
crasher-2443551393-vuehs   0/1       CrashLoopBackOff   2          54s

We then describe the pod to get more detail:

$ kubectl describe pod crasher-2443551393-vuehs

Name:           crasher-2443551393-vuehs
Namespace:      fail
Node:           gke-nrhk-1-default-pool-a101b974-wfp7/10.142.0.2
Start Time:     Fri, 10 Feb 2017 14:20:29 -0500
Labels:         pod-template-hash=2443551393
                run=crasher
Status:         Running
IP:             10.0.0.74
Controllers:    ReplicaSet/crasher-2443551393
Containers:
  crasher:
    Container ID:    docker://51c940ab32016e6d6b5ed28075357661fef3282cb3569117b0f815a199d01c60
    Image:           rosskukulinski/crashing-app
    Image ID:        docker://sha256:cf7452191b34d7797a07403d47a1ccf5254741d4bb356577b8a5de40864653a5
    Port:
    State:           Terminated
      Reason:        Error
      Exit Code:     1
      Started:       Fri, 10 Feb 2017 14:22:24 -0500
      Finished:      Fri, 10 Feb 2017 14:22:26 -0500
    Last State:      Terminated
      Reason:        Error
      Exit Code:     1
      Started:       Fri, 10 Feb 2017 14:21:39 -0500
      Finished:      Fri, 10 Feb 2017 14:21:40 -0500
    Ready:           False
    Restart Count:   4

At this juncture, instead of assessing the problem manually, we prompted ChatGPT in a chat to troubleshoot it and recommend resolution steps.
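The exact wording of the prompt is not critical; something along the following lines works, with the raw kubectl output above pasted in verbatim:

"We are troubleshooting a failing pod in a Kubernetes cluster. Below is the output of kubectl get pods and kubectl describe pod for the pod. What problems do you identify from this event log, and what steps do you recommend to resolve them? <paste the kubectl output shown above>"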

Below is ChatGPT's response:

ChatGPT > The problems identified from the event log are:

  1. The pod "crasher-2443551393-vuehs" is in a CrashLoopBackOff state, indicating that it is crashing repeatedly.
  2. The container "crasher" in the pod has terminated with an exit code of 1, indicating that it has exited with an error.
  3. The container has been restarted multiple times but continues to crash.

ChatGPT > Here are the steps to resolve the error:

  • Check the pod logs using the command kubectl logs crasher-2443551393-vuehs. This will provide more information about the error and why the container is crashing.
  • Check the container's environment variables and configuration using the command kubectl describe pod crasher-2443551393-vuehs. This will give an idea of any potential misconfigurations or missing dependencies.
  • Check the container's resource usage using the command kubectl top pod crasher-2443551393-vuehs. This will help identify if the container is running out of memory or CPU resources.
  • Check the container's restart policy using the command kubectl describe pod crasher-2443551393-vuehs. If the policy is set to never restart, it may be necessary to update it to allow for automatic restarts.
  • If the error persists, try restarting the pod using the command kubectl delete pod crasher-2443551393-vuehs and letting it automatically be recreated by the ReplicaSet.
  • If the error still persists, try scaling down the ReplicaSet to 0 replicas using the command kubectl scale replicaset crasher-2443551393 --replicas=0. Then, update the ReplicaSet's configuration to address the issue and scale it back up to the desired number of replicas using the command kubectl scale replicaset crasher-2443551393 --replicas=3.
  • If the issue still persists, it may be necessary to recreate the ReplicaSet entirely using a new configuration. To do this, first delete the existing ReplicaSet using the command kubectl delete replicaset crasher-2443551393, then create a new ReplicaSet using a corrected configuration file.

The response from ChatGPT was relevant to the context of the problem. It provided a series of steps to troubleshoot and resolve the error with the pod and container. These steps include checking the pod logs, container environment variables and configuration, resource usage, and restart policy. If the error persists, the response recommends trying to restart the pod, scaling down the ReplicaSet, or recreating the ReplicaSet entirely with a corrected configuration.
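As a quick sketch of how the first recommendation plays out in practice, the logs of the crashed container can still be retrieved by asking for the previous instance, and once the cause is fixed the deployment can be rolled forward to a corrected image. The :fixed tag below is a placeholder for illustration, not a real image.

# Fetch logs from the most recently crashed container instance
$ kubectl logs crasher-2443551393-vuehs --previous

# After fixing the application, roll the deployment forward to the corrected image (placeholder tag)
$ kubectl set image deployment/crasher crasher=rosskukulinski/crashing-app:fixed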

Scenario: Exceeding Resource Limits

In Kubernetes, IT Ops can set limits on the amount of CPU or memory allocated to Pods and Containers. From an application deployment perspective, a developer may not be aware of these predefined limits when a deployment fails.

In this example, the Kubernetes output and event logs are given as input to ChatGPT for troubleshooting and resolution, as described below.

Here is an example of a deployment into a cluster whose CPU/memory request limits are unknown to the developer. Notice that the memory resource request has been set to 5Gi.

# gateway.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              memory: 5Gi

We create the deployment using kubectl create -f gateway.yaml, and when we then check the Pods, we see a "No resources found" message.

$ kubectl get pods

No resources found.        

If we inspect the deployment using describe, we can see from the last event that the deployment created a ReplicaSet (gateway-764140025) and scaled it up to 1. The ReplicaSet is the entity that manages the lifecycle of the Pods.

$ kubectl describe deployment/gateway

Name:                   gateway
Namespace:              fail
CreationTimestamp:      Sat, 11 Feb 2017 15:03:34 -0500
Labels:                 app=gateway
Selector:               app=gateway
Replicas:               0 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 1 max surge
OldReplicaSets:
NewReplicaSet:          gateway-764140025 (0/1 replicas created)
Events:
  FirstSeen   LastSeen   Count   From                       SubObjectPath   Type     Reason              Message
  ---------   --------   -----   ----                       -------------   ----     ------              -------
  4m          4m         1       {deployment-controller }                   Normal   ScalingReplicaSet   Scaled up replica set gateway-764140025 to 1

We can describe the ReplicaSet, and from its events we can infer that IT Ops has set a maximum memory usage per Pod of 100Mi. (Note: the current namespace limits can be inspected by running kubectl describe limitrange.)

$ kubectl describe rs/gateway-764140025

Name:           gateway-764140025
Namespace:      fail
Image(s):       nginx
Selector:       app=gateway,pod-template-hash=764140025
Labels:         app=gateway
                pod-template-hash=764140025
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
  FirstSeen   LastSeen   Count   From                       SubObjectPath   Type      Reason         Message
  ---------   --------   -----   ----                       -------------   ----      ------         -------
  6m          28s        15      {replicaset-controller }                   Warning   FailedCreate   Error creating: pods "gateway-764140025-" is forbidden: [maximum memory usage per Pod is 100Mi, but request is 5368709120., maximum memory usage per Container is 100Mi, but request is 5Gi.]
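For context, a per-Pod/per-Container memory ceiling like the 100Mi in this FailedCreate event is typically enforced through a LimitRange object in the namespace. A minimal sketch of such a LimitRange is shown below; this is our assumption of how IT Ops might have configured it, not the actual object from this cluster.

# limitrange.yaml (illustrative sketch, not this cluster's actual configuration)
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: fail
spec:
  limits:
    - type: Pod          # cap total memory per Pod
      max:
        memory: 100Mi
    - type: Container    # cap memory per Container
      max:
        memory: 100Mi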

At this juncture, instead of assessing the problem manually, we gave ChatGPT this information in a chat and asked it to troubleshoot the problem and recommend resolution steps.

Below is ChatGPT's response:

ChatGPT > The root cause of the Kubernetes problems observed from the event logs appears to be that the pods belonging to the gateway-764140025 ReplicaSet were not able to be created due to exceeding the maximum memory usage per Pod or Container. The error message specifically mentions that the maximum memory usage per Pod is 100Mi, but the request was 5368709120, and the maximum memory usage per Container is 100Mi, but the request was 5Gi. This indicates that the ReplicaSet was trying to create pods with a significantly larger amount of memory than what is allowed.

ChatGPT > To fix the problem, there are a few potential steps that could be taken:

  1. Review the configuration of the gateway-764140025 ReplicaSet to ensure that it is not requesting more memory than is allowed. This can be done by running the kubectl describe rs/gateway-764140025 command and looking at the spec.template.spec.containers section, which should specify the amount of memory requested by each container in the ReplicaSet. If the memory requests are too high, they can be reduced to fit within the maximum allowed amount.
  2. If the memory requests are already within the allowed limits, it is possible that the Kubernetes cluster itself has been configured with a lower memory limit than what is being requested by ReplicaSet. In this case, the IT Ops would need to update the memory limits for the cluster to allow for higher memory usage. This can be done by modifying the --kube-reserved and --system-reserved flags for the kubelet service, which control the amount of memory reserved for the system and other critical components.
  3. Alternatively, if the memory requests are within the allowed limits and the cluster itself has sufficient memory, it is possible that there are other issues preventing the pods from being created. In this case, it would be necessary to further investigate the logs and events for the ReplicaSet to identify and diagnose the specific issue. This can be done by running the kubectl describe rs/gateway-764140025 command and looking at the Events section, which should provide more information about any errors or failures that occurred when creating the pods.

The response from ChatGPT was relevant to the context of the problem. It suggested that the root cause of the Kubernetes problems observed in the event logs was due to the pods belonging to the gateway-764140025 ReplicaSet exceeding the maximum memory usage per Pod or Container. To fix the problem, ChatGPT suggested reviewing the configuration of the ReplicaSet to ensure it is not requesting too much memory, updating the memory limits for the cluster if necessary, or further investigating the logs and events for the ReplicaSet to diagnose any other issues preventing the pods from being created.
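As a sanity check on the numbers, 5Gi is 5 × 1024³ = 5,368,709,120 bytes, which is exactly the raw figure quoted in the FailedCreate message. Following ChatGPT's first recommendation, a minimal sketch of a corrected gateway.yaml simply brings the memory request under the 100Mi ceiling (the 64Mi value below is an illustrative choice, not a prescribed setting):

# gateway.yaml (corrected sketch: request now fits within the 100Mi namespace limit)
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              memory: 64Mi   # was 5Gi; must stay at or below the 100Mi LimitRange maximum

After reapplying the manifest (or deleting and recreating the deployment), the ReplicaSet should be able to create its Pod.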

Combining ChatGPT and Opscruise Observability for Improved Kubernetes Decision Making

Based on the examples provided, ChatGPT can produce natural-language responses that give insight into common Kubernetes issues such as CrashLoopBackOff and resource underprovisioning errors. In the case of CrashLoopBackOff, a pod is unable to start or run properly; in the underprovisioning case, a ReplicaSet cannot create Pods because their memory requests exceed the maximum allowed. In both cases, ChatGPT offers guidance that helps IT Ops troubleshoot and fix the issue.

While ChatGPT's responses can provide a quick understanding of a problem, customers often require more detailed, nuanced, and contextual information specific to the affected Kubernetes Pod or node. In this regard, Opscruise provides detailed Kubernetes alerting features that measure and report on specific resource capacity and behavior changes. By combining ChatGPT as a complementary tool with Opscruise's alert analysis (which includes application state, alert activity, explanation, config views, event history, metric history charts, and logs), customers can make more informed remediation decisions. In the compute resource underprovisioning scenario above, for example, this means deciding whether to increase a node's memory, add a new node with more memory to the cluster, reduce the Pod's memory request, or rebalance the cluster.

Since ChatGPT is currently a preview release, we are working on incorporating the language generation capabilities of the OpenAI GPT-3 model into our product. This will allow us to offer similar generative capabilities while focusing on improving natural-language-based Kubernetes troubleshooting.
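As a rough sketch of what such an integration could look like, the same event-log text can be sent programmatically to the OpenAI completions API. The prompt wording and model choice below are illustrative assumptions, not our production pipeline, and jq is used only to JSON-escape the log text safely:

# Capture the event log for the failing pod
$ EVENT_LOG="$(kubectl describe pod crasher-2443551393-vuehs)"

# Send it to the OpenAI completions API and ask for a root cause and resolution steps
$ curl https://api.openai.com/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d "$(jq -n --arg p "Identify the root cause and resolution steps for this Kubernetes event log: $EVENT_LOG" \
          '{model: "text-davinci-003", prompt: $p, max_tokens: 512, temperature: 0}')"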

For more information about Opscruise Observability, please visit the website.

Summary

We tested multiple Kubernetes failure scenarios, such as application crashes, probe failures, and Pod- and cluster-level compute resource issues, using ChatGPT. It appears that ChatGPT can analyze event logs and apply its acquired knowledge of Kubernetes to quickly isolate faults in Kubernetes applications. Customers running complex Kubernetes setups often require more detailed, quantified, and contextual information specific to the issues in their Kubernetes deployments, such as that provided by Opscruise.

We believe that this AI-based approach has the potential to significantly improve the efficiency and effectiveness of troubleshooting Kubernetes failures and accelerate the time to resolution. We also believe that by combining OpenAI GPT models with Opscruise's core observability insights, customers can make more informed decisions.

References

  1. OpenAI ChatGPT announcement: https://openai.com/blog/chatgpt/
  2. Kubernetes failures and event logs in the public domain: https://kukulinski.com/
  3. Kubernetes Troubleshooting: A Hands-On Practitioner's Guide

#ChatGpt #Kubernetes #AiOps #Observability #Opscruise
