Exploiting ChatGPT for Troubleshooting Kubernetes Problems
As a provider of Kubernetes-specific Observability solutions, Opscruise is always exploring new and innovative ways to troubleshoot problems in Kubernetes. With the recent excitement over the release of ChatGPT, a large-scale language model trained by OpenAI, and its promise to answer questions on a broad range of topics, we decided to put it to the test. In our case, we wanted to see if it could be used to help identify the root cause of failures in Kubernetes.
If you use Kubernetes in a cloud-native environment, you know firsthand how difficult it can be to manually troubleshoot failures in a Kubernetes cluster -- see our eBook on Kubernetes. With so many moving parts and complex interactions between components such as nodes, pods, the scheduler, and controllers, fault isolation is time-consuming and requires in-depth technical knowledge and expertise, making it difficult even for experienced Ops/SRE teams to effectively diagnose and resolve Kubernetes failures.
Because ChatGPT is trained on a very large body of information, we expected it to have enough acquired knowledge of Kubernetes to help automate and speed up the troubleshooting process, and perhaps make it more effective. At Opscruise, we manage hundreds to thousands of Kubernetes pods for our customers, and Opscruise anomaly detectors collect event logs when alerts occur. By providing Kubernetes event logs as input, we expect ChatGPT to quickly analyze the logs and help determine the root cause of a failure. Used appropriately, we believe ChatGPT can reduce the burden of manually troubleshooting problems in a Kubernetes cluster.
In this blog post, we walk through example Kubernetes problem scenarios, use ChatGPT to analyze the Kubernetes event logs, and share the steps it recommends for resolution. So, without further ado, let's dive into the world of using ChatGPT for Kubernetes troubleshooting.
Scenario: Application Crashing After Deployment
When we deploy an application on Kubernetes, it is common to observe an application crash. In this example, a deployed application crashes one second after it starts. The CrashLoopBackOff status tells us that Kubernetes is trying to launch the pod, but one or more of its containers is crashing. We created a test scenario that crashes the deployed application and then fed the observed outputs to ChatGPT in a chat for troubleshooting and resolution, as described below.
A new deployment is created with an application that crashes after 1 second:
$ kubectl run crasher --image=rosskukulinski/crashing-app
If we then observe the status of our Pods, we see the CrashLoopBackOff error that indicates that Kubernetes is trying to launch this Pod, but one or more of the containers is crashing or getting killed.
$ kubectl get pods
NAME                       READY     STATUS             RESTARTS   AGE
crasher-2443551393-vuehs   0/1       CrashLoopBackOff   2          54s
We then describe the pod to get more information:
$ kubectl describe pod crasher-2443551393-vuehs
Name:           crasher-2443551393-vuehs
Namespace:      fail
Node:           gke-nrhk-1-default-pool-a101b974-wfp7/10.142.0.2
Start Time:     Fri, 10 Feb 2017 14:20:29 -0500
Labels:         pod-template-hash=2443551393
                run=crasher
Status:         Running
IP:             10.0.0.74
Controllers:    ReplicaSet/crasher-2443551393
Containers:
  crasher:
    Container ID:    docker://51c940ab32016e6d6b5ed28075357661fef3282cb3569117b0f815a199d01c60
    Image:           rosskukulinski/crashing-app
    Image ID:        docker://sha256:cf7452191b34d7797a07403d47a1ccf5254741d4bb356577b8a5de40864653a5
    Port:
    State:           Terminated
      Reason:        Error
      Exit Code:     1
      Started:       Fri, 10 Feb 2017 14:22:24 -0500
      Finished:      Fri, 10 Feb 2017 14:22:26 -0500
    Last State:      Terminated
      Reason:        Error
      Exit Code:     1
      Started:       Fri, 10 Feb 2017 14:21:39 -0500
      Finished:      Fri, 10 Feb 2017 14:21:40 -0500
    Ready:           False
    Restart Count:   4
At this juncture, instead of making a manual assessment of the problem, we prompted ChatGPT in a chat to troubleshoot the problem and recommend the steps for a resolution.
Below is ChatGPT's response:
ChatGPT > The problems identified from the event log are:
ChatGPT > Here are the steps to resolve the error:
The response from ChatGPT was relevant to the context of the problem. It provided a series of steps to troubleshoot and resolve the error with the pod and container: checking the pod logs, the container's environment variables and configuration, resource usage, and the restart policy. If the error persists, the response recommends restarting the pod, scaling down the ReplicaSet, or recreating the ReplicaSet entirely with a corrected configuration.
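For readers who want to follow those suggestions hands-on, the commands below are a minimal sketch of what each step typically looks like with kubectl (they are our illustration, not ChatGPT's verbatim output). They use the pod and deployment names from our test scenario and assume the current namespace is the one the pod runs in (fail, in this example).

# Inspect the logs of the crashing container; --previous shows the last terminated instance
$ kubectl logs crasher-2443551393-vuehs --previous

# Review the container's environment variables and full configuration
$ kubectl describe pod crasher-2443551393-vuehs
$ kubectl get pod crasher-2443551393-vuehs -o yaml

# Check current resource usage (requires a metrics add-on such as metrics-server)
$ kubectl top pod crasher-2443551393-vuehs

# If the error persists, delete the pod so its controller recreates it
$ kubectl delete pod crasher-2443551393-vuehs

# Or scale the workload down and back up (kubectl run created a deployment named crasher here)
$ kubectl scale deployment/crasher --replicas=0
$ kubectl scale deployment/crasher --replicas=1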
Scenario: Exceeding Resource Limits
In Kubernetes, IT Ops can set limits on the amount of CPU or memory allocated to Pods and Containers. From an application deployment perspective, a developer might not know about these predefined limits until an application deployment fails.
In this example, the Kubernetes output and event logs are given as input to ChatGPT for troubleshooting and resolution, as described below.
Here is an example of a deployment in a cluster with a CPU/memory request limit the developer is unaware of. Notice that the memory resource request has been set to 5Gi.
# gateway.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              memory: 5Gi
The deployment is created using kubectl create -f gateway.yaml, but when we then check the Pods, we see the "No resources found" message.
$ kubectl get pods
No resources found.
If we inspect our deployment using describe, we can observe from the last line that the deployment created a ReplicaSet (gateway-764140025) and scaled it up to 1. The ReplicaSet is the entity that manages the lifecycle of the Pods.
$ kubectl describe deployment/gateway
Name:                   gateway
Namespace:              fail
CreationTimestamp:      Sat, 11 Feb 2017 15:03:34 -0500
Labels:                 app=gateway
Selector:               app=gateway
Replicas:               0 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 1 max surge
OldReplicaSets:
NewReplicaSet:          gateway-764140025 (0/1 replicas created)
Events:
  FirstSeen  LastSeen  Count  From                      SubObjectPath  Type    Reason             Message
  ---------  --------  -----  ----                      -------------  ------  ------             -------
  4m         4m        1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set gateway-764140025 to 1
We can describe the ReplicaSet, and from the events we can infer that IT Ops has set a maximum memory usage per Pod of 100Mi. (Note: we can inspect the current namespace limits by running kubectl describe limitrange; a sketch of a LimitRange that would impose such a cap is shown after the output below.)
$ kubectl describe rs/gateway-764140025
Name:           gateway-764140025
Namespace:      fail
Image(s):       nginx
Selector:       app=gateway,pod-template-hash=764140025
Labels:         app=gateway
                pod-template-hash=764140025
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
  FirstSeen  LastSeen  Count  From                      SubObjectPath  Type     Reason        Message
  ---------  --------  -----  ----                      -------------  ------   ------        -------
  6m         28s       15     {replicaset-controller }                 Warning  FailedCreate  Error creating: pods "gateway-764140025-" is forbidden: [maximum memory usage per Pod is 100Mi, but request is 5368709120., maximum memory usage per Container is 100Mi, but request is 5Gi.]
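For context, the 100Mi ceiling reported in these events comes from a LimitRange object in the namespace. We did not capture the actual object in our test, but a LimitRange along these lines (a hypothetical sketch, with values inferred from the FailedCreate message) would produce the behavior above:

# limitrange.yaml (hypothetical sketch; values inferred from the events above)
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-limit-range
  namespace: fail
spec:
  limits:
    - type: Container
      max:
        memory: 100Mi
    - type: Pod
      max:
        memory: 100Mi

Any Pod or Container whose memory request exceeds these maxima is rejected at admission, which is exactly what the replicaset-controller reports.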
At this juncture, instead of making a manual assessment of the problem, we prompted ChatGPT in a chat with this information to troubleshoot the problem and recommend the steps for a resolution.
Below is ChatGPT's response:
ChatGPT > The root cause of the Kubernetes problems observed from the event logs appears to be that the pods belonging to the gateway-764140025 ReplicaSet could not be created because they exceed the maximum memory usage per Pod or Container. The error message specifically mentions that the maximum memory usage per Pod is 100Mi, but the request was 5368709120, and the maximum memory usage per Container is 100Mi, but the request was 5Gi. This indicates that the ReplicaSet was trying to create pods with a significantly larger amount of memory than what is allowed.
ChatGPT > To fix the problem, there are a few potential steps that could be taken:
The response from ChatGPT was relevant to the context of the problem. It suggested that the root cause of the Kubernetes problems observed in the event logs was that the pods belonging to the gateway-764140025 ReplicaSet exceeded the maximum memory usage per Pod or Container. To fix the problem, ChatGPT suggested reviewing the configuration of the ReplicaSet to ensure it is not requesting too much memory, updating the memory limits for the cluster if necessary, or further investigating the logs and events for the ReplicaSet to diagnose any other issues preventing the pods from being created.
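As an illustration of ChatGPT's first suggestion, here is a minimal sketch of a corrected gateway.yaml, assuming the 100Mi namespace limit stays in place and that nginx runs comfortably within it; the only change from the original manifest is the memory request (plus an explicit limit), and the values shown are our own illustrative choices.

# gateway.yaml (corrected sketch: memory request reduced to fit the namespace limit)
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              memory: 64Mi
            limits:
              memory: 100Mi

With the request inside the 100Mi cap, the ReplicaSet can create the pod; the alternative, as ChatGPT notes, is to ask IT Ops to raise the LimitRange for the namespace.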
Combining ChatGPT and Opscruise Observability for Improved Kubernetes Decision Making
Based on the examples provided, ChatGPT can produce natural language responses that offer insight into common Kubernetes issues such as CrashLoopBackOff and resource underprovisioning errors. In the CrashLoopBackOff case, a pod is unable to start or run properly; in the resource underprovisioning case, a ReplicaSet cannot create Pods because they exceed the namespace's maximum memory limit. In both cases, ChatGPT offers guidance that IT Ops can use to troubleshoot and fix the issue.
While ChatGPT's responses can provide a quick understanding of a problem, customers often require more detailed, nuanced, and contextual information specific to a given Kubernetes Pod or node. In this regard, Opscruise provides detailed Kubernetes alert features that measure and report on specific resource capacity and behavior changes. By combining ChatGPT as a complementary tool with Opscruise's alert analysis (application state, alert activity, explanation, config views, event history, metric history charts, and logs), customers can make more informed remediation decisions. In the compute resource underprovisioning scenario above, for example, that could mean increasing the node's memory size, creating a new node with more memory and adding it to the cluster, reducing the memory request in the pod, or rebalancing the cluster.
As ChatGPT is currently in a preview release, we are working on incorporating the powerful language generation capabilities of the OpenAI GPT-3 model instead. This will allow us to offer similar generative capabilities while focusing on improving natural language-based Kubernetes troubleshooting.
For more information about Opscruise Observability, please visit the website.
Summary
We tested multiple Kubernetes failure scenarios, such as application crashes, probe failures, and Pod- and cluster-level compute resource issues, using ChatGPT. It appears that it can analyze event logs and apply its acquired knowledge of Kubernetes to quickly isolate faults in Kubernetes applications. Customers running complex Kubernetes setups often require more detailed, quantified, and contextual information specific to the different issues of a Kubernetes deployment, such as that provided by Opscruise.
We believe this AI-based approach has the potential to significantly improve the efficiency and effectiveness of troubleshooting Kubernetes failures and accelerate time to resolution. By combining OpenAI GPT models with Opscruise's core observability insights, customers can make more informed decisions.