A Day in the Life of an Engineer: Me and K8s CPU Throttling
Over the past few months (probably longer than I care to admit) I have thrown myself into the rabbit hole of optimizing our REST API built on the reactive Spring WebFlux framework and creating multiple variations of our Grafana dashboards for that same service for observability and troubleshooting.
During this time I've definitely driven some of my co-workers crazy making/reviewing pull requests to ensure that "the reactive threads do not get blocked." I've been staring at Grafana metrics so much that I have at least 15 Grafana browser tabs open at any given time and can navigate anything in Grafana almost by second nature (no, I'm not really proud of this "achievement").
There have been quite a number of improvements from these efforts (or at least from where I'm standing...let's go with that), but my heart still skips a beat when I see an uptick/spike in latency and can't really explain what's happening.
Until one day one of the guys on the DevOps team told me on Slack that our application was being "throttled." He advised removing the cpu.limits to alleviate this. Now I know whole fights have been spawned over whether to set CPU limits or not, but let's not get into that; it's not really the point of this long-winded post.
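For the curious, the change itself is tiny. As a rough sketch (names and numbers made up, assuming a plain Deployment manifest), it's basically keeping the CPU request and dropping the limit:

resources:
  requests:
    cpu: "500m"   # keep a request so the scheduler still reserves CPU for the pod
  # limits:
  #   cpu: "1"    # the limit is what drives CFS throttling, so this is what got removed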
Anyway.
We were still seeing occasional latency from our Istio (sidecar) container, and my stupid self did not realize right away that it was probably connected to the same CPU throttling issue. When I finally did, I created another Grafana visualization to check...lo and behold, the sidecar was being CPU throttled too.
Working with a colleague, we adjusted the Istio sidecar's CPU resources and saw a dramatic improvement in the CPU throttling...we still need to tune some settings, but a huge thorn in our side has been removed.
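In case you're wondering how: if you're on the standard sidecar injector, you can size the istio-proxy container per pod via annotations on the pod template. The values below are placeholders, not a recommendation:

template:
  metadata:
    annotations:
      sidecar.istio.io/proxyCPU: "500m"        # CPU request for the istio-proxy container
      sidecar.istio.io/proxyCPULimit: "2000m"  # CPU limit for the istio-proxy container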
Just a bit of technical detail...there are a couple of Grafana (Prometheus) PromQL queries you can use to keep track of this. One is the per-container rate of throttled periods:
sum by (container) (irate(container_cpu_cfs_throttled_periods_total[$__rate_interval]))
and the other is the per-pod throttling percentage:
avg(sum by (pod) (irate(container_cpu_cfs_throttled_periods_total[$__rate_interval])) / sum by (pod) (irate(container_cpu_cfs_periods_total[$__rate_interval])))
Both are oversimplified queries, since the exact form will vary a little depending on how your Prometheus/Grafana setup is configured. More about it here: https://aws.amazon.com/blogs/containers/using-prometheus-to-avoid-disasters-with-kubernetes-cpu-limits/
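For example, in a typical kube-prometheus setup the per-pod percentage usually ends up with a couple of label filters, something along these lines (the namespace value is obviously a placeholder):

sum by (pod) (irate(container_cpu_cfs_throttled_periods_total{namespace="my-namespace", container!=""}[$__rate_interval])) / sum by (pod) (irate(container_cpu_cfs_periods_total{namespace="my-namespace", container!=""}[$__rate_interval]))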
When we originally put these settings in our Helm chart (or whatever K8s configuration it took to get our cluster working), we didn't think much of them, largely because we didn't quite understand at the time how they affect pod performance.
Not to get too technical about how it works (and frankly I'm still not 100% sure I understand this), but basically with K8s CPU provisioning you don't really get allocated virtual CPU cores; you get allocated time slices instead. Under the hood your cpu.limits get translated into a CFS quota of CPU time per scheduling period...which basically means "how much CPU time do the K8s gods allocate to your container/app per period". If your container burns through that quota before the period ends (or the node simply can't provide enough CPU), the kernel throttles it until the next period...and that's the throttling you see in the metrics. Either way it's not great.
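To make that concrete (made-up numbers, and assuming the default 100ms CFS period):

resources:
  limits:
    cpu: "500m"   # under the hood this becomes cfs_quota_us = 50000 and cfs_period_us = 100000,
                  # i.e. 50ms of CPU time per 100ms window; burn through the 50ms early and the
                  # container sits throttled until the next window starts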
Of course, the most obvious way to get around this is...increase the CPU resources...or...well, scale out the number of pods. But that quickly becomes untenable and costly.
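(For completeness, "scale out" usually just means a bog-standard HorizontalPodAutoscaler along these lines, again with made-up names and numbers:)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api        # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when average CPU utilization crosses this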
So it still comes back to making sure that your application is optimized and horizontally scalable, but the whole takeaway of this post is that beyond optimizing the application itself, K8s CPU resource allocation is something worth looking at...
...and that SkyNet is upon us and all of this will not matter much in the near future.