A Day in the Life of an Engineer: Me and K8s CPU Throttling
Over the past few months (probably longer than I care to admit) I have thrown myself into the rabbit hole of optimizing our REST API built on the reactive Spring WebFlux framework and creating multiple variations of our Grafana dashboards for that same service for observability and troubleshooting.
During this time I've definitely driven some of my co-workers crazy making/reviewing pull requests to ensure that "the reactive threads do not get blocked." I've been staring at Grafana metrics so much that I have at least 15 Grafana browser tabs open at any given time and can navigate anything in Grafana almost by second nature (no, I'm not really proud of this "achievement").
There have been quite a number of improvements from these efforts (or at least from where I'm standing...let's go with that), but my heart still skips a beat when I see an uptick/spike in latency and can't really explain what's happening.
Until one day one of the guys on the DevOps team told me on Slack that our application was being "throttled." He advised removing the cpu.limits to alleviate this. Now I know whole fights have been spawned over whether to set CPU limits or not, but let's not get into that; it's not really the point of this long-winded post.
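For the curious, the change itself is tiny. As a rough sketch (names and numbers made up, assuming a plain Deployment manifest), it's basically keeping the CPU request and dropping the limit:

resources:
  requests:
    cpu: "500m"   # keep a request so the scheduler still reserves CPU for the pod
  # limits:
  #   cpu: "1"    # the limit is what drives CFS throttling, so this is what got removed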
Anyway.
We were still seeing occasional latency from our Istio (sidecar) container, and my stupid self did not realize right away that it was probably connected to the same CPU throttling issue. When I finally did, I created another Grafana visualization to check...lo and behold, the sidecar was being CPU throttled too.
Working with a colleague, we adjusted the Istio sidecar's CPU resources and saw a dramatic improvement in the CPU throttling...we still need to tune some settings, but a huge thorn in our side has been removed.
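In case you're wondering how: if you're on the standard sidecar injector, you can size the istio-proxy container per pod via annotations on the pod template. The values below are placeholders, not a recommendation:

template:
  metadata:
    annotations:
      sidecar.istio.io/proxyCPU: "500m"        # CPU request for the istio-proxy container
      sidecar.istio.io/proxyCPULimit: "2000m"  # CPU limit for the istio-proxy container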
Just a bit of technical detail...there are a couple of Grafana (Prometheus) PromQL queries you can use to keep track of this. One is the per-container rate of throttled periods:
sum by (container) (irate(container_cpu_cfs_throttled_periods_total[$__rate_interval]))
and the other is the per-pod throttling percentage:
avg(sum by (pod) (irate(container_cpu_cfs_throttled_periods_total[$__rate_interval])) / sum by (pod) (irate(container_cpu_cfs_periods_total[$__rate_interval])))
Both are oversimplified queries, since the exact form will vary a little depending on how your Prometheus/Grafana setup is configured. More about it here: https://aws.amazon.com/blogs/containers/using-prometheus-to-avoid-disasters-with-kubernetes-cpu-limits/
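For example, in a typical kube-prometheus setup the per-pod percentage usually ends up with a couple of label filters, something along these lines (the namespace value is obviously a placeholder):

sum by (pod) (irate(container_cpu_cfs_throttled_periods_total{namespace="my-namespace", container!=""}[$__rate_interval])) / sum by (pod) (irate(container_cpu_cfs_periods_total{namespace="my-namespace", container!=""}[$__rate_interval]))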
When we originally put these settings in our Helm chart (or whatever K8s configuration it took to get our cluster working), we didn't think much of them, largely because we didn't quite understand at the time how they affect pod performance.
Not to get too technical about how it works (and frankly I'm still not 100% sure I understand this), but basically with K8s CPU provisioning you don't really get allocated virtual CPU cores; you get allocated time slices instead. Under the hood your cpu.limits get translated into a CFS quota of CPU time per scheduling period...which basically means "how much CPU time do the K8s gods allocate to your container/app per period". If your container burns through that quota before the period ends (or the node simply can't provide enough CPU), the kernel throttles it until the next period...and that's the throttling you see in the metrics. Either way it's not great.
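To make that concrete (made-up numbers, and assuming the default 100ms CFS period):

resources:
  limits:
    cpu: "500m"   # under the hood this becomes cfs_quota_us = 50000 and cfs_period_us = 100000,
                  # i.e. 50ms of CPU time per 100ms window; burn through the 50ms early and the
                  # container sits throttled until the next window starts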
Of course, the most obvious way to get around this is...increase the CPU resources...or...well, scale out the number of pods. But that quickly becomes untenable and costly.
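(For completeness, "scale out" usually just means a bog-standard HorizontalPodAutoscaler along these lines, again with made-up names and numbers:)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api        # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when average CPU utilization crosses this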
So it still comes back to making sure that your application is optimized and horizontally scalable, but the whole takeaway of this post is that beyond optimizing the application itself, K8s CPU resource allocation is something worth looking at...
...and that SkyNet is upon us and all of this will not matter much in the near future.