Being proactive with reactive scaling with KEDA
Navjot Bansal
Building Computer Vision Systems @Oracle | Software Architecture | System Design | ICPC Regionalist
Intro
Image source: https://naruto.fandom.com/wiki/Multiple_Shadow_Clone_Technique
Naruto has been an inspiration to many, and Kubernetes users are no exception. If you have ever noticed how Naruto relies on the Multiple Shadow Clone Jutsu to create identical replicas of himself whenever he faces a burdensome task, you already get the idea.
For people who don't watch anime, here is a GIF for you.
Overview
In Kubernetes, we deploy our services inside pods, and those pods are managed by ReplicaSets. We generally run multiple replicas so that users can consume the service without breaking SLAs.
But what if a lot of users are sending requests to the same set of pods?
Queue wait times grow rapidly, which is not ideal.
How Kubernetes Scales
To solve this problem, Kubernetes provides the Horizontal Pod Autoscaler (HPA), which scales the application to the desired number of pods.
The HPA can rely on the built-in resource metrics served through the Kubernetes metrics API, or on external triggers, to scale a deployment.
While the default triggers come in handy, they are not always sufficient: they don't tell the complete story.
Many developers therefore want custom metrics to drive scaling. These can be anything from the expected RPS the service needs to sustain, to the average latency of the pods, to, in some cases, a machine-learning model that predicts load from current traffic and scales accordingly.
In this blog, I want to keep our discussion limited to custom metrics without involving any Machine Learning.
Custom Metrics
To collect custom metrics I primarily rely on Prometheus. Prometheus is a time-series based metrics collection tool that lets you store whatever data you want in a time-series database.
The benefit of a time-series based metrics tool is that you can work with things like metrics aggregated over a period of time, or the rate of change of a metric as a function of time.
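For example, to turn a raw request counter into a rate, you can ask Prometheus directly. Here is a minimal Python sketch, assuming Prometheus is reachable at localhost:9090 and your service exposes a counter named http_requests_total (both are placeholders for your own setup):

import requests

# Hypothetical Prometheus endpoint and counter name; adjust to your setup.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = 'sum(rate(http_requests_total{service="my-service"}[1m]))'

def current_rps() -> float:
    """Ask Prometheus for the per-second request rate averaged over the last minute."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means the series does not exist (yet).
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    print(f"current RPS: {current_rps():.2f}")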
The HPA by itself cannot consume Prometheus metrics. This is where we introduce KEDA into the system. KEDA stands for Kubernetes Event-Driven Autoscaling; it is an adapter that collects metrics from data-collection tools like Prometheus and triggers scaling through an HPA.
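Below is a rough sketch of how that wiring can look as a KEDA ScaledObject with a Prometheus trigger, generated from Python for convenience. The Deployment name my-service, the Prometheus address, and the query are assumptions you would replace with your own; treat it as a starting point rather than a drop-in config:

import yaml  # pip install pyyaml

# Hypothetical names: a Deployment called "my-service" and the counter queried above.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "my-service-scaler", "namespace": "default"},
    "spec": {
        "scaleTargetRef": {"name": "my-service"},  # the Deployment KEDA should scale
        "minReplicaCount": 2,
        "maxReplicaCount": 20,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": 'sum(rate(http_requests_total{service="my-service"}[1m]))',
                    "threshold": "5",  # desired metric value per pod (here: RPS per pod)
                },
            }
        ],
    },
}

# Write the manifest so it can be applied with `kubectl apply -f scaledobject.yaml`.
with open("scaledobject.yaml", "w") as f:
    yaml.safe_dump(scaled_object, f, sort_keys=False)

KEDA creates and manages the underlying HPA for this ScaledObject, and the threshold acts as the desired per-pod metric value.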
A simple overview of the whole system: KEDA polls Prometheus on an interval, exposes the value to the HPA through the external metrics API, and the HPA computes the desired replica count as:
desired_pods = ceil(current_pods * (current_metric / desired_metric))
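To make the formula concrete, here is a toy calculation (the numbers are invented):

import math

def desired_replicas(current_pods: int, current_metric: float, desired_metric: float) -> int:
    """HPA scaling rule: scale in proportion to how far the metric is from its target."""
    return math.ceil(current_pods * (current_metric / desired_metric))

# Example: 4 pods, observed 12 RPS per pod on average, target of 5 RPS per pod.
print(desired_replicas(4, 12, 5))  # -> 10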
My Opinion on custom metrics
Custom metrics have no boundaries: you can choose anything you like and make it a metric for your service. When given infinite possibilities, we tend to complicate things. We experiment with custom use cases and do achieve scalability, but how do we make sure our scaling is correct?
A lot of metrics can work and show good results. Should we accept all of them as plausible options?
General Custom Metrics
Common choices include requests per second (RPS), request-queue size, and average latency; any of them can be used to keep request latencies below the proposed SLAs.
Prefer KISS (Keep It Simple, Silly)
I firmly believe in the KISS principle when it comes to complicated problems. Custom metrics can get as complicated as you let them, and the more work it takes to compute a metric's value, the later that value becomes available. As those deltas grow, the scaling trigger fires late and many customers suffer. Trust me on this: I have benchmarked a lot of alternatives and noticed that the simpler solution works better. The easier the metric is to calculate, the better.
Keep in mind
The metric should be easy to collect.
The metric should be sensitive to load changes.
The correlation between the metric and the problem statement (latency, in general) should be high.
You can collect metrics before, during, or after the request has been processed.
My advice would be: the earlier you collect and store the metric, the better. When we collect metrics late, the first trigger to scale the deployment is delayed; the smaller that first delta, the earlier the trigger activates.
This drastically affects the P90 and P99 latencies observable to customers. Nobody wants to wait.
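As an illustration of "collect early", here is a minimal sketch using the Prometheus Python client: the counter is incremented the moment a request arrives, while the latency histogram can only be updated after the work is done. The metric and function names are made up:

import time
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; the point is *when* each one is updated.
REQUESTS_IN = Counter("requests_received_total", "Requests counted on arrival, before processing")
LATENCY = Histogram("request_latency_seconds", "Latency observed only after processing finishes")

def process(request):
    time.sleep(0.1)  # stand-in for real work
    return "ok"

def handle(request):
    REQUESTS_IN.inc()            # early signal: available as soon as the request lands
    start = time.monotonic()
    result = process(request)
    LATENCY.observe(time.monotonic() - start)  # late signal: only known after the work is done
    return result

if __name__ == "__main__":
    start_http_server(8000)      # exposes /metrics for Prometheus to scrape
    while True:
        handle("dummy")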
Visualizing the latency graph for bad and good metrics
Comparing the green and red graphs, the deviation from the p50 is higher in the red graph than in the green one, which is a red flag.
Our end goal is to keep latencies under the SLA even in the worst scenarios, so the focus should be on keeping the P90 and P99 latencies down. Those spikes are generally observed just before or during deployment scaling and should be minimized.
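If you want to sanity-check this from raw samples rather than a dashboard, percentiles are cheap to compute. A small sketch with invented latency numbers:

import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Invented latency samples in seconds, with a couple of slow outliers.
latencies = [0.21, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.30, 0.95, 1.40]

for p in (50, 90, 99):
    print(f"p{p} = {percentile(latencies, p):.2f}s")
# A wide gap between p50 and p90/p99 is the "red graph" situation described above.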
Finalizing your metrics!!
If you have multiple alternatives in mind that all work for your use case, you may wonder how to choose one over the other. Choosing on the basis of average latency works well, but my take is to prioritize RPS over average latency.
Why choose RPS over average latency
Average latency is, again, a metric observed only once a request has been processed, which makes it delayed feedback. I prefer RPS over latency because we can estimate the expected RPS of incoming requests by looking at the request-queue size, which is known before the request is processed, making the deltas almost negligible.
Thinking RPS
Imagine you are receiving 10 req/sec on average into your queue and you are not processing them in time. The queue size is bound to grow and will put your deployment under a resource crunch.
If your service has an SLA latency of 2 seconds, a pod takes about 1 second per request, and you are receiving 10 req/sec, then each pod can clear roughly two requests within the latency budget, so you should have at least 10 / 2 = 5 pods ready to accept requests.
This is simple math that helps us scale faster and more cheaply, without putting much burden on Prometheus, or on our brains, of course.
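Here is that back-of-the-envelope rule as a few lines of Python. Treat it as this article's lower bound for the replica count rather than a queueing-theory result; the names and numbers are illustrative:

import math

def min_pods(rps: float, seconds_per_request: float, sla_seconds: float) -> int:
    """Floor on replicas: each pod can clear roughly sla_seconds / seconds_per_request
    requests back-to-back within the latency budget."""
    requests_per_pod = sla_seconds / seconds_per_request
    return math.ceil(rps / requests_per_pod)

# 10 req/sec, ~1 s per request, 2 s SLA -> at least 5 pods.
print(min_pods(rps=10, seconds_per_request=1.0, sla_seconds=2.0))  # -> 5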
Conclusion
In this blog, I have attempted a small case study on how we can enable autoscaling in Kubernetes without much overhead. The calculations themselves take a similar amount of time whichever metric you pick; the real goal is to acquire the metric as early as possible. I still believe we cannot scale proactively with KEDA without using ML; all the methods described here are reactive, in the sense that we take action once the load is already high.