Machine Learning and Load Testing: When Was the Last Time You Ran 60 Load Tests in a Day?
Image generated by OpenAI. I'm sorry, I really have no graphical skills

This article is my review of Akamas (Akamas.io), a solution to optimize your platform configuration, using machine learning. It's not sponsored and contains my opinions and impressions of the free trial. Make sure to check them out if you think you're too busy for tweaking and tuning.

"If it works, don't touch it"

It's a common rule not to disturb a well-defined and well-tested code base or configuration without justification. The main motivation for this rule is usually to prevent the cascading effect of any change brought to the system. It's never "just a line of code" or "just one property". Modern systems and applications are convoluted, and an improvement to one resource may result in a regression in another. An increase in memory size may result in longer garbage collection (GC) pauses. An increased thread count that was supposed to increase your throughput suddenly creates thread contention, slowing your application down. Batching requests improves resource efficiency, but at the cost of end-to-end times. A configuration that works in isolation doesn't necessarily guarantee the same results after integration or deployment on another platform.

Application performance is full of trade-offs like that, and reaching the perfect balance of resource usage and user satisfaction takes a lot of expertise and resources—and mostly, trial and error. These trials and errors come at a cost, and depending on your testing maturity, this cost can get high quickly. Apart from the costs of hardware, the workload required to run each test, and, of course, the time spent analyzing the results and preparing the report, the biggest cost is the time not spent on testing new deliveries, effectively blocking the delivery pipeline.

Prerequisites

Meeting the prerequisites for testing with Akamas is... worth it on its own. The operational requirements are really just a set of good practices, so I'll list them out, just in case you're still testing like it's 2011:

  • Continuous Testing: Whenever you hear "performance" and "continuous" in the same sentence, it typically means an independent, repeatable test that can run at any given time with no human intervention. Cleanup for problems like data drift, storage limits, and bloated database tables needs to be fully automated so they don't affect the next test run. This also means no manual steps: no updating that one property before each test because it was once entered as an attribute, no transferring that one file, no generating new test data by hand. You get the idea: I want to run a test, and it's running within a few minutes. The shorter the time between test runs, the faster your optimization converges. Machine learning requires lots of data, and it's best if you deliver it fast. Consider mocking to minimize the impact of external dependencies on your test results.
  • Continuous Monitoring: This is for monitoring the KPIs and SLAs. You should be able to monitor these and let the Akamas engine parse them. Akamas integrates with major observability providers, particularly Dynatrace, but even if you don't use any, it won't say no to a structured CSV file to gather the metrics, or an exposed Prometheus endpoint.
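To make the CSV option concrete, here is a minimal sketch of what exporting load-test metrics as a structured CSV might look like. The column names (`timestamp`, `component`, `metric`, `value`) and the sample metrics are my own illustration, not the exact schema Akamas expects:

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def export_metrics_csv(samples, out):
    """Write load-test samples as a structured CSV, one row per
    (timestamp, component, metric) sample. The column layout is
    illustrative only -- check the Akamas docs for the real schema."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "component", "metric", "value"])
    for ts, component, metric, value in samples:
        writer.writerow([ts.isoformat(), component, metric, value])

# Usage: three samples from a hypothetical test run
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
samples = [
    (start, "webapp", "response_time_p95", 210.0),
    (start, "webapp", "error_rate", 0.2),
    (start + timedelta(seconds=30), "webapp", "response_time_p95", 198.0),
]
buf = io.StringIO()
export_metrics_csv(samples, buf)
print(buf.getvalue())
```

The point is simply that the metrics pipeline should be scriptable end to end, so every test run publishes its data with no human in the loop.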


Using Akamas

Once you have your app integrated, you start by creating a study. This is a prime example of enforcing best practices because you can't create a study without defining a goal. Let's cover the basic elements of Akamas.

Main dashboard with my studies and their status

Akamas Study

A Study is an entity holding your experiment's requirements, criteria, and execution. Before you can run any tests with Akamas, you need to define the study goals. The most common ones are minimizing cluster costs, reducing response times, or lowering error rates. It could also be all three. This is probably the most important part of test execution because the parameters will be scored and weighted based on the outcome of these goals.


Setting Up a Study Goal

Goal Constraints

You can specify experiment assertions and flag them as failed if they don't meet certain criteria. These can be either absolute or relative to the baseline. The most common ones would relate to your SLOs, like response times or error rates. You could also specify your own, as long as Akamas has access to the metrics.
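To illustrate the difference between absolute and baseline-relative constraints, here is a toy sketch of how such checks could be evaluated. Akamas does this internally; the function, metric names, and thresholds below are all hypothetical:

```python
def check_constraints(metrics, baseline, constraints):
    """Evaluate experiment metrics against absolute and
    baseline-relative constraints; return a list of failure messages.
    Purely illustrative -- not Akamas's actual evaluation logic."""
    failures = []
    for name, rule in constraints.items():
        value = metrics[name]
        if "max" in rule and value > rule["max"]:  # absolute limit
            failures.append(f"{name}={value} exceeds {rule['max']}")
        if "max_vs_baseline" in rule:              # relative to baseline
            limit = baseline[name] * rule["max_vs_baseline"]
            if value > limit:
                failures.append(f"{name}={value} worse than {limit:.1f}")
    return failures

# Usage: this experiment improved the error rate but regressed latency
baseline = {"response_time_p95": 200.0, "error_rate": 0.5}
metrics = {"response_time_p95": 260.0, "error_rate": 0.3}
constraints = {
    "response_time_p95": {"max_vs_baseline": 1.2},  # no worse than +20%
    "error_rate": {"max": 1.0},                     # absolute SLO
}
print(check_constraints(metrics, baseline, constraints))
```

An experiment that violates a constraint like this gets flagged as failed, so the engine learns to steer away from that region of the parameter space.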


Adding goal constraints based on Metrics



KPIs


The Akamas engine will automatically add your study goals and constraints as KPIs. These KPIs will be the main factors for auto-tuning and will be used to determine the most optimal configuration. You can also add custom KPIs in this step. They serve as a guide for parameter definition.

KPIs populated based on the study goal and goal constraints



Windowing

If you've ever tried comparing multiple test runs, you'll have learned that they won't always look the same. Even if you run the same workload against the same application, external (or internal) factors can cause instability in your load test. Running the same test on a cold vs. a warmed-up environment may yield different results. To overcome this challenge, Akamas can automatically pick the timeframe of the test where the metrics are the most stable. Alternatively, you can trim the metrics yourself if you know the ramp-up in your test is enough to warm up the environment.
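Automatic windowing can be thought of as a sliding-window search for the least variable stretch of the run. This is my own rough analogue of the idea, not Akamas's actual method:

```python
import statistics

def most_stable_window(series, width):
    """Return (start, end) indices of the contiguous window of `width`
    samples with the lowest standard deviation -- a rough analogue of
    automatic windowing, not Akamas's actual implementation."""
    best_start, best_sd = 0, float("inf")
    for i in range(len(series) - width + 1):
        sd = statistics.pstdev(series[i:i + width])
        if sd < best_sd:
            best_start, best_sd = i, sd
    return best_start, best_start + width

# Usage: noisy ramp-up/warm-up, then a stable plateau at the end
latencies = [900, 610, 430, 310, 250, 205, 201, 199, 202, 200]
print(most_stable_window(latencies, 4))  # -> (6, 10), the plateau
```

Scoring only the stable window means cold-start noise doesn't pollute the comparison between experiments.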


Windowing options

Parameters

These are your inputs, and the Akamas engine will try optimizing them with each experiment (iteration). You can specify the exact parameters you want your experiments to focus on, or, if you want to take full advantage of the engine and test out many possibilities, you can just mark them all and let Akamas decide which parameter might help you achieve your goal.
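As a mental model, the parameter space is a mix of numeric ranges and categorical choices the engine can draw candidates from. The option names below mirror common JVM settings, but the ranges and the sampling helper are made up for the example:

```python
import random

# Illustrative JVM parameter domains an optimizer might explore --
# the names mirror common JVM options, the ranges are invented.
parameter_space = {
    "jvm_maxHeapSize_mb": {"type": "integer", "range": (256, 4096)},
    "jvm_newSize_mb": {"type": "integer", "range": (64, 1024)},
    "jvm_gcType": {"type": "categorical",
                   "values": ["Serial", "Parallel", "G1", "ConcMarkSweep"]},
}

def sample_configuration(space, rng=random):
    """Draw one random candidate configuration from the space."""
    config = {}
    for name, domain in space.items():
        if domain["type"] == "integer":
            lo, hi = domain["range"]
            config[name] = rng.randint(lo, hi)
        else:
            config[name] = rng.choice(domain["values"])
    return config

print(sample_configuration(parameter_space))
```

Marking all parameters simply widens this space; the engine then decides which dimensions actually move the goal metrics.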

Parameters selection based on available options from JVM


Running Studies

This is the stage where the magic happens. The convenient part is that you don't have to analyze every single run. The best part is leaving the study running over the weekend and coming back on Monday to the best configuration for your setup. You do, of course, get insights for each test run and can see the impact of each parameter change.

Akamas starts with application baselining: that's your reference run, using default or hinted parameters. Any deviation from the baseline KPIs is included in the study scoring and lets Akamas decide whether to pursue further changes to specific parameters.




Algorithm

At its heart, the Akamas optimization algorithm is an implementation of reinforcement learning. This means more accurate, efficient, and predictable results for multidimensional time-series data inputs. If you really want the details, they've published a paper on the algorithm's behavior:

https://15799.courses.cs.cmu.edu/spring2022/papers/07-knobs2/p1401-cereda.pdf

And a granted patent in the US:

https://patents.google.com/patent/US11755451B2/en

The key addition to the standard reinforcement learning algorithms you might know is adaptability to varying conditions of your workload and environment state. Noisy neighbors, partial environment instability, sudden workload changes: all of these may generate false signals to the engine. You can read all the details in the published papers.
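To give a flavor of the explore/exploit idea behind this kind of optimization, here is a deliberately tiny toy loop over a single parameter. It caricatures the concept only; it is not Akamas's patented algorithm, and the latency curve is invented:

```python
import random

def optimize(evaluate, low, high, step, iterations=30, epsilon=0.3, seed=1):
    """Toy explore/exploit search over one numeric parameter.
    With probability epsilon pick a random value (explore); otherwise
    nudge the best value found so far by one step (exploit).
    A caricature of the idea -- not Akamas's actual algorithm."""
    rng = random.Random(seed)
    best = (low + high) // 2          # "baseline" value
    best_score = evaluate(best)
    for _ in range(iterations):
        if rng.random() < epsilon:
            cand = rng.randrange(low, high + 1, step)            # explore
        else:
            cand = min(high, max(low, best + rng.choice([-step, step])))
        score = evaluate(cand)
        if score < best_score:        # lower score = better (e.g. p95 latency)
            best, best_score = cand, score
    return best, best_score

# Hypothetical latency curve with a sweet spot around heap = 512 MB
def fake_latency(heap_mb):
    return (heap_mb - 512) ** 2 / 1000 + 200

best, score = optimize(fake_latency, 256, 1024, 32)
print(best, round(score, 1))
```

The real engine additionally has to cope with noisy scores, many interacting dimensions, and shifting environment conditions, which is exactly what the paper and patent above address.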

Visual Representation of a 3D path finding algorithm based on reinforcement learning.

Study Results

In my example, Akamas managed to cut response times by 28%, just by modifying my heap size parameters, and it did well tuning parameters I had left undefined. At first glance, the total heap size remained the same, but tweaking the new (young) generation size yielded the best results. For the GC type, it chose Parallel, which is most suitable for small-heap applications. It also tried other collectors, like G1, Serial, and ConcMarkSweep, but looking at the numbers from the experiments, Parallel had the biggest potential for this workload and application. I was very pleased with the progress and accuracy of the test results. Setting a fixed new size is not the first choice of developers and is usually a last resort when dynamic sizing doesn't fulfill the criteria, but based on the end result, it turned out to be a good option.
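For readers less familiar with JVM tuning, a winning configuration of this shape (Parallel GC plus a fixed young generation) would translate into command-line flags roughly like the ones below. The sizes are placeholders, not the values Akamas actually chose in my study:

```python
# Hypothetical java command line for the kind of winning configuration
# described above: Parallel GC plus a fixed young-generation size.
# All sizes are placeholders, not the study's real output.
jvm_flags = [
    "-XX:+UseParallelGC",     # GC algorithm picked by the study
    "-Xms512m", "-Xmx512m",   # fixed total heap
    "-XX:NewSize=192m",       # fixed young-generation ("new") size
    "-XX:MaxNewSize=192m",
]
command = ["java", *jvm_flags, "-jar", "app.jar"]
print(" ".join(command))
```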


Technology Stack Support

Akamas comes with a plethora of optimization packs. I've only tried the JVM one, but the number of supported technologies is listed below. Each optimization pack comes with predefined parameters with already defined constraints. You can also combine optimization packs and set up your study to tune both your application and your Kubernetes cluster for your workload to ensure they fit together. One of the free trial studies showcases the optimization of a JVM application running on Kubernetes.


Optimization Packs Available

It also supports the major Load Testing tools out of the box, as well as the industry-standard monitoring tools.

Risks

Overfitting: Like all machine learning algorithms, reinforcement learning is prone to overfitting. This means that after too many iterations and too much training, the resulting parameters will only work for your test workload and the specific application conditions. The accuracy of the optimization depends greatly on how well your load tests represent your actual production workload. The tests you run also rarely include the "ghost" requests untraceable by APMs, which may contribute greatly to your overall application utilization. Keep that in mind before applying overfitted parameters to your production system. You should also add KPIs that keep some resources in reserve, such as an 80% CPU constraint, to leave a buffer for unforeseen load.
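The headroom idea boils down to a one-line guard. The function name and the 80% ceiling below are just an illustration of the kind of KPI you'd add:

```python
def cpu_headroom_ok(cpu_utilization_pct, ceiling_pct=80.0):
    """Reject configurations that consume too much CPU, keeping a
    buffer for load the test workload didn't model -- the '80% CPU
    constraint' mentioned above. Names and threshold are illustrative."""
    return cpu_utilization_pct <= ceiling_pct

print(cpu_headroom_ok(72.5))  # efficient config with headroom
print(cpu_headroom_ok(93.0))  # fast but saturated config, rejected
```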


Testing in Production

If you don't have a designated performance team or suitable tests to simulate your production workload, Akamas has an option to perform a Live Study: optimization on your live system. You can let Akamas monitor your current production metrics and provide recommendations to be applied manually by your team. Because I don't have a production system at the moment, I couldn't try this option, but here are the details in case you're interested:

https://docs.akamas.io/akamas-docs/using/study/live-optimization-studies


Opportunities

Exploration and Exploitation: Akamas doesn't have "best practices" embedded into its engine. And that's a good thing because it's not biased towards a configuration propagated by all the books and articles you can find online. Why? Because most often, the recommended parameters are for general use cases, and your application might have a specific footprint where standard configurations will not apply. This allows you to easily explore new options and configurations, and if they're proven beneficial, find their absolute limits by exploiting surrounding parameters.


My Impressions

Akamas provides an innovative approach to tuning your application: a sole focus on the KPIs and SLOs (which is all that matters) and a well-defined goal, turning your application into a black box. The more distributed your system, the more resources and costs your experiments can cut. For monolithic applications, the risk of overfitting might push you away from the "best" recommendations toward more conservative ones that are still more beneficial in the long run.

One thing I didn't cover is the CLI support, which is how Akamas is intended to be used in an enterprise setup.

Does it substitute for a domain expert if it yields equal or better results? One thing it doesn't tell you is why an applied parameter changed the behaviour of the application. Understanding the impact of a change helps shape your application's structure for the future. That's why oversight from someone familiar with the domain might still be required to make sure the configuration is also future-proof and will scale properly. On the other hand, you could always run a new study against the new version and ship a completely new configuration with each release.

It's definitely worth trying, especially if you have a lot of exploration tasks in your backlog that you know will take a long time, like trying out a new GC algorithm or simply cutting down the costs of your app. Akamas can do it for you. They have a free trial available here if you want to check them out.



Jakub Dering
Tech Lead for Performance Engineers / Conference Speaker
