Let's Talk About Averages
Perhaps the most important part of our job is making sense of data, primarily numbers. Typically, we look at numbers provided by load testing tools, server and application monitors, log files, or database queries.
The goal is always to understand the behaviour of our system, but sometimes the tools we use give us misleading information.
Raw data versus aggregates
Let’s use the example of a simple load test. Say we run a short test which requests a single web page 100 times. The raw data would be what was recorded for each of these 100 requests (the key metric being response time). Raw data can be plotted in a chart called a scatter plot, and I would argue that this should be the first thing you look at whenever you are analysing any performance-related data.
Take the example scatter plot below which plots the response time for 100 requests:
The key observation is that there are three horizontal ‘bands’ of response times at around 3, 6, and 9 seconds. This might hypothetically be caused by a timeout/retry pattern where something is timing out after three seconds and re-submitting.
Now let’s take the exact same data and plot the average response time at one minute sample periods:
We can still see that response time is ranging between 3 and 9 seconds, but we have lost visibility of the pattern – we now understand less about the behaviour of the system under test. And what happens when we increase the sample period to five minutes?
We are now given the false impression that response time is around 5 seconds, when in reality there were no response times at all between 5 and 6 seconds during our test.
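To make this concrete, here is a minimal Python sketch of the same effect. The data is invented (100 requests over ten minutes, each landing near 3, 6 or 9 seconds) and `average_per_period` is a hypothetical helper, not something a particular load testing tool provides.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical test data: (seconds since start, response time in seconds).
# Each response lands close to one of three bands: 3, 6 or 9 seconds.
samples = [
    (t, random.choice([3.0, 6.0, 9.0]) + random.uniform(-0.2, 0.2))
    for t in range(0, 600, 6)  # 100 requests over ten minutes
]

def average_per_period(samples, period_seconds):
    """Group samples into fixed-length periods and average each group."""
    groups = {}
    for timestamp, response_time in samples:
        groups.setdefault(timestamp // period_seconds, []).append(response_time)
    return [mean(values) for _, values in sorted(groups.items())]

def in_gap(value):
    """True if a value sits clearly between the bands."""
    return 3.5 < value < 5.5 or 6.5 < value < 8.5

one_minute = average_per_period(samples, 60)    # ten averages
five_minute = average_per_period(samples, 300)  # two averages

# No raw response time sits between the bands...
raw_in_gaps = [rt for _, rt in samples if in_gap(rt)]
# ...but the averages are free to land there (typically some do).
averages_in_gaps = [a for a in one_minute + five_minute if in_gap(a)]
```

Printing `raw_in_gaps` next to `averages_in_gaps` shows the problem: the raw data never falls between the bands, while the averaged series can report values that no request actually produced.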
This does not just apply to averages (which cop a lot of flak!) but any aggregate metric including percentiles and medians.
Here’s another common example. Say we have a particular request which takes over a minute 5% of the time but responds quickly the rest of the time. If we plot the raw response time on a scatter plot (chart on the left) it is immediately obvious what is going on, but looking at the average of the same data (chart on the right) gives us the false impression that response time is around 4-5 seconds:
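The arithmetic behind this is easy to check. The sketch below uses made-up figures (95 requests at 1 second and 5 requests at 70 seconds), chosen only to match the "over a minute 5% of the time" description:

```python
from statistics import mean, median

# Hypothetical data: 95 fast responses (1 s) and 5 slow ones (70 s).
response_times = [1.0] * 95 + [70.0] * 5

average = mean(response_times)    # (95 * 1 + 5 * 70) / 100 = 4.45 s
middle = median(response_times)   # 1.0 s, the typical fast response

# Share of requests that took longer than a minute.
slow_share = sum(rt > 60 for rt in response_times) / len(response_times)

# Number of requests that actually took "around 4-5 seconds".
near_average = sum(4.0 <= rt <= 5.0 for rt in response_times)
```

The average comes out at 4.45 seconds, yet `near_average` is zero: not one request took anything like that long. The median has the opposite problem, since it hides the slow 5% entirely.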
Sample period
The sample period is how often we collect our data. For example, we may take a snapshot of memory usage at five minute sample periods.
One of the most common situations where sample periods affect my job is in the interpretation of %CPU Utilisation metrics (a topic worthy of a separate blog post!). I was recently given production server resource usage data averaged at one hour sample periods. During this time the average CPU usage did not exceed 40%:
However, this does not mean that the CPU was not saturated during the month of March. Within any given one hour period there could be 10 minutes where the CPU was consistently hitting 100% and 50 minutes where it was at 10%, and the average would come out at around 25%, which again hides the true behaviour of the system.
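As a quick check of that arithmetic, here is a small Python sketch. It assumes we had per-minute readings for a single hour, which is a hypothetical finer-grained data set, not the hourly data I was actually given:

```python
from statistics import mean

# One hypothetical hour of per-minute CPU readings (percent):
# 10 minutes saturated at 100%, then 50 minutes at 10%.
per_minute = [100.0] * 10 + [10.0] * 50

# What an hourly sample period would report:
hourly_average = mean(per_minute)  # (10 * 100 + 50 * 10) / 60 = 25.0

# What the hourly figure hides:
saturated_minutes = sum(cpu >= 100.0 for cpu in per_minute)  # 10
peak = max(per_minute)                                       # 100.0
```

Reporting the peak (or the number of saturated minutes) alongside the average is one simple way to keep that information when a longer sample period is unavoidable.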
Sample size
Be very careful when using aggregates on a small data set. The smaller the sample size, the less confidence we can have in what we observe about it. If you are running short tests which only execute dozens or a few hundred requests, your aggregate figures will fluctuate wildly between test runs.
One particular issue I see pop up regularly is using percentiles when the sample size is less than a hundred. The “90th percentile” of some response time data tells us that “90% of response times took this long or less” (the 50th percentile is the median).
If we have a sample size of just ten records then the “95th percentile” is not a meaningful metric. In this situation the 90th, 95th, and 99th percentiles are all determined by just the one or two slowest records, and depending on how the tool calculates percentiles some or all of them will be equal. Aggregates only tell us useful information when the sample size is large enough. I’m yet to see a tool which warns about or adapts to this kind of situation.
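Here is a small Python sketch of what happens with ten records. It uses the nearest-rank method, which is only one of several percentile definitions (tools differ, and I am not claiming any particular tool works this way), and the response times are invented:

```python
import math

def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile: the smallest recorded value that has at
    least `percent` percent of the records at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

# Ten hypothetical response times in seconds, with one slow outlier.
ten_records = [1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 1.6, 1.8, 9.0]

p90 = nearest_rank_percentile(ten_records, 90)  # 9th of 10 records: 1.8
p95 = nearest_rank_percentile(ten_records, 95)  # 10th of 10 records: 9.0
p99 = nearest_rank_percentile(ten_records, 99)  # 10th of 10 records: 9.0
```

With this method the 95th and 99th percentiles are both just the single slowest record, and the 90th is the second slowest. Other methods interpolate between those top two records or collapse all three figures together, but either way one or two requests decide every number.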
The limitation of scatter plots
Scatter plots can’t tell us everything. For one, they don’t tell us about the density of the data we are looking at. Take a look at the very extreme example below:
Here we see a solid block of response time results ranging from 1 to 5 seconds. What we cannot see, however, is that there are four times as many records between 3 and 5 seconds as there are between 1 and 3. Even if I tweak the chart so we can see the difference:
… we still cannot see the scale of difference between the number of 3-5 second and 1-3 second response times.
One way you can see density is to ‘bucket’ your data to see how many records were recorded for each bucket. Below I have grouped the same response time data into 200 millisecond ‘buckets’ and plotted how many requests occurred for each. We can now see the scale of the difference between the two bands of response time:
These ‘buckets’ are a way we can use aggregates to improve our understanding of the raw data.
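Bucketing is straightforward to do yourself. The sketch below is a minimal Python version using invented data in whole milliseconds; `bucket_counts` is a hypothetical helper name, and the 200 millisecond width simply matches the chart described above:

```python
from collections import Counter

def bucket_counts(response_times_ms, bucket_ms=200):
    """Count how many response times fall into each fixed-width bucket.
    Keys are the lower bound of each bucket, in milliseconds."""
    counts = Counter((rt // bucket_ms) * bucket_ms for rt in response_times_ms)
    return dict(sorted(counts.items()))

# Hypothetical data: 100 records spread across 1-3 seconds and
# 400 records spread across 3-5 seconds (four times as many).
response_times_ms = list(range(1000, 3000, 20)) + list(range(3000, 5000, 5))

counts = bucket_counts(response_times_ms)

# Print a crude text histogram, one row per bucket.
for lower_bound, count in counts.items():
    print(f"{lower_bound:>5} ms  {'#' * count}")
```

Each bucket between 1 and 3 seconds holds 10 records and each bucket between 3 and 5 seconds holds 40, so the difference in density that the scatter plot hid is visible at a glance.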
What else are aggregates good for?
Aggregates are good at tracking change over time. Given the density problem I mentioned earlier, it can be hard to see whether there is a general degradation over time by only looking at the raw data.
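As a rough illustration, the Python sketch below builds an invented ten minute data set where response time drifts upward slowly underneath a lot of noise, then takes the median of each one minute period. The drift rate, noise level and the `median_per_period` helper are all assumptions made for the example:

```python
import random
from statistics import median

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical ten minute test, one request per second: response time
# drifts upward by 2 ms per second, buried under +/- 0.5 s of noise.
samples = [
    (t, 2.0 + 0.002 * t + random.uniform(-0.5, 0.5))
    for t in range(600)
]

def median_per_period(samples, period_seconds=60):
    """Median response time for each fixed-length period, in time order."""
    groups = {}
    for timestamp, response_time in samples:
        groups.setdefault(timestamp // period_seconds, []).append(response_time)
    return [median(values) for _, values in sorted(groups.items())]

per_minute_medians = median_per_period(samples)

for minute, value in enumerate(per_minute_medians):
    print(f"minute {minute}: {value:.2f} s")
```

On a scatter plot the noise would make a drift of this size hard to pick out by eye, whereas the per-minute medians climb steadily from the first minute to the last.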
Aggregates can also be good at giving us a general sense of the performance in the right situation. For example – if the response time of a particular resource consistently takes around 2 seconds with no outliers or significant deviation then in that particular case the average is a reasonably accurate representation of the user experience.
Closing
The point of this article is not to dissuade you from using aggregated data. What is important is understanding the limitations of the data we are presented with, and of the data we provide to our customers. Ask yourself:
- What does this data tell me?
- What does it not tell me?
- What seems to be shown by this data, but could be misleading?
If you keep asking yourself these questions you will enhance not only the value provided to your customers, but also the integrity of the performance testing industry.
This was originally posted on my WordPress blog.
Senior Performance Engineer
6 years ago: R has some handy built-in plotting features which help with this too.
DevOps | Performance Engineering | Performance Testing | Power BI | PowerShell | Power Automate | Docker Containers | Logic Apps | Debugging | Root Cause Analysis | LoadRunner | JMeter | Neoload | DynaTrace | AppDynamics
7 years ago: Agree with this totally. Real numbers from the test are more important than the calculated ones.
Programme Test Manager
7 years ago: Great article, Stephen Townshend.