Let's Talk About Averages

Perhaps the most important part of our job is making sense of data, primarily numbers. Typically, we look at numbers provided by load testing tools, server and application monitors, log files, or database queries.

The goal is always to understand the behaviour of our system, but sometimes the tools we use provide us with misleading information.

Raw data versus aggregates

Let’s use the example of a simple load test. Say we run a short test which requests a single web page 100 times. The raw data would be what was recorded for each of these 100 requests (the key metric being response time). Raw data can be plotted in a chart called a scatter plot, and I would argue that this should be the first thing you look at whenever you are analysing any performance-related data.

Take the example scatter plot below which plots the response time for 100 requests:

The key observation is that there are three horizontal ‘bands’ of response times at around 3, 6, and 9 seconds. This might hypothetically be caused by a timeout/retry pattern where something is timing out after three seconds and re-submitting.

Now let’s take the exact same data and plot the average response time at one minute sample periods:

We can still see that response time is ranging between 3 and 9 seconds, but we have lost visibility of the pattern – we now understand less about the behaviour of the system under test. And what happens when we increase the sample period to five minutes?

We are now given the false impression that response time is around 5 seconds, when in reality there were no response times at all between 5 and 6 seconds during our test.
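To make the effect concrete, here is a minimal sketch using made-up data rather than the data behind the charts. It assumes 100 requests, one every 6 seconds, whose response times fall only in the 3, 6 and 9 second bands. The function name `average_per_period` and the request pattern are illustrative assumptions. In this simplified data every window happens to average the same value, whereas real data would vary from window to window.

```python
from statistics import mean

# Hypothetical data: 100 requests, one every 6 seconds (a 10 minute test).
# Response times fall only in the 3, 6 and 9 second bands.
pattern = [3.0, 6.0, 3.0, 9.0, 3.0, 6.0, 3.0, 9.0, 6.0, 3.0]
samples = [(i * 6, pattern[i % len(pattern)]) for i in range(100)]  # (seconds since start, response time)

def average_per_period(samples, period_seconds):
    """Group samples into fixed-width time windows and average each window."""
    windows = {}
    for timestamp, response_time in samples:
        windows.setdefault(timestamp // period_seconds, []).append(response_time)
    return [round(mean(values), 2) for _, values in sorted(windows.items())]

one_minute = average_per_period(samples, 60)    # ten windows
five_minute = average_per_period(samples, 300)  # two windows
observed = sorted({rt for _, rt in samples})    # the values that actually occurred

print("1 minute averages:", one_minute)
print("5 minute averages:", five_minute)
print("observed response times:", observed)
```

Every average comes out at 5.1 seconds, a response time that no request in this data set actually had.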

This does not just apply to averages (which cop a lot of flak!) but any aggregate metric including percentiles and medians.

Here’s another common example. Say we have a particular request which takes over a minute 5% of the time but responds quickly the rest of the time. If we plot the raw response time on a scatter (chart on the left) it is immediately obvious what is going on, but looking at the average of the same data (chart on the right) gives us the false impression that response time is around 4-5 seconds:
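The arithmetic behind this is easy to check. The sketch below assumes, purely for illustration, that the fast requests take 1 second and the slow ones take 70 seconds. These figures are not taken from the charts.

```python
from statistics import mean

# Hypothetical data: 95% of requests take 1 second, 5% take 70 seconds.
response_times = [1.0] * 95 + [70.0] * 5

average = mean(response_times)  # (95 * 1 + 5 * 70) / 100
slow_share = sum(1 for rt in response_times if rt > 60) / len(response_times)
near_average = [rt for rt in response_times if 4 <= rt <= 5]

print("average:", average)
print("share of requests over a minute:", slow_share)
print("requests that took 4-5 seconds:", len(near_average))
```

The average lands at 4.45 seconds even though not a single request took between 4 and 5 seconds.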

Sample period

The sample period is how often we collect our data. For example, we may take a snapshot of memory usage at five minute sample periods.

One of the most common situations where sample periods affect my job is in the interpretation of %CPU Utilisation metrics (a topic worthy of a separate blog post!). I was recently given production server resource usage data averaged at one hour sample periods. During this time the average CPU usage did not exceed 40%:

However, this does not mean that the CPU was not saturated during the month of March. Within any given one hour period there could be 10 minutes where the CPU was consistently hitting 100% and 50 minutes where the CPU was at 10%, and the average would come out at around 25%, which, again, hides the true behaviour of the system.
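As a quick check of that arithmetic, the sketch below builds one hypothetical hour of per-minute CPU readings (the one-minute granularity is an assumption for illustration) and compares the hourly average with what actually happened inside the hour.

```python
# Hypothetical hour of CPU readings, one per minute:
# 10 minutes saturated at 100%, 50 minutes at 10%.
minute_readings = [100.0] * 10 + [10.0] * 50

hourly_average = sum(minute_readings) / len(minute_readings)  # 1500 / 60
peak = max(minute_readings)
minutes_saturated = sum(1 for reading in minute_readings if reading >= 100.0)

print("hourly average %CPU:", hourly_average)
print("peak %CPU:", peak)
print("minutes at 100%:", minutes_saturated)
```

The hourly figure of 25% sits comfortably under a 40% ceiling while hiding ten minutes of saturation.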

Sample size

Be very careful when using aggregates when your data set is small. The smaller the sample size, the less confidence we can have in what we observe about it. If you are running short tests which only execute dozens or a few hundred requests, your aggregate figures will fluctuate wildly between test runs.

One particular issue I see pop up regularly is using percentiles when the sample size is less than a hundred. The “90th percentile” of some response time data tells us that “90% of response times took this long or less” (the 50th percentile is the median).

If we have a sample size of just ten records then the “95th percentile” is not a meaningful metric. In this situation the 95th and 99th percentiles will typically both be the single largest value, and depending on how the tool calculates percentiles the 90th may be too. Aggregates only tell us useful information when the sample size is large enough. I’m yet to see a tool which warns or adapts to this kind of situation.
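Here is a small sketch of why this happens, using the nearest-rank definition of a percentile. That is only one of several definitions in use, and tools differ (some interpolate between values), so treat the exact results as specific to this method. The function name and the ten sample values are made up for illustration.

```python
import math

def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile: the smallest value with at least
    `percentile` percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times (seconds) from a test with only ten requests,
# one of which was an outlier.
ten_samples = [1.2, 1.3, 1.1, 1.4, 1.2, 1.5, 1.3, 1.6, 1.4, 8.0]

p90 = nearest_rank_percentile(ten_samples, 90)
p95 = nearest_rank_percentile(ten_samples, 95)
p99 = nearest_rank_percentile(ten_samples, 99)

print("90th:", p90, "95th:", p95, "99th:", p99, "max:", max(ten_samples))
```

With ten records the 95th and 99th percentiles are both just the single slowest request, so one outlier decides the figure, and under this definition the 90th is simply the second slowest.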

The limitation of scatter plots

Scatter plots can’t tell us everything. For one, they don’t tell us about the density of the data we are looking at. Take a look at the very extreme example below:

Here we see a solid block of response time results ranging from 1 to 5 seconds. What we cannot see, however, is that there are four times as many records between 3 and 5 seconds as there are between 1 and 3. Even if I tweak the chart so we can see the difference:

… we still cannot see the scale of difference between the number of 3-5 second and 1-3 second response times.

One way you can see density is to ‘bucket’ your data to see how many records were recorded for each bucket. Below I have grouped the same response time data into 200 millisecond ‘buckets’ and plotted how many requests occurred for each. We can now see the scale of the difference between the two bands of response time:

These ‘buckets’ are a way we can use aggregates to improve our understanding of the raw data.
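Bucketing like this can be sketched in a few lines. The data below is synthetic, built so that there are four times as many requests in the 3-5 second band as in the 1-3 second band. The function name `bucket_counts` and the way the data is generated are assumptions for illustration only.

```python
from collections import Counter

def bucket_counts(response_times_ms, bucket_width_ms=200):
    """Count how many response times fall into each fixed-width bucket.
    Keys are the lower bound of each bucket in milliseconds."""
    counts = Counter((rt // bucket_width_ms) * bucket_width_ms for rt in response_times_ms)
    return dict(sorted(counts.items()))

# Hypothetical data in milliseconds: 200 requests spread across 1-3 seconds
# and 800 requests spread across 3-5 seconds.
fast = [1000 + (i * 37) % 2000 for i in range(200)]
slow = [3000 + (i * 37) % 2000 for i in range(800)]

buckets = bucket_counts(fast + slow)
below_3s = sum(count for lower, count in buckets.items() if lower < 3000)
from_3s = sum(count for lower, count in buckets.items() if lower >= 3000)

for lower, count in buckets.items():
    print(f"{lower:>5} ms  {'#' * (count // 4)}")
print("1-3 s:", below_3s, "3-5 s:", from_3s)
```

Plotting the counts per bucket (here as a crude text histogram) makes the difference in density visible in a way the scatter plot cannot.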

What else are aggregates good for?

Aggregates are good at tracking change over time. Given the density problem I mentioned earlier, it can be hard to see whether there is a general degradation over time by only looking at the raw data.
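As a rough illustration, the sketch below generates synthetic response times with a slow upward drift buried under much larger request-to-request variation, then takes the median of each block of requests. The data, the window size and the function name are all assumptions chosen to keep the example deterministic.

```python
from statistics import median

# Hypothetical data in milliseconds: 660 requests with a drift of 1 ms per
# request, hidden under a repeating variation of up to +/- 1000 ms.
response_times_ms = [2000 + i + ((i * 7) % 11 - 5) * 200 for i in range(660)]

def median_per_window(values, window_size):
    """Median of each consecutive block of `window_size` values."""
    return [median(values[start:start + window_size])
            for start in range(0, len(values), window_size)]

medians = median_per_window(response_times_ms, 110)

print("raw range:", min(response_times_ms), "to", max(response_times_ms))
print("median per 110 requests:", medians)
```

On a scatter plot the drift would be hard to pick out against two seconds of spread, but the per-window medians rise steadily from one window to the next.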

Aggregates can also be good at giving us a general sense of the performance in the right situation. For example – if the response time of a particular resource consistently takes around 2 seconds with no outliers or significant deviation then in that particular case the average is a reasonably accurate representation of the user experience.

Closing

The point of this article is not to dissuade you from using aggregated data. What is important is understanding the limitations of the data we are presented with, and of the data we provide to our customers. Ask yourself:

  • What does this data tell me?
  • What does it not tell me?
  • What seems to be shown by this data, but could be misleading?

If you keep asking yourself these questions you will enhance not only the value provided to your customers, but also the integrity of the performance testing industry.

This was originally posted on my WordPress blog.

Scott Stevens

Senior Performance Engineer

6 years ago

R has some handy built-in plotting features which help with this too.

Harjeet Johar

DevOps | Performance Engineering | Performance Testing | Power BI | PowerShell | Power Automate | Docker Containers | Logic Apps | Debugging | Root Cause Analysis | LoadRunner | JMeter | Neoload | DynaTrace | AppDynamics

7 years ago

Agree with this totally. Real numbers from a test are more important than the calculated ones.

Lee Shelton

Programme Test Manager

7 years ago

Great article Stephen Townshend
