Performance Stats - Why Average is Wrong
(Poster image: https://i.stack.imgur.com/jAKWc.png)


This came up recently in a discussion. The question was: can an average really be used as the metric for something as serious as a performance benchmark? We have all seen charts like this:

Do they actually mean something? Do they mean anything at all? Unfortunately, most people reading this graph simply lack the foundations needed to interpret this sort of data. That is partly because the data themselves are not what any professional would call meaningful.

Observe that when a system runs over an input, the time taken to produce the outcome varies not only from input to input, but also from run to run. Formally, then, given the same input, the time taken can be any value, even an infinite one. That qualifies the time taken to respond as a Random Variable. (These random variables are not variables per se; they are a very special type of function called a Measurable Function, if you are mathematically inclined.)

What are examples of a random variable? Well, the result of a die throw works, and so does the result of a coin toss. Suppose a person is standing at some point. A coin is tossed: if heads comes up, he moves one step left; if tails comes up, he moves one step right. His position relative to the starting point is another random variable, and this specific phenomenon is called a Random Walk.
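A minimal sketch of that coin-toss walk; using Python rather than nJexl, and a fair coin with unit steps, are my assumptions here:

```python
import random

def random_walk(tosses=1000):
    """Toss a fair coin `tosses` times: heads moves one step left, tails one step right."""
    position = 0
    for _ in range(tosses):
        position += -1 if random.random() < 0.5 else 1
    return position

# The final position, measured from the starting point, is a random variable:
# it comes out different on (almost) every run.
print(random_walk())
```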

There is a hypothesis that stock market prices follow a random walk. Why are these things important for understanding the performance of a system? Because the current queue size on a system under test is nothing but a random walk. It is easy to see why, so I will only sketch it: in every time slice there is some chance that a new request arrives, and an independent chance that a queued request is fulfilled and dispatched to the client. The balance between the two is stochastic in nature.
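A sketch of that queue-size walk in Python; the per-slice arrival and service probabilities below are placeholders I picked for illustration, not measured values:

```python
import random

def queue_size_walk(ticks=10_000, p_arrival=0.5, p_service=0.5):
    """Per time slice: a request may arrive (p_arrival) and, independently,
    one queued request may be served (p_service). Queue length never drops below 0."""
    size, trace = 0, []
    for _ in range(ticks):
        if random.random() < p_arrival:                 # a new request shows up
            size += 1
        if size > 0 and random.random() < p_service:    # one request is dispatched
            size -= 1
        trace.append(size)
    return trace

trace = queue_size_walk()
print("final queue size:", trace[-1], "| max queue size seen:", max(trace))
```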

So, how do we characterise the time taken to respond to the client? A random variable is fully characterised by its Density Function, and the density function can be obtained experimentally by the method of histograms; observe the image below (which shows why the average is bad):

But the image is wrong. To actually generate the density function, one needs to normalise the data, that is, plot the densities over time slices. The time slices read: between 0 and 1 sec, x_1 % of requests; between 1 and 2 sec, x_2 % of requests; and so on, so that between n and (n+1) sec we have x_n % of requests. A curve can then be established over the histogram, which is known as the PDF. Interestingly, a PDF f(t) has a very nice property, because probability is a measure normalised to 1:

∫[0, ∞) f(t) dt = 1
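A rough sketch of that normalisation step, assuming numpy and a synthetic lognormal sample as a stand-in for real measured response times:

```python
import numpy as np

# Stand-in data: in a real test these would be your measured response times, in seconds.
response_times = np.random.lognormal(mean=0.0, sigma=0.75, size=5000)

# One-second time slices: density[n] is the normalised share of requests in [n, n+1) sec.
edges = np.arange(0, np.ceil(response_times.max()) + 1)
density, _ = np.histogram(response_times, bins=edges, density=True)

# The bin width is 1 second, so the densities sum to ~1 -- the defining PDF property.
print(density.sum())
```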

Also, more interestingly, it tells you how many requests (as a percentage) complete between times T1 and T2, or rather, what the probability is that a response time lies between T1 and T2:

P(T1 ≤ response time ≤ T2) = ∫[T1, T2] f(t) dt

That is precisely what a PDF is known for. It also gives a glimpse of what we should actually measure, given that there is a PDF. Suppose we have a real-time system; that means there is a deadline Td by which we should have responded. What we really want is to have almost all of the PDF to the left of the deadline Td, that is:

∫[0, Td] f(t) dt ≈ 1

Note that this holds because no system can respond in less than 0 time. Now, what do we mean by approximating 1? We probably mean (pun intended) that the probability that the system responds before Td is almost 1, which in most practical scenarios cannot be achieved. So we make an engineering trade-off and choose a critical probability instead of 1:

∫[0, Td] f(t) dt ≥ Pc

In most cases *people* have chosen this Pc to be 0.9, which in English is called the 90th percentile mark. Where would it be useful? Everywhere. The formula is self-explanatory; with Pc = 0.9 it basically says:

"90% of the time, the response time experienced by a user will be less than or equal to this time: T90."

Hence, it is the only viable way to reduce performance metrics to a single scalar. There are fancier things to do, such as finding the spread of the distribution or checking whether it is a long-tailed one (see the image below), but those are fancy indeed. The 90th percentile works, and works well, for 80% of applications. For fancier applications we can choose 99%, or, if you are into the whole idea of Six Sigma, 99.97%.
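A small sketch of how that T90 mark could be pulled out of a sample, again on synthetic stand-in data rather than real measurements:

```python
import numpy as np

# Stand-in data; substitute the response times gathered from your own test run.
response_times = np.random.lognormal(mean=0.0, sigma=0.75, size=5000)

t90 = np.percentile(response_times, 90)       # the T90 mark, i.e. Pc = 0.9
covered = np.mean(response_times <= t90)      # empirical share answered within T90

print(f"T90 = {t90:.3f} s; {covered:.1%} of requests were answered within it")
```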

 

 

Choosing these parameters is not of much importance here; they are a matter of choosing what suits your need. But one thing is certain: averages do not work for you, ever. A much better alternative is the median, which is nothing but the 50th percentile, thus giving you a 50/50 chance. Choosing the median makes your Pc 0.5, while choosing the 90th percentile makes your Pc 0.9. Now take a look at the poster image of this post. Given a curve shaped like that, the average would most certainly fail to convey any meaning. Averages also fail for asymmetric distributions, including long-tailed ones. (The Poisson distribution is a special case: it has a single parameter, which is the average and which defines the whole distribution, so its 90th percentile is a function of that same parameter.)
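To make the failure of the average concrete, here is a hedged sketch on a synthetic long-tailed sample; the two-component exponential mixture below is my stand-in, not measured data:

```python
import numpy as np

rng = np.random.default_rng(7)
# A long-tailed stand-in: most requests are fast, a few hit multi-second stalls.
times = np.concatenate([rng.exponential(0.2, 9500),   # the bulk, ~200 ms scale
                        rng.exponential(5.0, 500)])   # the rare slow tail

print("mean   :", round(float(times.mean()), 3), "s")
print("median :", round(float(np.median(times)), 3), "s")
print("P90    :", round(float(np.percentile(times, 90)), 3), "s")
# The tail drags the mean well above what most users actually experience;
# the median and P90 read off the experience of 50% and 90% of users directly.
```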

Addendum:

Normally the buzz around a post wanes over time. But it looks like this particular post is still alive and proverbially kicking some very old, long-held spurious belief systems. So I will step down from the theorist's level to the performance minion's level and show a real, practical example of what actually happens. Here is my favourite (and self-created) nJexl code to perf-test Bing.com:

All it does is call the Bing server 30 times with the query "average", to produce the following result on response times (this is do-it-at-home-yourself stuff):
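The nJexl listing itself is embedded as an image; as a hedged stand-in, a rough Python equivalent of the described test, assuming plain HTTPS GETs through the requests library in place of the nJexl client, might look like this:

```python
import time
import requests   # assumption: a plain HTTPS GET stands in for the original nJexl client

timings = []
for _ in range(30):                                    # 30 calls, as described above
    start = time.perf_counter()
    requests.get("https://www.bing.com/search",
                 params={"q": "average"}, timeout=10)
    timings.append(time.perf_counter() - start)        # full round-trip time, seconds

timings.sort()
print("min  :", round(timings[0], 3), "s")
print("mean :", round(sum(timings) / len(timings), 3), "s")
print("p90  :", round(timings[int(0.9 * len(timings)) - 1], 3), "s")
print("max  :", round(timings[-1], 3), "s")
```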

Thus, in short, the claim that a stable system generates responses that do not differ much from the average is very wrong. That notion comes from not comprehending the Dirac delta function, which is the limit of a Normal distribution as its variance goes to 0. Averages are, almost surely, a bad statistic for a population. And yes, I say this fully aware of the law of large numbers and the Central Limit Theorem. Observe that, in reality, the 90th percentile differs significantly from the maximum as well as from the average.

Thus, the summary: do not use the average, anywhere, ever, in performance testing. That is the worst cardinal sin a performance tester can commit. Just because something is easy to calculate does not mean it is meaningful, and in this case it is not. Averages do matter when we work under the Martingale hypothesis, but that generally does not arise in performance testing (you cannot take an unbiased random walk as the model for queue-size build-up; that is not engineering).

Kai Zhou

SIMPLICITY IS THE ULTIMATE SOPHISTICATION

8y

The response time for the logically same request type is almost consistent (for example, within a few milliseconds of the average) when a system is running under a strictly safe load. When scaling up the load, from a certain point the response time starts growing... that new load level is no longer safe for the system, and given a long enough period of running under unsafe load, a "timeout" is destined to happen. The average is very meaningful for identifying the safe load, because there it is almost the truth.

Rahul Verma

Author of The Last Book on Testing | A Student of Testing & AI | Satirist

9y

Ruling out something as totally wrong is a maxim to be avoided. You are of course better placed at math than I am. Here's my reference for what I am saying: https://msdn.microsoft.com/en-us/library/bb924370.aspx

Sergio Boso

ISO 27001 auditor, DPO & consultant

9y

Hi, interesting study; I agree with this analysis. I would only like to point out that IT systems are usually modelled using queueing theory, so response time cannot be predicted through a Gaussian distribution. A much more complex distribution is required instead, which reinforces your picture. Besides that, if we take a pure end-user service point of view, percentiles answer a very clear and important question: what kind of service level am I going to give to 90% (or 99%) of my customers?

Anirban Chatterjee

Applied Science Leader@WalmartLabs, India

9y

Overall I agree with you except on a small point. There can be metrics where people are not really interested in transient peaks or dips. For example, in performance engineering, CPU usage or memory usage can be such metrics. The same holds for various business metrics as well, such as the sales of a particular product. Having said that, I totally agree that computing TP90 for these metrics will serve our purpose, but for metrics where we are not at all interested in transient behaviour, computing the average makes more sense, as it is more computationally efficient.

Scott Stevens

Senior Performance Engineer

9y

Averages can be misleading. However, you need to view them from two perspectives: analysis of results versus the creation of your workload model. As Stephen said, if your response time is consistent and all other statistical tests indicate the average is meaningful, then yes, use it within those caveats. When creating a workload you don't have a choice but to think of average session time +/- a variance, average transactions per second, etc., as that is the foundation of Little's Law.

