Spread, It's Not Just for Sandwiches Anymore (Location and Spread, Part 2 of 2)

A long time ago, when I still had student drivers living under my roof, I went to a lecture regarding road hazards and teenage drivers. The lecturer said on any given Friday night 62% of all drivers were under the influence of alcohol. I raised my hand and asked, “What is the standard deviation?”

The lecturer blinked twice. With a blank stare, she said, “I don’t know.”

Unsatisfied by the answer, I probed a little deeper, “The 62% number must be the mean…”

Again, a blank stare.

“An average,” I explained. “But what is the variance associated with that statistic?”

Another blank stare.

I tried to elaborate, “Is that number based on urban drivers? Rural? Both? Is the percentage constant? Or does it vary by hour? Is it based on gender or age? Why should I just accept that percentage is true?”

She admonished, “It’s in a magazine!” (That’s what people used to spew callously as the common explanation above reproach before high tech changed it to, “It’s on the internet!”)

Whenever someone tries to cram a mean down your throat, demand the standard deviation!

In Part 1 of this two-part series, the concept of statistical Location was discussed, where the location value represents the “best” number summarizing a data set. That’s only half the one-two punch making statistics such a powerful tool. The other half is Spread.

The mean and standard deviation will be discussed as values for Location and Spread, respectively, but first let me be a little more generic. Spread gives a measure of variance surrounding the location value. If the data set has low variance, then the data points are more tightly clustered around the Location. If the variance is high, then the data points are more spread out around the Location.  

Figure: Two data distributions with the same location value. One is narrow with a high peak, indicating low variance, so the data is clustered tightly around the location value. The other has greater variance, so the distribution is wider and the peak is lower.

Spread validates the Location.

Low variance makes the Location more believable as a value best summarizing the data set, while high variance challenges that believability.

Often Location and Spread take the forms of mean (i.e., the average) and standard deviation. So, what the blazes is a standard deviation? Take a joyride with me through Formula Funland. Recall from Part 1 the formula for the mean, which is just mathematical mumbo jumbo for saying, “Add all the data values and divide by the total number of data points.”

Formula to calculate the arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where $\bar{x}$ is the sample mean, $i$ is a counter incrementing by one, $x_i$ represents each data point in the data set, and $n$ is the total number of data points in the data set.

Now, let’s stay on the trolley further to the standard deviation ride: 

Formula to calculate the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Where $s$ is the sample standard deviation, $\bar{x}$ is the sample mean, $i$ is a counter incrementing by one, $x_i$ represents each data point in the data set, and $n$ is the total number of data points in the data set.

Yikes! You didn’t know you got onto a roller coaster! But do you notice anything similar between the formulas? Yes! You’re still adding up “stuff” and dividing by n (actually n-1, a correction for the bias of the variance estimator, a “fudge factor”). So basically, it’s still an average. In the numerator, the mean is subtracted from each value in the data set. That is the distance between each data point and the mean. Or said another way, it’s the deviation of each data point from the mean. These deviations are squared and added together, then the sum is divided by n (ahem, n-1) to average them. Hence, a standard deviation is, loosely speaking, the average deviation from the mean. Or, roughly, each data point typically deviates from the mean by about the value calculated as the standard deviation.

By the way, the squaring of each deviation from the mean in the numerator ensures no value is negative, so deviations above and below the mean can’t cancel each other out. Then, taking the square root at the end converts it back into the same units as whatever you’re measuring. Having Location and Spread in the same units allows them to complement each other nicely.
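For anyone who wants to ride Formula Funland with actual numbers, here is a minimal Python sketch of the two formulas. The five measurements are made up purely for illustration, and the function names `sample_mean` and `sample_std_dev` are my own, not anything official. The last line checks the hand-rolled result against Python's built-in `statistics.stdev`.

```python
import math
import statistics


def sample_mean(values):
    """Add all the data values and divide by the number of data points."""
    return sum(values) / len(values)


def sample_std_dev(values):
    """Sample standard deviation: square root of the summed squared
    deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sample_mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


# Hypothetical measurements, invented purely for illustration.
data = [27.8, 28.1, 28.0, 27.9, 28.2]

x_bar = sample_mean(data)
s = sample_std_dev(data)

# Without squaring, deviations above and below the mean cancel out.
raw_deviation_sum = sum(x - x_bar for x in data)

print(f"mean = {x_bar:.3f}")
print(f"standard deviation = {s:.3f}")
print(f"sum of unsquared deviations = {raw_deviation_sum:.3f}")
print(f"library check = {statistics.stdev(data):.3f}")
```

For these made-up numbers the mean lands at 28.0 and the standard deviation at roughly 0.158, while the unsquared deviations sum to essentially zero. That zero is exactly why the squaring step is there.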

Spread (or variance) has to stay positive because it is a measure of distance. Why must distance be positive? Well, if you drive five miles to a store, then on your way back you don’t drive negative five miles. You still drive five miles back for a total of ten miles driven on that trip. Think about it, if you drove five miles there and negative five miles back, then you’ve driven zero miles total. So, how did you burn up half a gallon of gas if you drove zero miles? Distance is a positive value.

The mean is a Location value and the standard deviation is a value for Spread. Put them both together and the mean is the one number that best summarizes the data set and the standard deviation is the average amount each data point varies from the mean. So, the smaller the standard deviation, the less overall spread there is around the mean.

Of course, it’s more complicated than that and there are a boatload of assumptions, criteria, and caveats (I mean we statisticians have to ensure job security), but in a nutshell that’s the gist of the one-two punch of Location and Spread making statistics so powerful.

Figure: A skewed distribution with a long tail extending to the right, showing how the mean is “pulled” toward the long tail.

So, for better or for worse, often Location and Spread take the forms of mean and standard deviation, even when they may not be very good—like for skewed distributions where extreme values cause a long tail in one direction. If you recall from Part 1, extreme values in the data set can “pull” the mean toward them, thereby distorting the mean’s ability to be the best number summarizing the data. This can lead to inflating the standard deviation since the mean is used in calculating the standard deviation.
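To see that “pull” in action, here is a small Python sketch using made-up data: five well-behaved measurements, then the same five plus one extreme value. The numbers are hypothetical and chosen only to make the effect obvious.

```python
import statistics

# Hypothetical, well-behaved measurements clustered around 28.
typical = [27.8, 27.9, 28.0, 28.1, 28.2]

# The same data with one invented extreme value tacked on.
with_outlier = typical + [35.0]

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)

sd_before = statistics.stdev(typical)
sd_after = statistics.stdev(with_outlier)

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"standard deviation: {sd_before:.2f} -> {sd_after:.2f}")
```

With this toy data, a single extreme value drags the mean above every one of the original five measurements and inflates the standard deviation many times over.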

Recall there are values other than the mean which can be used to represent Location. Similarly, there are other values which can be used for Spread, but they can be difficult to calculate, interpret, and use. So, I won’t even discuss them in this forum. You can consult a statistician if your data set is skewed. (Oh boy! More job security for statisticians.)

I can hear you now, “Great, Dave, that’s all interesting. I mean, the mean is, well, meaningful. But how will that standard deviation thingy really help me?”

It’s true the mean is a number that makes intuitive sense in its own right. It provides an anchor for your data set. Whereas the standard deviation, on its own, is a bit more nebulous. In combination with the mean, it gives a framework in which to assess the validity of the mean, but it does have many more uses.

The standard deviation is a measure of variance. Variance is required to construct probability models. Probability models are required to calculate the probability of events, to test differences between means (using such tools as t-tests, proportions tests, or ANOVA), and to perform regression analysis, reliability analysis, time series analysis, contingency table analysis, multivariate analysis, and on, and on.

Without variance, statistics would not exist.

I can see your eyes glazing over and hear your thoughts coming through loud and clear in your best Mr. Spock voice, “Fascinating.” But still you want to know how the standard deviation can boost your bottom line.

Okay, suppose you’re tasked with finding a supplier to provide a component for a product. The nominal value to hit for a critical characteristic is 28.0 units (you can replace units with inches, millimeters, foot-pounds, degrees, millimoles, hectares, cubits, Mississippis, or whatever unit makes sense in your unique perception of reality). You’ve narrowed it down to two suppliers who will initiate product development, prepare prototypes, and work up pricing. Then they come back with the results.

Supplier A can meet the nominal value and the cost per part is one dollar.

Supplier B can meet the nominal value and the cost per part is ten cents.

Which do you choose?

At face value, it seems a no-brainer. If they can both meet the nominal value, why spend one dollar per part when you can get it for ten cents? But, before you make that final decision, a nagging, prickly feeling tingles at the nape of your neck. You vaguely recall an arcane article you read on LinkedIn written by some obscure stats-data whisperer-whatever guy who said, “Whenever someone tries to cram a mean down your throat, demand the standard deviation!” So, you ask and they show the results below.

Figure: Histograms of the two suppliers’ data. Both meet the specification of 28 units (Supplier A mean is 27.998 and Supplier B mean is 28.007), but the standard deviation for Supplier A is 0.204 and the standard deviation for Supplier B is 0.800, roughly four times that of Supplier A. The histogram for Supplier A is narrow with a high peak at the mean. The histogram for Supplier B is much wider and has a lower peak at its mean. This demonstrates that the parts produced by Supplier A tend to measure close to the mean more consistently than the parts from Supplier B.

Now, which do you choose?

You may flip-flop, but again it seems like a no-brainer. With its significantly lower standard deviation, Supplier A will provide more uniform parts. It’s reasonable to assume the lower variance will lead to much greater consistency in the functioning of your product. Whereas the much higher variance exhibited by the Supplier B parts will probably cause wider swings in your product and it may behave erratically. There’s also a much higher chance Supplier B will ship parts out-of-spec, leading to additional costs for sorting and rework. So, you decide to award the job to Supplier A.
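To put a rough number on that out-of-spec risk, here is a hedged Python sketch. It leans on two assumptions that are mine and not part of the supplier results: the measurements are normally distributed, and the spec limits are a hypothetical 28.0 plus or minus 1.0 units. Only the means and standard deviations come from the suppliers’ data. Change `TOLERANCE` to whatever your real drawing says.

```python
from statistics import NormalDist

NOMINAL = 28.0
TOLERANCE = 1.0  # hypothetical spec tolerance, assumed for illustration
LOWER_SPEC = NOMINAL - TOLERANCE
UPPER_SPEC = NOMINAL + TOLERANCE


def fraction_out_of_spec(mean, std_dev):
    """Estimated share of parts outside the spec limits,
    assuming the measurements follow a normal distribution."""
    dist = NormalDist(mu=mean, sigma=std_dev)
    below = dist.cdf(LOWER_SPEC)
    above = 1.0 - dist.cdf(UPPER_SPEC)
    return below + above


# Means and standard deviations as reported for each supplier.
frac_a = fraction_out_of_spec(27.998, 0.204)
frac_b = fraction_out_of_spec(28.007, 0.800)

print(f"Supplier A out of spec: {frac_a:.4%}")
print(f"Supplier B out of spec: {frac_b:.4%}")
```

Under those assumptions, roughly one Supplier B part in five would fall outside the limits, versus a vanishingly small fraction for Supplier A. A different tolerance or a non-normal distribution would change the numbers, so treat this as a back-of-the-envelope check and not a capability study.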

Is that the “right” choice?

“What?!” you shout. “Come on now! How can my choice even be questioned? I remember reading the article and demanded a standard deviation. It indeed validated the mean.” You even harrumph indignantly for emphasis.

Yes, you did demand the standard deviation and it did validate the mean. But either you didn’t read other articles I’d written or simply don’t remember the statistician’s favorite answer, “It depends.”

Figure: Eight green plastic toy soldiers in different combat poses. Each soldier has weapons and/or other gear. One holds a bazooka on his shoulder in a position for firing.

If your product is used during brain surgery, and the difference between a patient going home to resume a normal life or spending every future day lying in bed unable to move or speak depends on how your product functions, then choosing Supplier A is probably the best decision. However, what if the product is toy soldiers? You know, like the little green men from the movie Toy Story who ran reconnaissance for Woody and the rest of Andy’s beloved toys. Maybe the critical characteristic is the length of the soldier’s bazooka. Does it really matter if one bazooka is a little longer or shorter than another? The real question is, does it matter enough to spend ninety cents more per soldier? Probably not. So, in this case Supplier B at ten cents per part is the best decision. See, it does depend.

No statistic will ever choose for you. It’s just a tool to help in your decision-making process. But the standard deviation does provide you with more evidence to consider in support of making an informed decision tailored to the individual needs of the current issue. And, if you do market toy soldiers and award the job to Supplier B, you might just get a big fat whopping bonus for saving the company so much money.

So remember, whenever someone tries to cram a mean down your throat, demand the standard deviation! It validates the mean.
