Eeny, MEAN-y, MEDIAN-y, MODE (Location and Spread, Part 1 of 2)

“Eeny, MEAN-y, MEDIAN-y, MODE. How well do your statistics bode?” Statisticians use this simple children’s rhyme to pick who goes first at statistics conventions when we play Pin the Hat on the Sigma or bob for x-bars. They’re great icebreakers at the evening mixers. But it does raise the question: what’s better, a mean, a median, or a mode? The statistician’s favorite answer is, “It depends.”

One of the things statistics prides itself on is summarizing data to make it more understandable. I mean, you could look at data listed in a column and quickly see some of the underlying characteristics, like whether certain values repeat, whether the values seem to increase or decrease in a pattern, or whether the data seems to “bunch up” toward one central location. You can see these patterns if the data set is reasonably small, say fewer than fifty values. But what if you have a column of 500, or 1000, or over a million? Scanning a column that long isn’t very effective. This is where statistics shines!

All statistical investigations involve two aspects: Location and Spread. Spread will be discussed in Part 2 of this series. Today I’ll focus on Location.

Location is a way to understand if the data tends to “bunch up” toward a central, well, location. It’s the one number that “best” represents the data set, typically the mean, median, or mode. By far, the hands-down winner of this trifecta, right or wrong, for better or worse, is the mean (also commonly known as the average). The mean is typically used even when it’s not very good at representing the location where most of the data tends to migrate.

There are several types of means, but the arithmetic average is the one most people probably think of. Warning! For the mathematically faint of heart, avert your eyes now. I’m listing the formula here: 

Formula to calculate the arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

where \bar{x} (x-bar, an x with a line above it) is the sample mean, i is a counter incrementing by one, x_i represents each data point in the data set, and n is the total number of data points in the data set.

Despite the way it looks, it’s really easy. Ask any bowler what their average score was for three games and they’d simply add the three scores and divide by three. That’s it. The formula above is just mathematical mumbo jumbo for saying, “Add all the data values and divide by the total number of data points.”

So, if you had five data points: 2, 9, 4, 3, and 2, then 2+9+4+3+2=20. Hence, 20/5=4. That’s the mean. 
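For anyone who would rather let a computer do the adding, here is a minimal Python sketch of the same calculation. The function name `arithmetic_mean` is my own label for illustration, not a standard one.

```python
# The five data points from the example above.
data = [2, 9, 4, 3, 2]

def arithmetic_mean(values):
    """Add all the data values and divide by the number of data points."""
    return sum(values) / len(values)

print(arithmetic_mean(data))  # 4.0
```

Python's built-in `statistics.mean` does the same job, so in practice you would rarely write this yourself.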

The median is a little more cumbersome to determine. You have to order all the data from smallest to largest and find the one smack-dab in the middle. So, using our five data points from above, you arrange them as 2, 2, 3, 4, and 9. The median is 3 because it’s right in the middle. There are two data points below it and two above.

That’s all there is to it if you have an odd number of data points (i.e., n=5, if you’re into practicing the terms defined in the formula above), but if the number of data points is even, then it gets slightly more complicated. So, if we add another 9 to our data set above (now n=6; see, you’re getting it), then we have 2, 2, 3, 4, 9, and 9. Uh-oh, now there is no single value in the middle. No worries, just average the two middle points: (3+4)/2=3.5. That’s the median.
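The odd and even cases can be sketched in a few lines of Python. This is one straightforward way to write it, with a function name (`median`) chosen for illustration; the standard library's `statistics.median` follows the same rule.

```python
def median(values):
    """Sort the data, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: one value sits smack-dab in the middle.
        return ordered[mid]
    # Even number of points: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 9, 4, 3, 2]))     # 3
print(median([2, 9, 4, 3, 2, 9]))  # 3.5
```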

The mode can be even more cumbersome to find, but it’s also easy. Again, arrange the data from smallest to largest and find the value occurring most often. Using our original data set of five points above, we see 2 occurs twice and all the others occur once. Hence, 2 is the mode.

For grins, if we add that additional 9 back into the mix, then we find there are two modes, 2 and 9. We call that “bimodal.” Three modes are trimodal, but after two we usually just say “multimodal.”
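Counting by hand gets old fast, so here is a rough Python sketch that returns every value tied for most frequent. The function name `modes` is an assumption of mine; it leans on `collections.Counter` from the standard library.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often, smallest first."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([2, 9, 4, 3, 2]))     # [2]
print(modes([2, 9, 4, 3, 2, 9]))  # [2, 9]
```

One caveat with this sketch: if no value repeats, every value ties for most frequent and the whole data set comes back, which is a hint that the mode is not telling you much.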

Okay, so using the original data set with n=5, the mean is 4, the median is 3, and the mode is 2. Which is the one number best representing your data set? We could arrange the data in a graph by value across the range, as shown below:

2

2 3 4         9

The median of 3 looks pretty representative, but it is quite a distance from 9. That’s an issue with medians. They ignore the magnitude of very large or small values. The mean of 4 gets closer to 9, but if 9 is really just an unusual value, then 4 may be a bit misleading. This is the issue with means. Extreme values “pull” the mean toward them. Is the mode of 2 representative? Well, it’s certainly close to the median of 3 and not too far from the mean of 4. It’s also a reasonable representation for that group of data bunched up between 2 and 4, but it’s at the complete opposite end from 9, where 9 is sitting all alone, distanced from the others like a statistician at a singles bar. That’s the issue with modes. Just a few extra duplicate values anywhere along the range of the data will qualify a value as the official mode, regardless of whether it’s representative of the bulk of the data. And what if your data is multimodal? Then what?
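To see the “pull” in numbers, here is a small Python sketch comparing the data set with and without the 9. The variable names are mine, chosen for illustration.

```python
from statistics import mean, median

without_nine = [2, 2, 3, 4]
with_nine = [2, 2, 3, 4, 9]

# How far does each statistic move when the lone 9 joins the data?
mean_shift = mean(with_nine) - mean(without_nine)
median_shift = median(with_nine) - median(without_nine)

print(mean_shift, median_shift)  # 1.25 0.5
```

Adding the single 9 moves the mean from 2.75 to 4, while the median only moves from 2.5 to 3.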

So, if you’re forced to pick either the mean, median, or mode to represent a data set like this, then choose wisely. None of them may be all that good. You should really think long and hard about your situation at hand and how your choice will affect decision-making regarding your process.

“Wow! I used statistics because it’s supposed to do the thinking for me,” you mutter incredulously. “You mean statistics doesn’t always provide the true answer?” Yes, I do mean that. The tools can help, but if anyone tries to sell you statistics as absolute truth, then you should run kicking and screaming in the opposite direction. Statistics is not the decision maker. It is a tool—only a tool—for decision makers to use. Fundamental understanding of the process is the real arbiter of how the tool is used. A skilled carpenter can deftly wield a hammer to build a house, but some poor schlep statistician wannabe can’t expect the hammer to build the house for him.

So, which do you use: the mean, median, or mode?

Suppose you’re trying to calculate the mean salary of employees in the company where you work. You get the list of salaries (with no names, of course) from HR, and calculate the mean salary to be $500,000. Hmm, how does that relate to what you’re making?  Does it appear to be the one number that “best” represents the data set?

If you’re at my level in the corporate pecking order, a $500,000 average seems a bit high to believe. If you’re at the executive level, it may be low. Think about it: the CEO, president, and other muckety-mucks earn multiple millions including bonuses, stock options, etc. Those few outrageous salaries, far greater than almost everyone else’s, may “pull” the mean way up. Maybe the median, where 50% of the employees make more and 50% make less, is a better choice for the location value.
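Here is a hedged Python illustration of how that can happen. The salary figures below are completely made up for the example (nine rank-and-file salaries plus one executive package); they are not from any real company.

```python
from statistics import mean, median

# Hypothetical salaries: nine ordinary employees and one executive.
salaries = [40_000, 45_000, 50_000, 55_000, 60_000,
            65_000, 70_000, 80_000, 95_000, 4_440_000]

print(f"mean:   ${mean(salaries):,.0f}")    # mean:   $500,000
print(f"median: ${median(salaries):,.0f}")  # median: $62,500
```

With these invented numbers, nine of the ten employees earn less than a fifth of the mean, while the median lands right in the middle of the pack.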

Suppose you’re trying to find the mean diameter of a tube used in a pump your company produces. If the data looks like it tends to bunch up toward a central location, without a lot of unusually high or low values, then maybe the mean is appropriate.

If you’re looking to sell your house, you may consider the mode, the price most houses in your neighborhood are selling for during the same relative time frame as when you put your house on the market.

Of course, in all three of those examples, reasonable arguments can be made to use one of the other two statistics of location to best represent the data set. Again, we may quote the statistician’s favorite answer of, “It depends.”

The point is, don’t just blindly accept one because that’s what you remember your professor talking about when you woke in that one statistics class you were forced to take in college. Th-th-think about how you will apply that number. Graphing the distribution may help you decide.

[Figure: Two distributions. One is unimodal and symmetric, showing the mean, median, and mode are all equal. The other is skewed to the right (i.e., a long tail extends to the right) and shows the mean, median, and mode are all different values.]

If the distribution of data is approximately unimodal (i.e., one hump) and symmetric (i.e., split the graph down the middle and one half mirrors the other half), like a Normal Distribution (that will be discussed more in future posts), then the mean, median, and mode are essentially the same number and it doesn’t much matter which you use. However, if your distribution is skewed (i.e., long-tailed either to the left or right), then the mean, median, and mode will generally be different values.
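A quick Python sketch makes the contrast concrete. Both data sets below are small, invented examples picked so the arithmetic comes out clean; real data will rarely be this tidy.

```python
from statistics import mean, median, mode

# Hypothetical symmetric, one-hump data.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Hypothetical right-skewed data with a long tail toward 16.
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 16]

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name, mean(data), median(data), mode(data))
```

For the symmetric set all three statistics equal 3. For the right-skewed set the mode is 1, the median is 2, and the mean is 4, with the long tail dragging the mean furthest.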

If you’re using the data to perform a statistical method in order to make a judgement about your process, then the method is probably “parametric,” meaning the type of distribution the data represents is known, understood, and can be defined by the parameters of that distribution. There are methods for all types of distributions, but the most commonly used methods (e.g., t-test, confidence interval, ANOVA, etc.) rely on the data distribution being, more or less, unimodal and symmetric (like a Normal Distribution) in order to characterize the mean. So, if your data are not distributed as the parametric methods’ assumptions require, then you’ve just biased the result you will receive to make decisions on your process. (For more on bias, check out my post Beware the Tides of Bias.)

Nonparametric (sometimes called “distribution free”) methods are not as sensitive to the distribution shape of data and they are often used to characterize the median. However, if your data are distributed in some very unusual pattern, maybe looking like a topographic map of the Himalayas, then nonparametric methods may not work very well either. An example is shown in the histogram below. I call it the “face” distribution because if you rotate it ninety degrees it kind of looks like a face. Neither the mean nor the median characterizes the data very well and there are multiple modes. To find a legitimate value to represent the location, your process needs to produce measured values with some consistency.

[Figure: Multimodal distribution with four discernible and separate peaks. The mean does not equal the median and there are multiple modes.]

Yikes! You’ve got bigger issues with your process that you need to deal with before picking the mean, median, or mode. However, people apply statistical methods to the type of data depicted above all the time and wonder why the decisions they made based on the results they received don’t work. “Statistics,” they moan, “What a waste of time!” Well, don’t blame the tool because you used it incorrectly. You used a hammer to try tightening a bolt. What did you expect?

One reason people apply statistical methods to this type of data is that they dive right into the method without even looking at the data. Stop and use a little common sense first. A simple histogram would have shown that, no matter what method they used, the shape of the distribution compromised the legitimacy of the method.

The best place to start any statistical analysis is with graphing the data—histograms, trend charts, or other pictures to visualize patterns before you start shooting your six-gun of stat methods at the data. (Oh boy! Graphing. More future posts.)
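You don't even need a plotting package for a first look. The sketch below is a bare-bones text histogram in Python; the function name `text_histogram` is my own, and it assumes whole-number data (real measurements would need to be grouped into bins first).

```python
from collections import Counter

def text_histogram(values):
    """Return one line per integer value in the range, with a star per occurrence."""
    counts = Counter(values)
    lines = []
    for value in range(min(values), max(values) + 1):
        lines.append(f"{value:>3} | {'*' * counts[value]}")
    return "\n".join(lines)

print(text_histogram([2, 9, 4, 3, 2]))
```

Run on the five-point data set from earlier, it shows the cluster between 2 and 4 and the lone 9 sitting off by itself, with empty rows for 5 through 8 in between.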

If location was all there was to statistical investigations, then statistics would be a few bricks short of a full load. Fortunately, there’s spread to help sort out the mess. Stay tuned for Part 2.
