Changes to Rating Scale Formats Can Matter, But Usually Not That Much
Few things seem to elicit more opinions, exaggerations, and accusations than rating scale response options.
From the “right” number of points to the use (or not) of labels to the order in which options are presented (to name a few), it seems we all have opinions about the conventions and “rules” for selecting response options.
Rules and conventions often provide reasonable advice for researchers, but not always. You don’t want to have to reinvent the survey wheel each time you need to collect data. But you also don’t want to rely on a shaky foundation.
The concern many researchers have is that if you use the “wrong” format you’ll skew your results. And that’s a legitimate concern. After all, why go through all the trouble and cost of building a survey and collecting data only to be misled by the results? But what if the “cure” for potential errors in responses is worse than the putative problem? It helps to first know whether there’s a problem and how large its impact is.
In writing Surveying the User Experience, we were surprised by how flimsy some of the rationales were for certain conventions, or, when we did find deleterious impacts on responses, how small they were.
While you don’t want to carelessly ignore potential biases and errors in responses, we’ve found that most decisions in UX and customer research are based on relatively large effects. For example, you rarely need to measure the sentiment in a customer population to within 1%. We usually see decisions affected more when differences are in the 10% to 20% range.
For example, 80% approval versus 60% (a 20-percentage-point difference) is large enough to affect an important decision, but a one-point difference (80% vs. 79%) usually won’t be. Maybe 1% is enough in special circumstances, but if the stakes are that high, you’ll know.
Over the past few years, we have investigated and quantified 21 possible effects on rating scales. We summarized the literature and, in many cases, conducted primary research with thousands of participants and either replicated, qualified, or contradicted findings from the literature.
In this article, we briefly review these 21 effects.
Read the full article on MeasuringU's Blog
Discussion
Changes to rating scale formats can matter, but usually not that much.
As shown in Figure 1 and Table 1, more than half of the manipulations we investigated (11/21) had less than a 1% impact on outcomes. Only four manipulations had estimated effects of 3% or more, and only two of those were statistically significant.
The largest effect we found (22%) came from an experiment investigating a poorly informed recommendation to use just three response options to measure likelihood to recommend (would not recommend, unsure, would recommend), based on the mistaken belief that people have trouble responding to an eleven-point (0–10) scale. Trying to fix this nonexistent problem would create a real one: the inability to identify respondents with a very strong intention to recommend. That makes it a good example of the “cure” being worse than the “disease.”
The smallest effect was a difference of 0.1% in selection rates for a select-all-that-apply grid versus a forced-choice yes/no grid. The format had virtually no effect on selection rates, but only 13% of participants indicated a preference for the forced-choice yes/no grid, while over 70% preferred clicking the checkboxes they wanted in the select-all-that-apply grid.
Most manipulations had minimal impact. One thing is clear from Figure 1: none of the effects we studied had an estimated effect of exactly 0. On the one hand, that may fuel the concern that you should be even more cautious because it shows that changes do affect results. On the other hand, what it actually illustrates is something central to hypothesis testing: with a large enough sample size you will almost always find a difference, and at any sample size, it’s unlikely to get a difference of exactly 0. For practical significance, what matters isn’t whether there’s a difference but how large it is. In seven of the 21 manipulations in Table 1, the difference was less than or equal to half a percent, and in 13 of 21, the estimated difference was less than or equal to 1%.
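To make the distinction between statistical and practical significance concrete, here is a minimal Python sketch (not from the article; the agreement rates and sample sizes are hypothetical, and it uses SciPy only for the normal distribution). It runs a pooled two-proportion z-test on a half-percentage-point difference at two sample sizes: the difference is nowhere near significant at 500 respondents per group but becomes statistically significant at 100,000 per group, even though a half-point difference would rarely change a UX decision.

```python
# Hypothetical illustration of statistical vs. practical significance.
from math import sqrt
from scipy.stats import norm

def two_prop_z(p1, p2, n1, n2):
    """Pooled two-proportion z-test; returns z and the two-sided p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

# A 0.5-percentage-point difference (80.5% vs. 80.0%) at two sample sizes:
for n in (500, 100_000):
    z, p = two_prop_z(0.805, 0.800, n, n)
    print(f"n={n:>7,} per group: z={z:.2f}, p={p:.3f}")

# At n=500 the difference is far from significant; at n=100,000 it is,
# yet the size of the difference (0.5 points) hasn't changed at all.
```

The point of the sketch is simply that significance is a function of sample size, so the size of the difference, not the p-value alone, should drive applied decisions.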
There could be other effects. Although we have investigated many potential impacts on rating scales, other manipulations could (and likely do) affect your data. After all, our largest effect came from some bad advice, so it’s certainly possible that new “cures” will be proposed in the future that cause more harm than good. If we see any, we’ll test them and let you know!
We provided links in Table 1 so you can explore the literature that documents studies conducted on these 21 manipulations. Or, for a complete discussion of these effects and the supporting sources, see pp. 116-256 in Surveying the User Experience. We also have a companion course that follows the book on MeasuringUniversity.com.
Takeaway: Changes to rating scales matter, but usually not that much in applied UX research. Focus more on doing something about your findings than arguing over the number of points in a scale (or any of the other manipulations that have negligible effects on outcomes).
Upcoming Feature in MUiQ:
Moderated Studies
We are excited to announce that beginning in June, MUiQ will allow researchers to run moderated sessions directly in the platform.
With our initial feature release, researchers will be able to:
Moderated Studies in MUiQ will add a significant set of capabilities to further support every stage of UX research.
Reach out today to learn more about MUiQ!