47 UX Metrics, Methods, & Measurement Articles from 2024


Happy New Year from all of us at MeasuringU!

In 2024, we posted 47 articles and continued to add features to our MUiQ UX testing platform to make it even easier to develop studies and analyze results.

We hosted our 11th UX Measurement Bootcamp—a blended virtual event attended by an international group of UX practitioners who completed a combination of MeasuringUniversity online courses and live Zoom sessions. It was a challenging four weeks of intensive training on UX methods, metrics, and measurements; additionally, groups worked together to design a study in MUiQ, collect and analyze data, and prepare a report.

Through our MeasuringUniversity online course platform, we continue to offer access to our webinars and full courses.

In addition to publishing our latest book, Surveying the User Experience, we’ve continued to conduct research and go deep into UX topics, including UX metrics, methods, statistics, and industry benchmarks.

Want to catch up on what you’ve missed from MeasuringU in 2024? This article is a great place to start!

Read the full article on MeasuringU's Blog


Standardized UX Metrics

This year, our work on standardized UX metrics included four articles on the development and validation of the Perceived Website Clutter Questionnaire (PWCQ), three on using the UX-Lite, three on the relationship between the NPS and recommendation behavior, two on the System Usability Scale (SUS), and six on various other metrics (SEQ, SMEQ, SUPR-Q, TAC-10, box scoring, and click scales).

Perceived Website Clutter Questionnaire (PWCQ)

We often hear comments from users and clients about cluttered websites, indicating a need for a standardized, psychometrically validated measure of perceived clutter. We analyzed perceived clutter data collected in SUPR-Q surveys from 2022–2023, presenting our results in a peer-reviewed paper published in the International Journal of Human-Computer Interaction. Based on this paper, we wrote four articles documenting the development and validation of this new measure of perceived website clutter.

  • In Search of a Clutter Metric for Websites. We started our research with a literature review, which found that the everyday conception of clutter includes two components: the extent to which needed objects are disorganized, and the need to discard unnecessary objects. Most research on the measurement of clutter in UI design has been focused on objective measurement of clutter, but for UX evaluation, a more promising line of research is subjective measurement from the fields of market research and advertising.

  • Building a Website Clutter Questionnaire. Based on the literature review, we developed 16 items (six for content clutter and ten for design clutter) for an initial version of the questionnaire. We included those items plus one rating of overall clutter in eight SUPR-Q surveys conducted from April 2022 to January 2023 (2,761 responses collected from 57 websites). We used half the data for exploratory analyses in this article (holding the other half of the data back for future confirmatory analyses). The items aligned with the expected factors, and we used multiple regression to increase the questionnaire efficiency (retaining five items) while still keeping scale and subscale reliability high. (A minimal sketch of this split-half-plus-reliability workflow appears after this list.)

  • Confirming the Perceived Website Clutter Questionnaire (PWCQ). Using the independent data held back for confirmatory analysis of the clutter questionnaire, we found excellent fit for the five-item version of the PWCQ. We expect UX researchers and practitioners to be able to use this version of the clutter questionnaire when the research context is similar to the websites we studied in our consumer surveys. We don’t anticipate serious barriers to using the clutter questionnaire in other similar contexts, including task-based studies, mobile apps, and very cluttered web/mobile UIs, but because that research has not yet been conducted, UX researchers and practitioners should exercise due caution.

  • Incorporating Clutter in the SUPR-Q Measurement Framework. Our final analysis in this series was to use structural equation modeling to investigate the convergent and divergent validity of our measure of perceived clutter in the context of the SUPR-Q measurement framework (Figure 1). The model had good fit statistics and provided strong evidence for the distracting and detracting effects of clutter, showing that perceived clutter drags down ratings of Usability and Appearance (distracts), which in turn drags down ratings of Loyalty (detracts).
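
For readers curious about the mechanics behind the split-half approach described above, here is a minimal illustrative sketch (not the actual analysis code from the paper) that randomly splits a hypothetical set of item ratings into exploratory and confirmatory halves and checks scale reliability with Cronbach's alpha. The DataFrame, item names, and 1–7 rating scale are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of item columns (rows = respondents)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: one row per respondent, one column per candidate clutter item
# (random 1-7 ratings; real, correlated ratings would yield a much higher alpha).
rng = np.random.default_rng(1)
ratings = pd.DataFrame(rng.integers(1, 8, size=(400, 16)),
                       columns=[f"item_{i + 1}" for i in range(16)])

# Randomly split responses into exploratory and confirmatory halves.
exploratory = ratings.sample(frac=0.5, random_state=42)
confirmatory = ratings.drop(exploratory.index)

print(f"Exploratory half alpha:  {cronbach_alpha(exploratory):.2f}")
print(f"Confirmatory half alpha: {cronbach_alpha(confirmatory):.2f}")
```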


UX-Lite

The UX-Lite can be thought of as a short version of the influential Technology Acceptance Model. It has two five-point items, one measuring perceived ease of use (Ease) and the other perceived usefulness (Usefulness). Our three UX-Lite articles this year started with a review of its scoring and interpretation, followed by two explorations of statistical models of the relationship of the UX-Lite with tech adoption and future behavior (extracted and summarized from a peer-reviewed paper we published in the International Journal of Human-Computer Interaction).

  • How to Score and Interpret the UX-Lite. In this review of scoring and interpreting the UX-Lite, we cover administration, interpolation of raw scores to a 0–100-point scale, conversion to percentile scores (interpreted with S-plots and scatterplots), estimating SUS scores from the UX-Lite, and interpretation of scores with a curved grading scale. (A minimal scoring sketch appears after this list.)

  • Can the UX-Lite Measure Tech Adoption? Yes, it can. To understand how well the UX-Lite could predict technology adoption, we conducted three assessments using data from 2,412 respondents to see how well it matched or bested items from a modified version of the Technology Acceptance Model (mTAM) that measured perceived ease of use and perceived usefulness. We found that the UX-Lite was reliable and valid and fit the expected prediction models. UX researchers and practitioners can therefore use the two-item UX-Lite to measure perceived ease and usefulness effectively and efficiently, and improvements in ease and usefulness have a positive influence on important behavioral intentions.

  • Is the UX-Lite Predictive of Future Behavior? Yes, it is. We reached out to the respondents from the earlier study on tech adoption to see who was interested in participating in a follow-up study, ultimately analyzing 321 responses. Not only was the UX-Lite predictive of ratings of overall experience and behavioral intentions, but it was also predictive of usage behavior driven by the behavioral intention to use.
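
As promised above, here is a minimal sketch of the basic UX-Lite arithmetic described in the scoring article: each five-point item is interpolated to a 0–100-point scale and the two rescaled items are averaged. The ratings below are hypothetical; the article itself covers the finer points (percentile conversion, SUS estimation, and grading).

```python
# Minimal sketch of UX-Lite scoring (an illustration, not MeasuringU's production code):
# each five-point item is interpolated to 0-100, then the two items are averaged.

def rescale_to_100(rating: int) -> float:
    """Interpolate a 1-5 rating to a 0-100-point scale."""
    return (rating - 1) * 25.0

def ux_lite_score(ease: int, usefulness: int) -> float:
    """Return a respondent's UX-Lite score (mean of the two rescaled items)."""
    return (rescale_to_100(ease) + rescale_to_100(usefulness)) / 2

# Hypothetical ratings from five respondents: (Ease, Usefulness)
responses = [(5, 4), (4, 4), (3, 5), (5, 5), (2, 3)]
scores = [ux_lite_score(e, u) for e, u in responses]
print(f"Mean UX-Lite: {sum(scores) / len(scores):.1f}")  # 75.0 for these ratings
```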

Recommendation

The heart of the Net Promoter Score (NPS) is the likelihood-to-recommend (LTR), which is the behavioral intention that precedes actual recommendation behavior. We published three articles on the general topic of recommendation behaviors (including recommendations against something).
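
For reference, the NPS is computed from 0–10 LTR ratings by classifying respondents as promoters (9–10), passives (7–8), or detractors (0–6) and subtracting the percentage of detractors from the percentage of promoters. A minimal sketch with hypothetical ratings:

```python
# Minimal NPS sketch: classify 0-10 likelihood-to-recommend ratings and
# subtract the percentage of detractors from the percentage of promoters.

def net_promoter_score(ltr_ratings: list[int]) -> float:
    promoters = sum(1 for r in ltr_ratings if r >= 9)
    detractors = sum(1 for r in ltr_ratings if r <= 6)
    n = len(ltr_ratings)
    return 100.0 * (promoters - detractors) / n

ratings = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]   # hypothetical LTR responses
print(f"NPS: {net_promoter_score(ratings):.0f}")  # 4 promoters - 3 detractors -> NPS = 10
```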

  • How Many People Actually Recommend? More than you might expect. Those expressing any intention to recommend have a surprisingly high rate of recommendation behavior. Aggregating across four longitudinal data sources (two external and two MeasuringU), we estimate that between 50% and 60% of promoters ultimately recommend, making them about three times as likely to recommend as detractors.

  • Does the NPS Properly Measure Recommending Against a Brand? It is significantly associated with negative word-of-mouth (NWOM) but isn’t necessarily the best measure of recommending against. Our literature review of the NPS’s ability to measure recommending against a brand (i.e., NWOM, discouragement) found that a bipolar scale may better predict NWOM, but due to a lack of benchmarks, such a scale is not a suitable replacement for the NPS.

  • How Well Does the Net Promoter Score Measure Likelihood-to-Discourage? Likelihood-to-recommend measures likelihood-to-discourage, but not perfectly. As illustrated in Figure 2, the magnitude of the correlation between likelihood-to-recommend and likelihood-to-discourage is fairly large, but ratings of LTR only account for about a quarter to a third of the variation in ratings of likelihood-to-discourage. There is clear value in measuring LTR, but for a clearer picture of the full range of behavioral intention, there appears to be value in also collecting ratings of likelihood-to-discourage.


System Usability Scale

The System Usability Scale (SUS) is one of the most thoroughly researched standardized UX questionnaires. Even so, there are still opportunities to improve our understanding of its measurement properties. One of our SUS articles addressed whether the SUS is now antiquated, given that we’re approaching its 40th birthday. The other explored whether UX researchers should report SUS means or medians.

  • Is the SUS Too Antiquated? It shows its age in some ways but is far from antiquated. Despite some valid criticisms (word choice, writing style, redundancy, inapplicability in some settings, problems caused by alternating tone), the things the SUS does well (it’s reliable and valid and has excellent published norms) make it an excellent choice as a measure of perceived usability.

  • Should You Use the Mean or Median of the SUS? We recommend the mean over the median. A comparison of the means and medians of SUS scores from 18,853 individuals who used the SUS to rate the perceived usability of 210 products and services found a statistically significant but small two-point difference between SUS means and medians, making the median problematic when using existing methods of interpreting the SUS.
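
To make the mean-versus-median comparison concrete, here is a minimal sketch of the standard SUS scoring (odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5 to yield a 0–100 score), followed by the mean and median across a few hypothetical respondents.

```python
import statistics

def sus_score(ratings: list[int]) -> float:
    """Standard SUS scoring for ten 1-5 ratings (item 1 is odd-numbered)."""
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)   # odd items: r - 1; even items: 5 - r
        for i, r in enumerate(ratings)
    ]
    return sum(contributions) * 2.5

# Hypothetical ratings from three respondents (ten items each)
respondents = [
    [4, 2, 4, 2, 5, 1, 4, 2, 4, 2],
    [5, 1, 5, 2, 4, 2, 5, 1, 4, 1],
    [3, 3, 3, 2, 4, 2, 3, 3, 3, 2],
]
scores = [sus_score(r) for r in respondents]
print(f"Mean SUS:   {statistics.mean(scores):.1f}")
print(f"Median SUS: {statistics.median(scores):.1f}")
```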


Other UX Metrics

We published six additional articles on various other metrics (SEQ, SMEQ, SUPR-Q, TAC-10, box scoring, and click scales).

  • The Evolution of the Single Ease Question (SEQ). The Single Ease Question (SEQ) has been a popular measure of perceived ease for over 15 years. No other single-item measure of perceived task ease has a sufficient normative database for the assignment of percentiles to scores, has been calibrated to task completion and times, or can be interpreted with an adjective scale. Its format has varied over the years, but research we’ve conducted since 2022 has shown that numerous variations have little to no effect on respondent behaviors.

  • Do the Interior Labels of the SMEQ Affect Its Scores? Yes, they do. Unlike the typically negligible effects of interior labeling on more commonly used rating scales, the interior labels of the Subjective Mental Effort Questionnaire (SMEQ) appear to have a relatively strong impact on ratings (an effect size of about 10% of the scale range), so we recommend using the standard version with the interior labels.

  • Validating the Basic SUPR-Q Measurement Model. The development of the Standardized User Experience Percentile Rank Questionnaire (SUPR-Q) goes back to 2011 (published in 2015). Our new confirmatory psychometric analyses (CFA and regression model) of the basic SUPR-Q model using data from retrospective studies of eight sectors (n = 2,761 across 57 websites) found strong evidence of reliability and validity, including how the antecedent constructs account for almost half the variation in Loyalty scores (Figure 3).

  • 12 Things to Know About Using the TAC-10 to Measure Tech Savviness. In a series of articles (and at UXPA 2024), we reviewed the findings of eight years of research into measuring tech savviness. After analyzing thousands of participants’ data to understand how measures of tech savviness predict performance, we developed a questionnaire called the TAC-10 (the version of our Technical Activity Checklist with ten items). The TAC-10 is a reliable (consistent) and valid (predictive) measure of tech savviness. This article describes 12 things UX practitioners should know about the TAC-10.

  • Top Box, Top-Two Box, Bottom Box, Or Net Box? For most (but not all) situations, use top box. Detailed analysis of a rating of the ease of airline seat selection led to several insights about how to score rating scale items. Both means and top-box scores can be helpful in UX research because they answer different questions. We prefer top box over other types of box scores when the research focus is on the prediction of future behavior. (A minimal box-scoring sketch appears after this list.)

  • Are Click Scales More Sensitive than Radio Button Scales? No, they aren’t, at least for the SEQ. We collected data from 200 participants who used the click version of the SEQ to rate the difficulty of completing five online tasks that varied significantly in how hard they were. The click SEQ was not more sensitive than the seven-point standard (radio button) version.
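
Following up on the top-box article above, here is a minimal sketch of the box-scoring arithmetic with hypothetical seven-point ratings (the net box is computed here as top box minus bottom box).

```python
# Minimal box-scoring sketch for a seven-point rating scale (hypothetical data).
ratings = [7, 6, 6, 5, 4, 7, 3, 2, 6, 7]
n = len(ratings)

top_box     = 100 * sum(r == 7 for r in ratings) / n   # highest point only
top_two_box = 100 * sum(r >= 6 for r in ratings) / n   # top two points
bottom_box  = 100 * sum(r == 1 for r in ratings) / n   # lowest point only
net_box     = top_box - bottom_box                     # top box minus bottom box

print(f"Top box: {top_box:.0f}%  Top-two box: {top_two_box:.0f}%  "
      f"Bottom box: {bottom_box:.0f}%  Net box: {net_box:.0f}%")
```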


UX Methods

This year our UX methods focus was on surveys (six articles), usability testing (three articles), ChatGPT (two articles), opt-in online panels (one article), and click testing (one article).

Surveys

We published six articles about survey methodology based on our latest book, Surveying the User Experience.

  • Foundations of Survey Design in UX Research. This article provides brief descriptions of various survey topics, including how long surveys have been around; the differences between censuses, polls, and questionnaires; different types of surveys; the advantages and origins of standardized UX questionnaires; when surveys are the right research method; and the first steps when planning a survey.

  • Defining and Finding Participants for Survey Research. In our experience, one of the most soul-crushing difficulties of running surveys is the process of defining and finding participants. This article briefly covers topics such as defining and finding target participants, using online panels, getting people to participate, and compensating participants.

  • An Overview of Survey Sampling Strategies. It’s important to understand how sampling can provide a good picture of a population even if you can’t measure all or even most of its members. This article provides brief answers to key questions about survey sampling, such as what’s the first step, what are the different ways to collect probability and nonprobability samples, and whether it’s OK to use a nonprobability sample.

  • A Blueprint for Writing Survey Questions. This article provides a seven-point blueprint of what to think about when crafting survey questions: understand the anatomy of a survey item, determine the type of survey item you need, start writing, avoid common practices that can cause misinterpretation, review your survey questions to make them clearer, be aware of the reasons people forget, and help people remember.

  • Changes to Rating Scale Formats Can Matter, But Usually Not That Much. Over the past few years, we have investigated and quantified 21 possible effects on rating scales. We summarized the literature and, in many cases, conducted primary research with thousands of participants and either replicated, qualified, or contradicted findings from the literature. We briefly reviewed the 21 effects shown in Figure 4.

Usability Testing

We published three articles about usability testing, including key takeaways from think-aloud (TA) studies published in 2023, measurements of the typical length of unmoderated UX tasks, and no-show rates for moderated studies.

  • 10 Key Takeaways from the Latest Research on Thinking Aloud in Usability Testing. In the last two years, we’ve researched the unmoderated think-aloud (TA) method extensively, contributing to the field’s understanding and advancing the method. This article summarizes ten key findings from that research (e.g., TA identifies around 30% more problems, a lot of TA time is spent being silent, and most verbalizations describe actions).

  • How Long Are Typical Unmoderated UX Tasks? It depends on the type of task. A common logistical consideration when planning a task-based usability study is how much time you should plan for a task. Our estimates of the 75th percentile times to use for planning are 20 seconds for tree tests, 90 seconds for non-TA tasks, and 120 seconds for TA tasks.
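
If you want to derive a planning estimate like these from your own pilot data, a minimal sketch is to take the 75th percentile of observed task durations (hypothetical times below) and budget that much time per task.

```python
import numpy as np

# Hypothetical pilot task durations in seconds for one unmoderated (non-TA) task.
pilot_times = [48, 62, 55, 90, 41, 73, 66, 58, 102, 49, 61, 70]

planning_time = np.percentile(pilot_times, 75)   # 75th percentile as a planning estimate
print(f"Budget about {planning_time:.0f} seconds for this task.")
```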

ChatGPT

We continued research started in 2023 on the potential use of generative AI tools such as ChatGPT by UX researchers with two new articles, one on card sorting and one on tree testing.

  • Comparing ChatGPT to Card Sorting Results. Our comparison of ChatGPT’s ability to sort items into groups and to appropriately name the groups with the groups synthesized by human researchers from a standard open card sort found a strong similarity in numbers and names of categories. Items matched most of the time, the interrater reliability between the two methods was moderate to substantial, and there weren’t any obviously bad ChatGPT placements.

  • Using ChatGPT in Tree Testing: Experimental Results. Using data from multiple iterations of ChatGPT and 33 participants who located target items in a tree structure based on the IRS website (with the SEQ to assess perceived task difficulty), we found that ChatGPT is not suitable for estimating how well humans will find items in a tree test. However, ChatGPT predicted people’s ease ratings of the search tasks with reasonable accuracy.

Miscellaneous Methods

The three articles in this section provide information about the reliability of UI trap cards, the accuracy of opt-in panels, and first-click times on websites versus images.

  • Are Opt-In Online Panels Too Inaccurate? Opt-in panels are accurate enough for most UX research. An opt-in panel is a type of nonprobability panel. We re-analyzed data published by Pew Research on 28 variables and found that only four variables had opt-in errors greater than 10%. It’s rare, however, for critical UX research questions to need to be matched to census-type benchmarks. Any demographic targets regarding age, gender, or the types of items included in the Pew study can be addressed with quotas. Professional UX researchers develop surveys that enable the detection of most types of bad actors.

  • First Click Times on Websites Versus Images. Two studies comparing first click times for tasks on live and image versions of eight websites found participants take about 50% longer to make their first click on an image of a website than on the live website (possibly due to impoverished clickability cues on images). Researchers using click tests to compare images with live websites should pay more attention to click location data than to click times.


Statistical Topics

Our five articles on statistical topics included three on sample size estimation for usability studies, one on assessing interrater reliability, and one on how to analyze click data in standalone studies.

Sample Sizes for Usability Studies

The three articles in this section take different approaches to illustrating sample size estimation principles for usability studies.

  • Sample Sizes for Usability Studies: One Size Does Not Fit All. How many participants do you need for a usability study? It depends on the study type (discovery, estimation, and comparison), but even within study types, one size does not fit all.

  • What You Get with Specific Sample Sizes in UX Problem Discovery Studies. One sample size doesn’t fit all research needs for problem discovery studies such as formative usability studies. Fortunately, tabular and graphic aids can help UX researchers determine and justify sample sizes for these types of studies. You can use the table and graphs presented in this article to understand what you can expect to get with different sample sizes for problem discovery studies (e.g., Figure 5). This can be useful for initial sample size planning and for understanding the consequences of events that reduce the initially planned sample size. (The discovery model behind these aids is sketched after this list.)

  • How Many People Do You Need to See Trip on Your Carpet Before Fixing It? One, or even better, zero—but that’s not the real point of the question. This is a classic parable-like question in UX lore. It’s helpful in some ways but deceptive in others, especially when applied to formative usability studies. Basically, the carpet parable fails to apply to formative usability testing because it’s focused on a single known problem that has the potential for catastrophic consequences rather than discovery of a set of unknown problems that vary in their likelihood, severity (observed and potential), and priority for fixing.
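
The sketch referenced in the second bullet: most sample size guidance for problem discovery rests on the binomial model in which the probability of observing a problem with occurrence likelihood p at least once among n participants is 1 − (1 − p)^n. The values printed below are illustrative, not a reproduction of the article's figures.

```python
# Expected chance of discovering a problem at least once with n participants,
# using the standard 1 - (1 - p)^n discovery model.

def p_discovery(p: float, n: int) -> float:
    """Probability of observing a problem with occurrence likelihood p at least once."""
    return 1 - (1 - p) ** n

for n in (5, 10, 15, 20):
    print(f"n = {n:2d}: "
          + "  ".join(f"p={p:.2f} -> {p_discovery(p, n):.0%}" for p in (0.10, 0.25, 0.50)))
```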

Miscellaneous Stats

The two articles in this section explain how to assess the interrater reliability of UX researchers’ judgments and the basic suite of methods appropriate for the analysis of standalone studies that collect click data.

  • Assessing Interrater Reliability in UX Research. Two fundamental UX research activities are classification and discovery. This article shows that even though they both produce data that can be organized in 2×2 tables, their interpretation and analysis require different interrater reliability methods (kappa for classification and any-2 agreement for discovery).

  • How to Analyze Click Test Metrics in Standalone Studies. In standalone analyses of click data, you can use confidence intervals around your sample data to infer the plausible range of a population parameter such as a mean or proportion. To help you know which one to use, this article covers appropriate metrics and methods for various research questions. Many of these computations are done automatically in MUiQ, or you can use statistical packages or online calculators.
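
As an illustration of the confidence interval approach just described (and not MUiQ's implementation), here is a minimal sketch that computes a t-based interval for a mean click time and an adjusted-Wald interval for a click-through proportion, using hypothetical data.

```python
import math
from scipy import stats

def mean_ci(times: list[float], conf: float = 0.95) -> tuple[float, float]:
    """t-based confidence interval for a mean (e.g., click time in seconds)."""
    n = len(times)
    m = sum(times) / n
    sd = math.sqrt(sum((t - m) ** 2 for t in times) / (n - 1))
    margin = stats.t.ppf(1 - (1 - conf) / 2, n - 1) * sd / math.sqrt(n)
    return m - margin, m + margin

def adjusted_wald_ci(successes: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Adjusted-Wald (Agresti-Coull) interval for a proportion (e.g., correct first clicks)."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p_adj = (successes + z ** 2 / 2) / (n + z ** 2)
    margin = z * math.sqrt(p_adj * (1 - p_adj) / (n + z ** 2))
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical standalone click-test data
click_times = [3.2, 4.1, 2.8, 5.0, 3.7, 4.4, 2.9, 3.5]   # seconds to first click
print("Mean click time CI:", mean_ci(click_times))
print("Correct-click proportion CI:", adjusted_wald_ci(successes=14, n=20))
```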


Data Visualization

We published two articles on data visualization, one on an experiment comparing 2D and 1D bar graphs and the other on reporting percentages computed from small samples.

  • An Experiment Comparing 2D and 1D Bar Graphs. An analysis of 105 participants comparing 2D vs. 1D bar graphs found no difference in selection accuracy, slower selection of the 1D version (at least in part due to Fitts’ Law), and a strong user preference for 2D.

  • You Can Report Percentages with Small Samples, but Should You? In most situations, you should. Differences in percentages are easier to judge than differences in fractions when the denominators are different, and users strongly prefer percentages over fractions. If you want people consuming your research to make accurate and rapid comparisons of data, choose percentages over fractions (even better, present both).

UX Industry Reports

We conducted mixed-methods benchmark studies using the SUPR-Q and Net Promoter Scores across a wide range of online consumer services. Thanks to all of you who have purchased our reports. The proceeds from these sales fund the original research we post on MeasuringU. We also published two articles about the UX profession based on the 2024 UXPA salary survey.

SUPR-Q Benchmark Studies

In 2024, we published the results of six UX and Net Promoter benchmark studies; SUPR-Q scores are included in a SUPR-Q license.

  • Home Furniture. Our survey (n = 324) of six furniture websites (Ashley, Crate & Barrel, Ikea, Pottery Barn, Wayfair, World Market) found furniture websites have a generally good experience, but users were concerned with clutter, overwhelming inventories, and product quality [full report].

  • Banking. A total of 285 users of banking websites in the U.S. rated their experience with one of six banking websites. The UX of banking websites is about average, with users reporting issues with clutter, problems managing credit and debit cards, and security concerns [full report].

  • Federal Government. Our survey (n = 255) of five federal government websites found wide differences in UX by site. Overall, people seem to trust U.S. government websites, but findability challenges were a major problem, the amount of information was sometimes overwhelming, some complex language was hard to understand, and the lack of customer support can be problematic [full report].

  • Dating. Our evaluation of seven dating apps and websites (n = 280) found that users strongly prefer mobile apps over websites for online dating and are largely dissatisfied due to problems with dishonest users, dating scams, poor matching algorithms, and needing to pay extra for premium features [full report]. Despite attitudes toward dating apps being at an all-time low, the apps remain wildly popular, with over 300 million people using them worldwide and about 20 million paying for premium features. After all, for many, finding their person could be just one swipe away.

  • Social Media. For our survey (n = 324) of six social media apps and websites, we found that TikTok dominated the UX metrics we collected, significantly outperforming other apps on usability, preferred content, and mental health impact (X scored the lowest) [full report].

UXPA 2024 Salary Survey

Every few years, we assist our friends at the UXPA in helping the UX community understand the latest compensation, skills, and composition of the UX profession, most recently for their 2024 survey (two articles).

  • User Experience Salaries & Calculator (2024). We used data from the 2024 UXPA salary survey and a 2024 User Interviews salary survey to create a UX salary calculator. The calculator predicts UX salaries as a function of location (Country/U.S. Region), Job Level, Company Size, and Years of UX Experience (with 95% confidence intervals and information about the number of matching cases from the underlying salary data). A minimal sketch of this kind of regression appears after this list.

  • Do UX Certifications Pay Off? Certification does not guarantee higher pay, but there may be other benefits. Our 2024 exploration of the value of UX certification, compared with our findings from 2017, showed consistency on some key points along with some discrepancies. The overall effect of certification on salary remained nonsignificant. Most respondents did not think certification helped to increase their pay, but in addition to nontangible benefits like improved skills, certification likely plays a minor role in helping new practitioners break into the field.
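
As promised in the salary calculator bullet, here is a minimal sketch of how such a calculator might work under the hood: fit a regression of salary on categorical and numeric predictors, then produce a predicted salary with a 95% interval. The data, column names, and predictor levels below are entirely hypothetical; the real calculator is built on the UXPA and User Interviews survey data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Entirely hypothetical salary data (the real calculator uses the survey datasets).
df = pd.DataFrame({
    "salary":    [95000, 120000, 140000, 82000, 160000, 110000, 130000, 99000],
    "region":    ["US-West", "US-East", "US-West", "Europe",
                  "US-West", "US-East", "Europe", "US-East"],
    "job_level": ["IC", "IC", "Manager", "IC", "Manager", "IC", "Manager", "IC"],
    "years_ux":  [3, 6, 10, 2, 12, 5, 9, 4],
})

# Ordinary least squares with categorical predictors (C()) and years of experience.
model = smf.ols("salary ~ C(region) + C(job_level) + years_ux", data=df).fit()

# Predicted salary and 95% interval for a hypothetical profile.
profile = pd.DataFrame({"region": ["US-West"], "job_level": ["IC"], "years_ux": [5]})
pred = model.get_prediction(profile).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```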

Coming Up in 2025

For 2025, stay tuned for a year’s worth of new articles, industry reports, webinars, new MeasuringUniversity offerings, our annual boot camp, and a new peer-reviewed journal paper on updating the SUPR-Qm (February 2025 issue of the Journal of User Experience).
