Automatic Item Development and Validation using LLMs via AI-GENIE: In Silico and Human Results
Hudson Golino
Associate Professor of Quantitative Methods at the Department of Psychology, University of Virginia
Are you interested in LLMs, Network Psychometrics, Item Development and Validation, and how to automatically develop AND validate items in silico?
In the work below we show how to develop items using LLMs, how to assess structural validity in silico, and how the in-silico results match those from real human samples.
Updated Pre-Print: Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation via Network-Integrated Evaluation (Link in the comments)
The problem: The development of reliable and valid psychological scales is a resource-intensive and challenging process, particularly during the crucial stages of item generation and validation. Traditional approaches require extensive human intervention, making the process time-consuming and costly.
The solution: In response to these challenges, this paper introduces Automatic Item Generation and Validation via Network-Integrated Evaluation (AI-GENIE), a novel method for fully automated item development and validation in silico, leveraging the capabilities of large language models (LLMs) and network psychometric techniques. This new approach has the potential to assist in scale development by significantly reducing the reliance on expert input while maintaining the quality and validity of the generated items.
The methodology: The pipeline combines the latest open-source LLMs and generative AI with advances in network psychometrics to facilitate scale generation, selection, and validation. Our process eliminates the need to have content experts generate hundreds of items, recruit diverse and experienced researchers, administer the items to (and compensate) thousands of participants, and apply modern psychometric methods to the resulting data.
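To make the pipeline concrete, here is an illustrative-only sketch of an AI-GENIE-style loop in Python: generate candidate items per trait, estimate the items' community structure with a network method, and retain only items whose estimated community matches the intended trait. Every function name here is a hypothetical stand-in (the actual implementation is the AI-GENIE R package, which uses LLM calls and EGA), not the package's API.

```python
# Hypothetical sketch of the generate -> estimate -> select loop.
# generate_items() stands in for an LLM call; estimate_communities()
# stands in for EGA (e.g., EBICglasso or TMFG on item data).

def generate_items(trait, n):
    # Stand-in for an LLM generation call; returns placeholder item stems.
    return [f"{trait} item {i}" for i in range(n)]

def estimate_communities(items):
    # Stand-in for EGA community detection; here we fake a perfect
    # recovery by reading the trait name back out of the item stem.
    return [item.split(" item")[0] for item in items]

def reduce_pool(traits, n_per_trait):
    # Pair each generated item with its intended trait.
    pool = [(t, item) for t in traits for item in generate_items(t, n_per_trait)]
    estimated = estimate_communities([item for _, item in pool])
    # Retain items whose empirical community agrees with the intended trait.
    return [item for (t, item), c in zip(pool, estimated) if c == t]

big_five = ["Openness", "Conscientiousness", "Extraversion",
            "Agreeableness", "Neuroticism"]
kept = reduce_pool(big_five, 8)
```

In the real pipeline the selection step is iterative and driven by network psychometric diagnostics rather than the trivial string match used in this stub; the sketch only shows the shape of the loop.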
AI-GENIE has the potential to save researchers months, if not years, of time and thousands, if not tens of thousands, of U.S. dollars. Similar research has already found evidence that language embeddings and/or language models can approximate psychometric parameters traditionally requiring empirical data (Fyffe, Lee, & Kaplan, 2024; Guenole et al., 2024), clarify construct relationships and streamline psychological taxonomies (Cutler & Condon, 2023; Wulff & Mata, 2023), and potentially generate participant responses (Huang et al., 2024).
Our fully automated method is, to our knowledge, the first to generate, assess, and validate the quality of AI-generated items for psychometric scales (but see Laverghetta Jr, Luchini, Linell, Reiter-Palmon, & Beaty, 2024 for some recent propositions to develop and select items via LLMs).
Results 1: Monte-Carlo Simulation
Our results show that AI-GENIE is a robust and capable tool for item pool reduction. As can be seen in Figure 2 and Table 1, the average final NMI is an improvement over the average initial NMI for all conditions. The two EGA models used, TMFG and glasso, generally performed comparably, though TMFG yielded slightly better results than glasso in some conditions. This result is consistent with previous findings that suggest TMFG is a powerful tool in the context of text data (Golino et al., 2020). The Llama 3 models at the highest and second-highest temperatures showed the largest average improvements of 17.80 and 16.96 NMI points, respectively, across all trait types and EGA models. Also notable are GPT 3.5's lowest-temperature model, with an overall average change in NMI of 13.84 points, and Gemma 2's lowest-temperature model, with 13.30 points. Mixtral's standard-temperature model showed the most modest improvement of 9.78 points, though even this relatively small change shows that AI-GENIE is equipped to handle items generated by any of the five models, at any temperature setting, for any of the Big Five personality traits.
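The improvement metric above is normalized mutual information (NMI) between the theoretical trait assignment of the items and the community assignment recovered by EGA. As a minimal sketch (in Python rather than the AI-GENIE R implementation, with illustrative item labels, and using the arithmetic-mean normalization), NMI can be computed as:

```python
# Minimal NMI between two item partitions: NMI = 2 * I(X;Y) / (H(X) + H(Y)).
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_info(x, y):
    n = len(x)
    px, py, joint = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

def nmi(x, y):
    hx, hy = entropy(x), entropy(y)
    if hx == 0 and hy == 0:
        return 1.0  # two constant partitions agree trivially
    return 2 * mutual_info(x, y) / (hx + hy)

# Illustrative: 6 items intended for traits 0-2; EGA misplaces one item.
theoretical = [0, 0, 1, 1, 2, 2]
estimated   = [0, 0, 1, 2, 2, 2]
score = nmi(theoretical, estimated)  # between 0 (no agreement) and 1 (perfect)
```

An NMI of 1 means the recovered communities perfectly match the intended traits (up to relabeling); the "NMI points" reported above correspond to this quantity scaled by 100.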
Toy Example:
Results 2: Do the In-Silico Results Match Real Human Results?
To put everything together and demonstrate the validity of the surveys produced by AI-GENIE, we created five new Big Five personality surveys using Gemma 2, GPT 3.5, GPT 4o, Llama 3, and Mixtral. Each model generated at least 40 items using a temperature of 1, with the EBICglasso method used to validate the items within AI-GENIE.
The resulting surveys ranged from 28 items (Gemma 2) to 35 items (GPT 4o), with at least 4 items per Big Five trait.
Participants
Five nationally representative samples in the United States were recruited from Prolific. Each sample consisted of 1,000 people who completed one of the five Big Five surveys created and validated by AI-GENIE. Descriptive statistics for each sample and corresponding model are provided in Table 4. Participants were compensated at $12/hour, with the median completion time for each sample ranging from 3 to 4 minutes. Participants were still compensated but excluded from analysis if they demonstrated straight-line responding (zero variance in their responses) or selected a single response option (e.g., "Agree") for 95% or more of the items.
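The two exclusion rules described above are simple to operationalize. A minimal sketch in Python (function and variable names are illustrative, not from the study's actual analysis code):

```python
# Flag respondents for exclusion: zero response variance (straight-lining)
# or a single response option used on >= 95% of items.
from collections import Counter

def should_exclude(responses, threshold=0.95):
    """responses: list of Likert codes (e.g., 1-5) for one participant."""
    if len(set(responses)) <= 1:
        return True  # zero variance: identical answer to every item
    top_count = Counter(responses).most_common(1)[0][1]
    return top_count / len(responses) >= threshold  # near-uniform responding

# Usage with illustrative response vectors
assert should_exclude([3] * 30)                  # all "Neutral": excluded
assert not should_exclude([1, 2, 3, 4, 5] * 6)   # varied responding: kept
```

Note that the zero-variance rule is a special case of the 95% rule only when the survey is long enough; keeping both checks, as the study describes, covers short surveys as well.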
All surveys were administered independently over Qualtrics and, based on exclusion criteria on Prolific, people were only allowed to complete one of the five surveys. Participants provided consent, completed a survey, and concluded with a set of demographic questions (age, race/ethnicity, gender identity; Table 4). This study and its surveys were approved by the University's institutional review board.
NMI Between Theoretical and Empirical Results, and Between the In-Silico and Empirical Results:
EGA/BootEGA/Item Stability for the 5 Empirical (real human) Samples:
Conclusion:
The significance of these findings lies in the potential of AI-GENIE to enhance psychological assessment. By reducing the time, cost, and human resources required for traditional scale development, our approach makes the process more accessible to researchers and practitioners across various domains. The ability to quickly generate and validate large item pools with minimal human intervention could lead to a broader availability of high-quality psychological assessments, ultimately improving the measurement and understanding of psychological constructs in diverse populations. Moreover, the methodology's success in refining item pools for the Big Five personality traits suggests that it could be applied to other psychological constructs, potentially transforming the way psychological scales are developed and validated. This advancement represents a significant step forward in the integration of AI technologies into psychological research and practice, paving the way for more efficient, scalable, and reliable assessment tools.
The AI-GENIE R paper will be released soon.
The pre-print can be found here: