Automatic Item Development and Validation using LLMs via AI-GENIE: In Silico and Human Results
Hudson Golino
Associate Professor of Quantitative Methods at the Department of Psychology, University of Virginia
Are you interested in LLMs, Network Psychometrics, Item Development and Validation, and how to automatically develop AND validate items in silico?
In the work below we show how to develop items using LLMs, how to assess structural validity in silico, and how the in-silico results match those from real human samples.
Updated Pre-Print: Generative Psychometrics via AI-GENIE: Automatic Item Generation and Validation via Network-Integrated Evaluation (Link in the comments)
The problem: The development of reliable and valid psychological scales is a resource-intensive and challenging process, particularly during the crucial stages of item generation and validation. Traditional approaches require extensive human intervention, making the process time-consuming and costly.
The solution: In response to these challenges, this paper introduces Automatic Item Generation and Validation via Network-Integrated Evaluation (AI-GENIE), a novel method for fully automated item development and validation in silico, leveraging the capabilities of large language models (LLMs) and network psychometric techniques. This new approach has the potential to assist in scale development by significantly reducing the reliance on expert input while maintaining the quality and validity of the generated items.
The methodology: The pipeline combines the latest open-source LLMs and generative AI with advances in network psychometrics to facilitate scale generation, selection, and validation. Our process eliminates the need to have content experts generate hundreds of items, recruit diverse and experienced researchers, administer the items to (and compensate) thousands of participants, and apply modern psychometric methods to the resulting data.
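To make the pipeline concrete, here is an illustrative-only sketch of an AI-GENIE-style loop in Python: generate candidate items per trait, estimate the items' community structure with a network method, and retain only items whose estimated community matches the intended trait. Every function name here is a hypothetical stand-in (the actual implementation is the AI-GENIE R package, which uses LLM calls and EGA), not the package's API.

```python
# Hypothetical sketch of the generate -> estimate -> select loop.
# generate_items() stands in for an LLM call; estimate_communities()
# stands in for EGA (e.g., EBICglasso or TMFG on item data).

def generate_items(trait, n):
    # Stand-in for an LLM generation call; returns placeholder item stems.
    return [f"{trait} item {i}" for i in range(n)]

def estimate_communities(items):
    # Stand-in for EGA community detection; here we fake a perfect
    # recovery by reading the trait name back out of the item stem.
    return [item.split(" item")[0] for item in items]

def reduce_pool(traits, n_per_trait):
    # Pair each generated item with its intended trait.
    pool = [(t, item) for t in traits for item in generate_items(t, n_per_trait)]
    estimated = estimate_communities([item for _, item in pool])
    # Retain items whose empirical community agrees with the intended trait.
    return [item for (t, item), c in zip(pool, estimated) if c == t]

big_five = ["Openness", "Conscientiousness", "Extraversion",
            "Agreeableness", "Neuroticism"]
kept = reduce_pool(big_five, 8)
```

In the real pipeline the selection step is iterative and driven by network psychometric diagnostics rather than the trivial string match used in this stub; the sketch only shows the shape of the loop.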
AI-GENIE has the potential to save researchers months, if not years, of time and thousands, if not tens of thousands, of U.S. dollars. Similar research has already found evidence that language embeddings and/or language models can approximate psychometric parameters traditionally requiring empirical data (Fyffe, Lee, & Kaplan, 2024; Guenole et al., 2024), clarify construct relationships and streamline psychological taxonomies (Cutler & Condon, 2023; Wulff & Mata, 2023), and potentially generate participant responses (Huang et al., 2024).
Our fully automated method is, to our knowledge, the first to generate, assess, and validate the quality of AI-generated items for psychometric scales (but see Laverghetta Jr, Luchini, Linell, Reiter-Palmon, & Beaty, 2024 for some recent propositions to develop and select items via LLMs).
Results 1: Monte-Carlo Simulation
Our results show that AI-GENIE is a robust and capable tool for item pool reduction. As can be seen in Figure 2 and Table 1, the average final NMI is an improvement over the average initial NMI for all conditions. The two EGA models used, TMFG and glasso, generally performed comparably, though TMFG yielded slightly better results than glasso in some conditions. This result is consistent with previous findings that suggest TMFG is a powerful tool in the context of text data (Golino et al., 2020). The Llama 3 models at the highest and second-highest temperatures showed the largest average improvements of 17.80 and 16.96 NMI points, respectively, across all trait types and EGA models. Also notable are GPT 3.5's lowest-temperature model, with an overall average change in NMI of 13.84 points, and Gemma 2's lowest-temperature model, with 13.30 points. Mixtral's standard-temperature model showed the most modest improvement of 9.78 points, though even this relatively small change shows that AI-GENIE is equipped to handle items generated by any of the five models, at any temperature setting, for any of the Big Five personality traits.
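The improvement metric above is normalized mutual information (NMI) between the theoretical trait assignment of the items and the community assignment recovered by EGA. As a minimal sketch (in Python rather than the AI-GENIE R implementation, with illustrative item labels, and using the arithmetic-mean normalization), NMI can be computed as:

```python
# Minimal NMI between two item partitions: NMI = 2 * I(X;Y) / (H(X) + H(Y)).
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_info(x, y):
    n = len(x)
    px, py, joint = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

def nmi(x, y):
    hx, hy = entropy(x), entropy(y)
    if hx == 0 and hy == 0:
        return 1.0  # two constant partitions agree trivially
    return 2 * mutual_info(x, y) / (hx + hy)

# Illustrative: 6 items intended for traits 0-2; EGA misplaces one item.
theoretical = [0, 0, 1, 1, 2, 2]
estimated   = [0, 0, 1, 2, 2, 2]
score = nmi(theoretical, estimated)  # between 0 (no agreement) and 1 (perfect)
```

An NMI of 1 means the recovered communities perfectly match the intended traits (up to relabeling); the "NMI points" reported above correspond to this quantity scaled by 100.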
Toy Example:
Results 2: Do the In-Silico Results Match Real Human Results?
To put everything together and demonstrate the validity of the surveys produced by AI-GENIE, we created five new Big Five personality surveys using Gemma 2, GPT 3.5, GPT 4o, Llama 3, and Mixtral. Each model generated at least 40 items using a temperature of 1, with the EBICglasso method used to validate the items within AI-GENIE.
The resulting surveys ranged from 28 items (Gemma 2) to 35 items (GPT 4o), with at least 4 items per Big Five trait.
Participants
Five nationally representative samples in the United States were recruited from Prolific. Each sample consisted of 1,000 people who completed one of the five Big Five surveys created and validated by AI-GENIE. Descriptive statistics for each sample and corresponding model are provided in Table 4. Participants were compensated at $12/hour, with the median completion time for each sample ranging from 3 to 4 minutes. Participants were still compensated but excluded from analysis if they demonstrated straight-line responding (zero variance in their responses) or selected a single response option (e.g., "Agree") for 95% or more of the items.
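The two exclusion rules described above are simple to operationalize. A minimal sketch in Python (function and variable names are illustrative, not from the study's actual analysis code):

```python
# Flag respondents for exclusion: zero response variance (straight-lining)
# or a single response option used on >= 95% of items.
from collections import Counter

def should_exclude(responses, threshold=0.95):
    """responses: list of Likert codes (e.g., 1-5) for one participant."""
    if len(set(responses)) <= 1:
        return True  # zero variance: identical answer to every item
    top_count = Counter(responses).most_common(1)[0][1]
    return top_count / len(responses) >= threshold  # near-uniform responding

# Usage with illustrative response vectors
assert should_exclude([3] * 30)                  # all "Neutral": excluded
assert not should_exclude([1, 2, 3, 4, 5] * 6)   # varied responding: kept
```

Note that the zero-variance rule is a special case of the 95% rule only when the survey is long enough; keeping both checks, as the study describes, covers short surveys as well.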
All surveys were administered independently over Qualtrics and, based on exclusion criteria on Prolific, people were only allowed to complete one of the five surveys. Participants provided consent, completed a survey, and concluded with a set of demographic questions (age, race/ethnicity, gender identity; Table 4). This study and its surveys were approved by the University's institutional review board.
NMI Between Theoretical and Empirical Results, and Between the In-Silico and Empirical Results:
EGA/BootEGA/Item Stability for the 5 Empirical (real human) Samples:
Conclusion:
The significance of these findings lies in the potential of AI-GENIE to enhance psychological assessment. By reducing the time, cost, and human resources required for traditional scale development, our approach makes the process more accessible to researchers and practitioners across various domains. The ability to quickly generate and validate large item pools with minimal human intervention could lead to a broader availability of high-quality psychological assessments, ultimately improving the measurement and understanding of psychological constructs in diverse populations. Moreover, the methodology's success in refining item pools for the Big Five personality traits suggests that it could be applied to other psychological constructs, potentially transforming the way psychological scales are developed and validated. This advancement represents a significant step forward in the integration of AI technologies into psychological research and practice, paving the way for more efficient, scalable, and reliable assessment tools.
The AI-GENIE R paper will be released soon.
The pre-print can be found here: