Are Large Language Models Financially Literate? An Experiment with the "Big Five" Questions
In an era where artificial intelligence increasingly influences our daily decisions, I conducted a short experiment to test the financial literacy of three leading Large Language Models (LLMs): Claude, DeepSeek, and ChatGPT.
The Test
I tested each LLM using Lusardi's "Big Five" questions, which have been used globally to assess financial literacy:
- Interest Rate Question: "Suppose you had $100 in a savings account and the interest rate was 2% per year. After 5 years, how much do you think you would have in the account if you left the money to grow?" (Options: More than $102, Exactly $102, Less than $102, Do not know, Refuse to answer). A worked calculation for this and the inflation question appears after the list.
- Inflation Question: "Imagine that the interest rate on your savings account was 1% per year and inflation was 2% per year. After 1 year, how much would you be able to buy with the money in this account?" (Options: More than today, Exactly the same, Less than today, Do not know, Refuse to answer)
- Risk Diversification Question: "Please tell me whether this statement is true or false: 'Buying a single company's stock usually provides a safer return than a stock mutual fund.'" (Options: True, False, Refuse to answer)
- Bond Price Question: "If interest rates rise, what will typically happen to bond prices?" (Options: They will rise, They will fall, They will stay the same, There is no relationship between bond prices and interest rates, Prefer not to say)
- Mortgage Question: "A 15-year mortgage typically requires higher monthly payments than a 30-year mortgage, but the total interest paid over the life of the loan will be less." (Options: True, False)
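For readers who want to verify the arithmetic behind the first two questions, here is a minimal worked example in Python. This is my own illustration; the numbers come straight from the question text.

```python
# Q1: $100 at 2% interest, compounded annually, for 5 years.
principal = 100.0
rate = 0.02
years = 5
balance = principal * (1 + rate) ** years
print(f"Balance after {years} years: ${balance:.2f}")  # $110.41 -> "More than $102"

# Q2: 1% interest vs. 2% inflation over one year.
nominal = 100.0 * 1.01   # nominal balance after one year
real = nominal / 1.02    # deflate by inflation to get purchasing power
print(f"Real purchasing power: ${real:.2f}")  # ~$99.02 -> "Less than today"
```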
Each LLM was presented with these questions individually, and their responses were recorded.
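I ran the questions by hand, but the same protocol could be automated. The sketch below is a hypothetical illustration using the OpenAI Python SDK; the model name, the prompt wording, the `ask` helper, and the crude substring-based grading are all my assumptions, not part of the original experiment.

```python
from openai import OpenAI  # hypothetical SDK choice; any chat API would work

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Question text paired with the expected answer; remaining three questions
# omitted here for brevity.
BIG_FIVE = {
    "interest": ("Suppose you had $100 in a savings account and the interest "
                 "rate was 2% per year. After 5 years, how much would you "
                 "have? Options: More than $102 / Exactly $102 / Less than "
                 "$102.", "More than $102"),
    "inflation": ("If the interest rate on your savings account was 1% per "
                  "year and inflation was 2% per year, after 1 year could you "
                  "buy more, exactly the same, or less than today?",
                  "Less than today"),
}

def ask(question: str) -> str:
    """Send one question in a fresh conversation so answers stay independent."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

for name, (question, expected) in BIG_FIVE.items():
    answer = ask(question)
    # Substring matching is a rough check; free-form answers may need
    # manual review, as in the hand-run experiment described here.
    print(name, "correct" if expected.lower() in answer.lower() else "review manually")
```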
The Results
Surprisingly, or perhaps unsurprisingly, all three LLMs demonstrated perfect accuracy, correctly answering each of the five questions. Even when challenged with a follow-up question asking them to confirm their certainty, they stood firmly by their correct answers. As an additional challenge, I amended the wording of the test questions slightly to reflect their opposite: for example, I changed question two so that the interest rate was larger than the inflation rate. The answers were still correct, suggesting that the models' accuracy goes beyond simply "remembering" the training data verbatim.
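To make such "flipped" variants reproducible, the questions could be templated. A minimal sketch of the idea, again my own illustration rather than the procedure actually used:

```python
# Template for question two with the interest and inflation rates as parameters.
TEMPLATE = ("Imagine that the interest rate on your savings account was "
            "{interest}% per year and inflation was {inflation}% per year. "
            "After 1 year, would you be able to buy more, exactly the same, "
            "or less than today?")

variants = [
    {"interest": 1, "inflation": 2},  # original: inflation exceeds interest -> "less"
    {"interest": 2, "inflation": 1},  # flipped: interest exceeds inflation -> "more"
]
for v in variants:
    expected = "more" if v["interest"] > v["inflation"] else "less"
    print(TEMPLATE.format(**v), "->", expected)
```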
Why the Experiment Matters
"The Importance of Financial Literacy: Opening a New Field" (Lusardi and Mitchell, 2023) - documents concerning levels of financial literacy amongst humans.
As more people turn to LLMs for anything and everything, including financial guidance, whether through direct questions or as part of broader discussions, it's important to understand how well LLMs handle basic financial concepts. The experiment suggests that fundamental financial principles appear to be correctly encoded across multiple leading LLMs, which offers some reassurance given the growing appetite for using these AI systems for information and advice.
However, this reassurance should be tempered with caution.
While it's encouraging that these models can correctly answer standardized financial literacy questions, we must remember that LLMs provide probabilistic responses based on their training data, not deterministic calculations or certified financial advice. The accuracy on these basic questions, while promising, doesn't guarantee reliable answers to more complex, context-dependent financial queries.
Looking Forward: Three Concrete Research Directions
This morning's experiment, while limited in scope, points to several promising avenues for more rigorous research:
- Broader Model Coverage: A comprehensive study could test financial literacy across the full spectrum of current LLMs, including Llama 2, Gemini, Grok 3, Mistral, and other open- and closed-source models. This would help determine whether financial literacy is consistent across different model architectures and training approaches, or whether certain models perform better than others.
- Extended Question Set: While the Big Five questions provide a good baseline, future research should utilize Lusardi's more comprehensive 28-question Personal Finance Index (P-Fin). This would test the models' understanding across eight distinct areas of financial knowledge, from earning and consuming to investing and risk management. Such testing would reveal whether LLMs' apparent competency extends beyond basic concepts to more nuanced financial understanding.
- Model Access and Reliability Analysis: A systematic comparison between free and paid model versions could reveal whether subscription barriers impact financial knowledge reliability. This could have important implications for equity of access to reliable financial information, particularly if paid models consistently outperform free alternatives in financial knowledge accuracy.
A Note on Methodology
While these results are intriguing, it's important to acknowledge the limitations of this experiment. As someone who isn't an AI or LLM expert, my testing approach may not follow standard practices for evaluating AI systems. The questions, while standardized for human financial literacy testing, might not be the optimal way to assess an LLM's true understanding of financial concepts. Future research by AI experts could employ more rigorous methodologies to validate these preliminary findings and explore how LLMs actually process and "understand" financial information.