Testing AI the Human Way: Misguided or Revealing?
An interesting article in MIT Technology Review discusses how large language models like GPT-3 and GPT-4 are often tested using exams and IQ tests designed for humans. However, the author argues that acing these tests does not necessarily mean these AI systems possess human-like intelligence or capabilities. While the results are impressive, we cannot interpret them in the same way as with humans, since we don't fully understand the inner workings of these complex statistical models. More rigorous, controlled testing methods are needed to truly evaluate the strengths and limitations of this technology. Rather than getting distracted by test scores, researchers should focus more on investigating the mechanisms driving the models' behaviors. The article provides an important reminder that we need to avoid anthropomorphizing when evaluating artificial intelligence:
Introduction
The text discusses how large language models like GPT-3 and GPT-4 are being evaluated using tests designed for humans, and argues this approach is problematic. The central thesis is that passing human tests does not necessarily mean these AI systems possess human-like intelligence or capabilities. More rigorous evaluation methods are needed.
Models can pass certain reasoning tests
"What Webb’s research highlights is only the latest in a long string of remarkable tricks pulled off by large language models. For example, when OpenAI unveiled GPT-3’s successor, GPT-4, in March, the company published an eye-popping list of professional and academic assessments that it claimed its new large language model had aced." This shows these models can score well on tests of skills like analogical reasoning. However, as the next section explains, such results do not necessarily demonstrate real intelligence.
Results are open to interpretation
"But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?" Unlike with humans, when AIs perform well on cognitive tests, it is unclear what drives the results. They could be "memorizing" answers after seeing related examples in training data rather than exhibiting intelligence.
Performance is inconsistent
"The performance of large language models is brittle. Among people, it is safe to assume that someone who scores well on a test would also do well on a similar test. That’s not the case with large language models: a small tweak to a test can drop an A grade to an F." Unlike humans, small changes to tests can dramatically affect AIs' scores, indicating a lack of robust reasoning skills.
Anthropomorphizing harms evaluation
"The assumption that cognitive or academic tests designed for humans serve as accurate measures of LLM capability stems from a tendency to anthropomorphize models and align their evaluation with human standards," says Shapira. "This assumption is misguided." Researchers warn against seeing human-like cognition where it may not exist. More rigorous, controlled experiments are needed, avoiding built-in assumptions.
Moving goalposts or rethinking intelligence?
"It comes down to how large language models do what they do. Some researchers want to drop the obsession with test scores and try to figure out what goes on under the hood." Rather than dismissing failures as "moving the goalposts," many argue research should focus on understanding the mechanisms driving model behaviors before making claims about intelligence.
Conclusion
In summary, while large language models can score impressively on some human tests, passing such tests does not demonstrate true intelligence or cognition. More careful, rigorous evaluation methods are required to understand these systems' capabilities and limitations. The text argues research should investigate how models function rather than making anthropomorphic assumptions.
Source: MIT Technology Review, "Large language models aren’t people. Let’s stop testing them as if they were."
*[this newsletter is produced with the help of GPT-4 & Claude]
Comments
Communication Coach | Leadership Coach | Career Coach | Productivity Coach | Relationship Coach | Executive Coach
While test scores may be impressive, it is crucial to remember that artificial intelligence models like GPT-3 and GPT-4 operate differently than humans. Understanding the inner workings of these complex statistical models is essential to evaluating their true capabilities. Instead of focusing solely on test results, researchers should prioritize studying the mechanisms behind the behaviors these AI systems display. This article serves as a valuable reminder to avoid anthropomorphizing AI and to employ more controlled testing methods to assess their strengths and limitations.
Roni Chittoni, Global Business and Partnership Executive | CDO Program MIT | Digital Transformation
I feel it’s too late… Ignorance is bliss sometimes, and most people have no idea it’s a statistical model. I see lots of ordinary people already treating ChatGPT as their day-to-day friend… Talking to ChatGPT like it’s human… Hard not to relate one to the other…
Head of Design at ElifTech
Thanks for sharing! That’s definitely a thought-provoking issue.
Generative AI Specialist | WatsonX SME | Data & AI Solution Architect | Business-Driven Technologist | Developer Advocate | Speaker | Teacher | Restless | Rider | Roller Skater
IQ is definitely not the way to measure human intelligence. According to Gardner there are nine types, with which I totally agree: https://www.nordangliaeducation.com/pbis-prague/news/2020/12/09/the-nine-types-of-intelligence
Backoffice Systems Manager, SAP IT Manager, and University Professor
I wonder…