Testing AI the Human Way: Misguided or Revealing?
An interesting article in MIT Technology Review discusses how large language models like GPT-3 and GPT-4 are often tested using exams and IQ tests designed for humans. However, the author argues that acing these tests does not necessarily mean these AI systems possess human-like intelligence or capabilities. While the results are impressive, we cannot interpret them in the same way as with humans, since we don't fully understand the inner workings of these complex statistical models. More rigorous, controlled testing methods are needed to truly evaluate the strengths and limitations of this technology. Rather than getting distracted by test scores, researchers should focus more on investigating the mechanisms driving the models' behaviors. The article provides an important reminder that we need to avoid anthropomorphizing when evaluating artificial intelligence:
Introduction
The text discusses how large language models like GPT-3 and GPT-4 are being evaluated using tests designed for humans, and argues this approach is problematic. The central thesis is that passing human tests does not necessarily mean these AI systems possess human-like intelligence or capabilities. More rigorous evaluation methods are needed.
Models can pass certain reasoning tests
"What Webb’s research highlights is only the latest in a long string of remarkable tricks pulled off by large language models. For example, when OpenAI unveiled GPT-3’s successor, GPT-4, in March, the company published an eye-popping list of professional and academic assessments that it claimed its new large language model had aced." This shows these models can score well on tests of skills like analogical reasoning. However, as the next section explains, such results do not necessarily demonstrate real intelligence.
Results are open to interpretation
"But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?" Unlike with humans, when AIs perform well on cognitive tests, it is unclear what drives the results. They could be "memorizing" answers after seeing related examples in training data rather than exhibiting intelligence.
Performance is inconsistent
"The performance of large language models is brittle. Among people, it is safe to assume that someone who scores well on a test would also do well on a similar test. That’s not the case with large language models: a small tweak to a test can drop an A grade to an F." Unlike humans, small changes to tests can dramatically affect AIs' scores, indicating a lack of robust reasoning skills.
Anthropomorphizing harms evaluation
"The assumption that cognitive or academic tests designed for humans serve as accurate measures of LLM capability stems from a tendency to anthropomorphize models and align their evaluation with human standards," says Shapira. "This assumption is misguided." Researchers warn against seeing human-like cognition where it may not exist. More rigorous, controlled experiments are needed, avoiding built-in assumptions.
Moving goalposts or rethinking intelligence?
"It comes down to how large language models do what they do. Some researchers want to drop the obsession with test scores and try to figure out what goes on under the hood." Rather than dismissing failures as "moving the goalposts," many argue research should focus on understanding the mechanisms driving model behaviors before making claims about intelligence.
Conclusion
In summary, while large language models can score impressively on some human tests, passing such tests does not demonstrate true intelligence or cognition. More careful, rigorous evaluation methods are required to understand these systems' capabilities and limitations. The text argues research should investigate how models function rather than making anthropomorphic assumptions.
Source: MIT Technology Review, "Large language models aren’t people. Let’s stop testing them as if they were."
*[this newsletter is produced with the help of GPT-4 & Claude]
Comments
Communication Coach | Leadership Coach | Career Coach | Productivity Coach | Relationship Coach | Executive Coach
While test scores may be impressive, it is crucial to remember that artificial intelligence models like GPT-3 and GPT-4 operate differently than humans. Understanding the inner workings of these complex statistical models is essential to evaluating their true capabilities. Instead of focusing solely on test results, researchers should prioritize studying the mechanisms behind the behaviors these AI systems display. This article serves as a valuable reminder to avoid anthropomorphizing AI and to employ more controlled testing methods to assess their strengths and limitations.
Roni Chittoni, Global Business and Partnership Executive | CDO Program MIT | Digital Transformation
I feel it’s too late… Ignorance is bliss sometimes, and most people have no idea it’s a statistical model. I see lots of ordinary people already treating ChatGPT as their day-to-day friend… Talking to ChatGPT like it’s human… Hard not to relate one to the other…
Head of Design at ElifTech
Thanks for sharing! That’s definitely a thought-provoking issue.
Generative AI Specialist | WatsonX SME | Data & AI Solution Architect | Business-Driven Technologist | Developer Advocate | Speaker | Teacher | Restless | Rider | Roller Skater
IQ is definitely not the way to measure human intelligence. According to Gardner there are nine types, with which I totally agree: https://www.nordangliaeducation.com/pbis-prague/news/2020/12/09/the-nine-types-of-intelligence
Backoffice Systems Manager, SAP IT Manager, and University Professor
I wonder…