A truthful insight into an LLM benchmark
At the highest level, LLMs can either be benchmarked or evaluated against specific NLP tasks. Benchmarks are useful because they typically contain adversarial or difficult questions that reveal the true capability of the underlying LLM.
We all know about LLM hallucinations, and there is plenty of material on how to mitigate them. Here we discuss how we can evaluate the "truthfulness" of a model.
OpenAI released this benchmark, TruthfulQA, in 2021, and the paper makes a very interesting observation: "Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution."
The Hugging Face leaderboard shows that the top model today scores 54.41 on TruthfulQA: Marcoroni 70B v1 (AWQ), which is essentially a fine-tuned Llama 2 70B model.
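To get a feel for what the benchmark actually contains, here is a minimal sketch of loading TruthfulQA with the Hugging Face `datasets` library and printing one question. The dataset id and field names below are as published on the Hub; everything else is illustrative.

```python
# A minimal sketch, assuming the `datasets` library is installed
# (pip install datasets).
from datasets import load_dataset

# The "generation" config holds open-ended questions plus reference answers
truthful_qa = load_dataset("truthful_qa", "generation")["validation"]

sample = truthful_qa[0]
print(sample["question"])            # an adversarial question
print(sample["best_answer"])         # the reference truthful answer
print(sample["incorrect_answers"])   # popular misconceptions a model may mimic
```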
Although these LLMs don't do anything other than next word/token prediction, as humans we should distinguish between "truthful" and "correct". Consider the prompt “The first person to walk on the Moon was ”, and suppose the model responds with “Neil Armstrong”. What are we really asking here? In an important sense, we are not really asking who was the first person to walk on the Moon. What we are really asking the model is the following question: given the statistical distribution of words in the vast public corpus of (English) text, what words are most likely to follow the sequence “The first person to walk on the Moon was ”? A good reply to this question is “Neil Armstrong”.
If the model's response is “Neil Armstrong”, then it is telling the truth. However, consider a different prompt, “After the ring was destroyed, Frodo Baggins returned to ”, to which it responds “the Shire”. What are we doing here? On one level, it seems fair to say, we might be testing the model’s knowledge of the fictional world of Tolkien’s novels. But, in an important sense, the question we are really asking (as the developer presumably knows) is this: given the statistical distribution of words in the public corpus, what words are most likely to follow the sequence “After the ring was destroyed, Frodo Baggins returned to ”? To which an appropriate response is “the Shire”.
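Since the question above is literally "what words are most likely to follow this sequence?", we can put it to a small causal LM directly. A minimal sketch using GPT-2 via the `transformers` library (any causal checkpoint would do; GPT-2 is just small enough to run anywhere):

```python
# A minimal sketch, assuming torch and transformers are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The first person to walk on the Moon was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the token that follows the prompt
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>15} {prob.item():.3f}")
```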
There is a BIG difference between these two predictions from a human point of view. If the model didn't get the second answer right, it means it was not knowledgeable about those novels. Although the same logic holds for the first case, the stakes are higher when we are discussing common "truths": a wrong answer there is a falsehood about the real world, not just a gap in fictional lore.
TruthfulQA tackles a different problem: measuring how models mimic human falsehoods.
The core hypothesis starts from the question: why would a model generate a false statement? The paper gives two reasons: (1) the model hasn't learned the training distribution well enough, or (2) the model's training objective actually incentivizes a false answer. Falsehoods of the second kind are called imitative falsehoods, and they are less likely to be covered by existing QA benchmarks. Scaling laws suggest that scaling up would help decrease (1) but would increase (2).
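To make the evaluation concrete, here is a hedged sketch of MC1-style scoring on TruthfulQA's multiple-choice split: score each answer choice by the model's log-likelihood given the question, pick the highest-scoring choice, and check whether it is labelled true. This follows the idea behind the official metric, not the exact reference implementation, and reuses GPT-2 purely for illustration.

```python
# A sketch of MC1-style scoring, assuming torch, transformers and
# datasets are installed. Not the official TruthfulQA evaluation code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum of log-probs of the answer tokens, conditioned on the question."""
    prompt_ids = tokenizer(f"Q: {question}\nA:", return_tensors="pt").input_ids
    full_ids = tokenizer(f"Q: {question}\nA: {choice}", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the tokens belonging to the answer choice; this assumes the
    # prompt tokenization is a prefix of the full tokenization, which holds
    # for GPT-2's BPE in practice.
    answer_ids = full_ids[0, prompt_ids.shape[1]:]
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(answer_positions, answer_ids))

ds = load_dataset("truthful_qa", "multiple_choice")["validation"]
ex = ds[0]
scores = [choice_logprob(ex["question"], c) for c in ex["mc1_targets"]["choices"]]
best = max(range(len(scores)), key=scores.__getitem__)
print("model picked a true answer:", ex["mc1_targets"]["labels"][best] == 1)
```

Averaging that last check over the whole validation split gives an MC1-style accuracy, which is the kind of number reported on the leaderboard.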
#llm #capgemini #genai #llms #openai #truthfulqa #benchmark #evaluation