Truthful insight into LLM benchmark

At the highest level, LLMs can be either benchmarked or evaluated against specific NLP tasks. Benchmarks are useful because they typically contain adversarial or difficult questions that reveal the true capability of the underlying LLM.

We all know about LLM hallucinations, and there is plenty of material on how to mitigate them. Here we discuss how we can evaluate the "truthfulness" of a model.

OpenAI released the TruthfulQA benchmark in 2021, and the paper made a very interesting observation - "Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution."

The Hugging Face leaderboard shows that the top model today is at 54.41; it is basically a Llama 2 70B model - Marcoroni 70B v1 - AWQ.

Although these LLMs don't do anything other than next word/token prediction, as humans we should distinguish between "truth" and "correct". Consider the prompt “The first person to walk on the Moon was ”, and suppose the model responds with “Neil Armstrong”. What are we really asking here? In an important sense, we are not really asking who was the first person to walk on the Moon. What we are really asking the model is the following question: given the statistical distribution of words in the vast public corpus of (English) text, what words are most likely to follow the sequence “The first person to walk on the Moon was ”? A good reply to this question is “Neil Armstrong”.
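To make this concrete, here is a minimal sketch of what "asking the model" amounts to in practice. It assumes the Hugging Face transformers library and uses the small GPT-2 checkpoint purely for illustration (not any of the models discussed above); the point is simply that the model continues the prompt with its statistically most likely next tokens.

```python
# A minimal sketch: "answering" is just continuing the prompt with the
# statistically most likely next tokens.
# Assumes the Hugging Face `transformers` library; GPT-2 is used here only
# because it is small, not because it appears in the benchmark.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The first person to walk on the Moon was"
# Greedy decoding: take the most probable continuation, no sampling.
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```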

If the model responds with “Neil Armstrong”, it is telling the truth. However, consider a different prompt - “After the ring was destroyed, Frodo Baggins returned to ”, to which it responds “the Shire”. What are you doing here? On one level, it seems fair to say, you might be testing the model’s knowledge of the fictional world of Tolkien’s novels. But, in an important sense, the question you are really asking (as you presumably know, because you are the developer) is this: given the statistical distribution of words in the public corpus, what words are most likely to follow the sequence “After the ring was destroyed, Frodo Baggins returned to ”? To which an appropriate response is “the Shire”.

There is a BIG difference between these two predictions from a human point of view. If the model didn't get the second answer right, it means it was not knowledgeable about these novels. Although the same logic holds true for the first case, it matters more when we are discussing common "truths".

TruthfulQA tackles a different problem: measuring how models mimic human falsehoods.

The core hypothesis starts from a simple question: why would a model generate a false statement? There are two possible reasons:

  1. the model hasn’t learned the training distribution well enough (non-imitative weakness)
  2. the model’s training objective incentivizes a false answer (imitative falsehood)

Imitative falsehoods are less likely to be covered by current QA benchmarks. Scaling laws suggest that scaling would help decrease (1) but increase (2).
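For readers who want to look at the benchmark itself, the sketch below loads TruthfulQA and prints one question with its reference answers. It assumes the Hugging Face datasets library and the public "truthful_qa" dataset on the Hub; the split and field names reflect my understanding of the "generation" configuration and should be checked against the dataset card.

```python
# A minimal sketch of inspecting TruthfulQA, assuming the `datasets` library
# and the public "truthful_qa" dataset on the Hugging Face Hub.
# Field names below follow the "generation" config as I understand it.
from datasets import load_dataset

# The paper describes 817 questions; on the Hub the benchmark is exposed
# as a single "validation" split.
ds = load_dataset("truthful_qa", "generation")["validation"]

example = ds[0]
print("Question:          ", example["question"])
print("Best answer:       ", example["best_answer"])
print("Correct answers:   ", example["correct_answers"])
print("Incorrect answers: ", example["incorrect_answers"])
```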

#llm #capgemini #genai #llms #openai #truthfulqa #benchmark #evaluation

