A truthful insight into an LLM benchmark
At the highest level, LLMs can either be benchmarked or evaluated against specific NLP tasks. Benchmarks are useful because they typically contain adversarial or difficult questions that reveal the true capability of the underlying LLM.
We all know about LLM hallucinations, and there is plenty of material on how to mitigate them. Here we discuss how we can evaluate the "truthfulness" of a model.
OpenAI released this benchmark, TruthfulQA, in 2021, and the paper makes a very interesting observation: "Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution."
The Hugging Face leaderboard shows that the top model today scores 54.41 on TruthfulQA: Marcoroni 70B v1 (AWQ), which is essentially a fine-tuned Llama 2 70B model.
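To get a feel for what the benchmark actually contains, here is a minimal sketch of loading TruthfulQA with the Hugging Face `datasets` library and printing one question. The dataset id and field names below are as published on the Hub; everything else is illustrative.

```python
# A minimal sketch, assuming the `datasets` library is installed
# (pip install datasets).
from datasets import load_dataset

# The "generation" config holds open-ended questions plus reference answers
truthful_qa = load_dataset("truthful_qa", "generation")["validation"]

sample = truthful_qa[0]
print(sample["question"])            # an adversarial question
print(sample["best_answer"])         # the reference truthful answer
print(sample["incorrect_answers"])   # popular misconceptions a model may mimic
```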
Although these LLMs don't do anything other than next word/token prediction, as humans we should distinguish between "truthful" and "correct". Consider the prompt “The first person to walk on the Moon was ”, and suppose the model responds with “Neil Armstrong”. What are we really asking here? In an important sense, we are not really asking who was the first person to walk on the Moon. What we are really asking the model is the following question: given the statistical distribution of words in the vast public corpus of (English) text, what words are most likely to follow the sequence “The first person to walk on the Moon was ”? A good reply to this question is “Neil Armstrong”.
If the model's response is “Neil Armstrong”, then it is telling the truth. However, consider a different prompt, “After the ring was destroyed, Frodo Baggins returned to ”, to which it responds “the Shire”. What are we doing here? On one level, it seems fair to say, we might be testing the model’s knowledge of the fictional world of Tolkien’s novels. But, in an important sense, the question we are really asking (as the developer presumably knows) is this: given the statistical distribution of words in the public corpus, what words are most likely to follow the sequence “After the ring was destroyed, Frodo Baggins returned to ”? To which an appropriate response is “the Shire”.
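Since the question above is literally "what words are most likely to follow this sequence?", we can put it to a small causal LM directly. A minimal sketch using GPT-2 via the `transformers` library (any causal checkpoint would do; GPT-2 is just small enough to run anywhere):

```python
# A minimal sketch, assuming torch and transformers are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The first person to walk on the Moon was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the token that follows the prompt
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>15} {prob.item():.3f}")
```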
There is a BIG difference between these two predictions from a human point of view. If the model didn't get the second answer right, it means it was not knowledgeable about those novels. Although the same logic holds for the first case, the stakes are higher when we are discussing common "truths": a wrong answer there is a falsehood about the real world, not just a gap in fictional lore.
TruthfulQA tackles a different problem: measuring how models mimic human falsehoods.
The core hypothesis starts from the question: why would a model generate a false statement? The paper gives two reasons: (1) the model hasn't learned the training distribution well enough, or (2) the model's training objective actually incentivizes a false answer. Falsehoods of the second kind are called imitative falsehoods, and they are less likely to be covered by existing QA benchmarks. Scaling laws suggest that scaling up would help decrease (1) but would increase (2).
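To make the evaluation concrete, here is a hedged sketch of MC1-style scoring on TruthfulQA's multiple-choice split: score each answer choice by the model's log-likelihood given the question, pick the highest-scoring choice, and check whether it is labelled true. This follows the idea behind the official metric, not the exact reference implementation, and reuses GPT-2 purely for illustration.

```python
# A sketch of MC1-style scoring, assuming torch, transformers and
# datasets are installed. Not the official TruthfulQA evaluation code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum of log-probs of the answer tokens, conditioned on the question."""
    prompt_ids = tokenizer(f"Q: {question}\nA:", return_tensors="pt").input_ids
    full_ids = tokenizer(f"Q: {question}\nA: {choice}", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the tokens belonging to the answer choice; this assumes the
    # prompt tokenization is a prefix of the full tokenization, which holds
    # for GPT-2's BPE in practice.
    answer_ids = full_ids[0, prompt_ids.shape[1]:]
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(answer_positions, answer_ids))

ds = load_dataset("truthful_qa", "multiple_choice")["validation"]
ex = ds[0]
scores = [choice_logprob(ex["question"], c) for c in ex["mc1_targets"]["choices"]]
best = max(range(len(scores)), key=scores.__getitem__)
print("model picked a true answer:", ex["mc1_targets"]["labels"][best] == 1)
```

Averaging that last check over the whole validation split gives an MC1-style accuracy, which is the kind of number reported on the leaderboard.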
#llm #capgemini #genai #llms #openai #truthfulqa #benchmark #evaluation