Novel LLM Benchmarking Methodologies
Note: I haven't read deeply into LLM benchmarks and how they change and evolve, so some parts of this post might be redundant. This is just a thought experiment as I read and work through the material that is out there on LLM benchmarks.
I've always been intrigued by LLM benchmarks. Each time a new model comes out, it arrives with a bunch of benchmark results and a claim that it is the new state of the art, better than the existing models that score lower on those benchmarks.
Recently I came across https://livebench.ai/ and its corresponding paper at https://livebench.ai/livebench.pdf
The paper is a very easy read, but more importantly it gives some very nice insights into the various categories of LLM benchmarks and talks about how they are designed and administered.
LLMs as the Examiner
Of note is the fact that LLMs themselves are involved in designing some of the questions and even help in evaluating the results. While this is definitely interesting, and a great way to use LLMs to improve LLMs, there are some inherent issues.
Firstly, it looks like most of the questions that were designed by LLMs were designed by the GPT-x models. This carries a high risk of bias: the thought process of GPT models is most easily matched by other GPT models, compared to other families of models. Secondly, a new model will likely be as good as or better than the current state of the art that designed the questions in the first place, which means that newer models will most likely ace the results. The paper describes this too, with most models scoring in the upper-90% range.
There is also the fact that any newer model has probably been trained to ace the benchmarks, and hence answers the benchmark questions well and positions itself close to the top of the leaderboard, if not at the very top.
Hallucinating Evaluators?
On the other hand, there is also the interesting conundrum of LLMs judging the results when they themselves are not necessarily perfect and tend to hallucinate. How can such an LLM evaluate another LLM and be 100% confident in its results? This feels very wrong.
Students and Professors
Consider the analogy of students taking an examination. A professor (who is definitely far better studied and likely among the best in the field) sets the questions for the students. The problems have valid, known solutions, and the exam papers are compared against the answer key, with some intelligence thrown in. For another subject, a different professor who is the expert in that area sets the paper for that subject.
Couldn't this methodology also work for LLM benchmarking? Let's start with a baseline benchmark. Benchmarks typically have some 16-32 categories, with certain models acing each category. Now let the top-ranked model set up the questions for the rest of the LLMs: each category topper sets the questions for its category, and the other LLMs compete against it. If the LLM that set the questions doesn't answer them well itself, that could form another (reflective?) benchmark category.
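To make the idea concrete, here is a minimal sketch of one such category round in Python. Everything in it is hypothetical: the Model type is just a callable wrapping whatever API you use, and generate_questions and grade are placeholder prompts of my own, not anything taken from the LiveBench paper.

```python
from typing import Callable, Dict, List

# A "model" is just a callable mapping a prompt to a response string.
# In practice this would wrap an API call; it is left abstract here so
# the sketch stays self-contained.
Model = Callable[[str], str]

def generate_questions(examiner: Model, category: str, n: int = 5) -> List[str]:
    # The category topper authors the questions. A real benchmark would
    # also ask for (and verify) an answer key.
    return [examiner(f"Write exam question {i + 1} for the '{category}' category.")
            for i in range(n)]

def grade(examiner: Model, question: str, answer: str) -> float:
    # The examiner scores each answer between 0 and 1. A real system would
    # anchor this against a ground-truth key rather than trusting the judge.
    verdict = examiner(f"Question: {question}\nAnswer: {answer}\nScore from 0 to 1:")
    try:
        return max(0.0, min(1.0, float(verdict)))
    except ValueError:
        return 0.0

def run_category_round(examiner_name: str,
                       models: Dict[str, Model],
                       category: str) -> Dict[str, float]:
    # One round: the current topper writes the questions, every model
    # (including the examiner itself, for the "reflective" score) answers,
    # and the examiner grades the answers.
    examiner = models[examiner_name]
    questions = generate_questions(examiner, category)
    scores: Dict[str, float] = {}
    for name, model in models.items():
        answers = [model(q) for q in questions]
        scores[name] = sum(grade(examiner, q, a)
                           for q, a in zip(questions, answers)) / len(questions)
    return scores
```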
If any model manages to beat the category benchmark (100%?), it then takes over setting the questions for current and future models. This should also be a continuous scoring system, where the top models periodically set new questions and existing and newer models get a chance to topple the leaderboard.
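The promotion rule could look something like the sketch below. Again, this is just my reading of the idea: the threshold parameter stands in for the "100%?" bar above, and nothing here comes from an existing benchmark.

```python
from typing import Dict

def update_examiner(current_examiner: str,
                    scores: Dict[str, float],
                    threshold: float = 1.0) -> str:
    # If a challenger reaches the threshold, it takes over question-setting
    # for this category; otherwise the incumbent keeps the job.
    challengers = {name: s for name, s in scores.items()
                   if name != current_examiner and s >= threshold}
    if challengers:
        return max(challengers, key=challengers.get)
    return current_examiner
```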
There has to be some negative weight if a model cannot answer the questions it set itself, and also some stopping condition for frequent oscillations, such as a pair of models continuously swapping places at the top.
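One way to express both of those ideas, the self-penalty and the oscillation stop, is sketched here. The penalty weight and the window size are arbitrary knobs chosen purely for illustration.

```python
from typing import Dict, List

def adjusted_score(name: str, examiner_name: str,
                   scores: Dict[str, float],
                   self_penalty: float = 0.5) -> float:
    # Negative weight for the examiner: if it cannot answer its own questions
    # perfectly, the shortfall is amplified by self_penalty (arbitrary value).
    score = scores[name]
    if name == examiner_name:
        score -= self_penalty * (1.0 - score)
    return score

def is_oscillating(topper_history: List[str], window: int = 6) -> bool:
    # Stopping condition: the last `window` rounds show exactly two models
    # strictly alternating at the top of the category leaderboard.
    recent = topper_history[-window:]
    if len(recent) < window or len(set(recent)) != 2:
        return False
    return all(recent[i] != recent[i + 1] for i in range(window - 1))
```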
Does anyone know if any research has been done along these lines? Are there any benchmarks that behave this way? I'd love to read about it.
If nothing like this exists, I will try to build such a benchmarking concept and share my findings in a future post.