Novel LLM Benchmarking Methodologies
Image generated by DALL·E based on the text in this post


Note: I haven't read deeply into LLM benchmarks and how they change and evolve, so some parts of this post may be redundant. This is just a thought experiment as I read and work through the material that is out there on LLM benchmarks.

I've always been intrigued by LLM benchmarks. Each time a new model comes out, there is a batch of benchmark results and a claim that the model is state-of-the-art and better than the existing models, which score lower on those benchmarks.

Recently I came across https://livebench.ai/ and its corresponding paper at https://livebench.ai/livebench.pdf.

Firstly, the paper is a very easy read, but more importantly, it gives some very nice insights into the various categories of LLM benchmarks and how they are designed and administered.

LLMs as the Examiner

Of note is the fact that LLMs themselves are involved in designing some of the questions and even help in evaluating the results. While this is definitely interesting and a great way to use LLMs to improve LLMs, there are some inherent issues.

Firstly, it looks like most of the questions that are designed by LLMs are designed by GPT-x models. This carries a high risk of bias: the thought process of GPT models can most easily be matched by other GPT models, compared to other families of models. Secondly, a new model will likely be as good as or better than the current state-of-the-art that is designing the questions in the first place, which means the newer models will most likely ace the results. The paper describes this too, with most models scoring in the upper 90% range.

There is also the fact that any newer model has probably been trained to ace the benchmarks, and hence is able to answer the benchmark questions well and position itself close to the top of the leaderboard, if not at the very top.

Hallucinating Evaluators?

On the other hand, there is also this interesting conundrum of LLMs judging the results when they themselves are not necessarily perfect and tend to hallucinate. How can such an LLM evaluate another LLM and be 100% confident in its results? This feels very wrong.

Students and Professors

Let's go to the analogy of students taking an examination. A professor (who is definitely a lot better studied and likely among the best in the field) sets the questions for the students. The problems have valid, known solutions, and the exam papers are compared against the answer key, with some intelligence thrown in. For another subject, a different professor who is the expert in that area sets the papers.

This methodology could also work well for LLM benchmarking, couldn't it? Let's start with a baseline benchmark. Benchmarks typically have some 16-32 categories, with certain models acing each category. Now let this top-ranked model set up the questions for the rest of the LLMs. Similarly, each category topper sets up the questions for that category, and the other LLMs compete against it. If the LLM that set the questions doesn't do as well on them itself, that could become another (reflective?) benchmark category.
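To make the idea concrete, here is a minimal sketch in Python of how one examination round for a single category might run. Category, run_category_round, generate_questions, answer, and grade are all hypothetical names standing in for whatever harness and model APIs would actually be used; nothing here comes from LiveBench or any existing benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    examiner: str                       # current category topper, acts as question setter
    scores: dict = field(default_factory=dict)

def run_category_round(category, models, generate_questions, answer, grade):
    """One examination round: the category topper writes the questions,
    and every model (including the examiner itself) answers them."""
    questions = generate_questions(category.examiner, category.name)
    for model in models:
        answers = [answer(model, q) for q in questions]
        category.scores[model] = grade(questions, answers)
    # Reflective check: how well does the examiner do on its own questions?
    reflective_score = category.scores[category.examiner]
    return category.scores, reflective_score
```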

If any model manages to beat the category benchmark (100%?), it can then set up the questions for current or future models. This should also be a continuous scoring system where the top models periodically set up new questions, and existing and newer models get a chance to topple the leaderboard.

There has to be some negative weight if a model cannot answer the questions set by that very model, and also some stopping conditions if there are frequent oscillations, or if a set of models just keeps swapping places continuously.
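Continuing the sketch above, the promotion step, with a negative weight for an examiner that flunks its own questions and a crude check for two models endlessly swapping the top spot, might look like the following. The 0.9 threshold, the penalty size, and the oscillation window are arbitrary placeholders, not values from any real benchmark.

```python
def update_leaderboard(category, history, self_penalty=0.1, max_swaps=3):
    """Promote the best scorer to examiner, penalising an examiner that
    fails its own questions, and freeze the rotation if the same two
    models keep alternating at the top."""
    scores = dict(category.scores)

    # Negative weight: an examiner that cannot answer its own questions
    # is penalised before ranking (threshold and penalty are arbitrary here).
    if scores[category.examiner] < 0.9:
        scores[category.examiner] -= self_penalty

    top_model = max(scores, key=scores.get)
    history.append(top_model)

    # Stopping condition: the same two models strictly alternating at the top.
    recent = history[-2 * max_swaps:]
    oscillating = (
        len(recent) == 2 * max_swaps
        and len(set(recent)) == 2
        and all(recent[i] != recent[i + 1] for i in range(len(recent) - 1))
    )
    if oscillating:
        return category.examiner, True   # keep the current examiner, pause rotation

    category.examiner = top_model        # the new topper sets the next round's questions
    return top_model, False
```

In a full harness this would run per category after every round, so the examiner role keeps rotating as the leaderboard changes.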


Does anyone know if any research has been done along these lines? Are there any benchmarks that behave this way? I'd love to read about it.

If nothing like this exists, I will try to build such a benchmarking concept and share my findings in a future post.


