Novel LLM Benchmarking Methodologies
Note: I haven't read deeply into LLM benchmarks and how they change and evolve, so some parts of this post might be redundant. This is just a thought experiment as I read and work through the material that is out there on LLM benchmarks.
I've always been intrigued by LLM benchmarks. Each time a new model comes out, it arrives with a bunch of benchmark results and a claim that it is the new state of the art, better than the existing models that score lower on those benchmarks.
Recently I came across https://livebench.ai/ and its corresponding paper at https://livebench.ai/livebench.pdf
The paper is a very easy read, but more importantly it gives some very nice insights into the various categories of LLM benchmarks and talks about how they are designed and administered.
LLMs as the Examiner
Of note is the fact that LLMs themselves are involved in designing some of the questions and even help in evaluating the results. While this is definitely interesting, and a great way to use LLMs to improve LLMs, there are some inherent issues.
Firstly, it looks like most of the questions that were designed by LLMs were designed by the GPT-x models. This carries a high risk of bias: the thought process of GPT models is most easily matched by other GPT models, compared to other families of models. Secondly, a new model will likely be as good as or better than the current state of the art that designed the questions in the first place, which means that newer models will most likely ace the results. The paper describes this too, with most models scoring in the upper-90% range.
There is also the fact that any newer model has probably been trained to ace the benchmarks, and hence answers the benchmark questions well and positions itself close to the top of the leaderboard, if not at the very top.
Hallucinating Evaluators?
On the other hand, there is also the interesting conundrum of LLMs judging the results when they themselves are not necessarily perfect and tend to hallucinate. How can such an LLM evaluate another LLM and be 100% confident in its results? This feels very wrong.
Students and Professors
Consider the analogy of students taking an examination. A professor (who is definitely far better studied and likely among the best in the field) sets the questions for the students. The problems have valid, known solutions, and the exam papers are compared against the answer key, with some intelligence thrown in. For another subject, a different professor who is the expert in that area sets the paper for that subject.
Couldn't this methodology also work for LLM benchmarking? Let's start with a baseline benchmark. Benchmarks typically have some 16-32 categories, with certain models acing each category. Now let the top-ranked model set up the questions for the rest of the LLMs: each category topper sets the questions for its category, and the other LLMs compete against it. If the LLM that set the questions doesn't answer them well itself, that could form another (reflective?) benchmark category.
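To make the idea concrete, here is a minimal sketch of one such category round in Python. Everything in it is hypothetical: the Model type is just a callable wrapping whatever API you use, and generate_questions and grade are placeholder prompts of my own, not anything taken from the LiveBench paper.

```python
from typing import Callable, Dict, List

# A "model" is just a callable mapping a prompt to a response string.
# In practice this would wrap an API call; it is left abstract here so
# the sketch stays self-contained.
Model = Callable[[str], str]

def generate_questions(examiner: Model, category: str, n: int = 5) -> List[str]:
    # The category topper authors the questions. A real benchmark would
    # also ask for (and verify) an answer key.
    return [examiner(f"Write exam question {i + 1} for the '{category}' category.")
            for i in range(n)]

def grade(examiner: Model, question: str, answer: str) -> float:
    # The examiner scores each answer between 0 and 1. A real system would
    # anchor this against a ground-truth key rather than trusting the judge.
    verdict = examiner(f"Question: {question}\nAnswer: {answer}\nScore from 0 to 1:")
    try:
        return max(0.0, min(1.0, float(verdict)))
    except ValueError:
        return 0.0

def run_category_round(examiner_name: str,
                       models: Dict[str, Model],
                       category: str) -> Dict[str, float]:
    # One round: the current topper writes the questions, every model
    # (including the examiner itself, for the "reflective" score) answers,
    # and the examiner grades the answers.
    examiner = models[examiner_name]
    questions = generate_questions(examiner, category)
    scores: Dict[str, float] = {}
    for name, model in models.items():
        answers = [model(q) for q in questions]
        scores[name] = sum(grade(examiner, q, a)
                           for q, a in zip(questions, answers)) / len(questions)
    return scores
```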
If any model manages to beat the category benchmark (100%?), it then takes over setting the questions for current and future models. This should also be a continuous scoring system, where the top models periodically set new questions and existing and newer models get a chance to topple the leaderboard.
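The promotion rule could look something like the sketch below. Again, this is just my reading of the idea: the threshold parameter stands in for the "100%?" bar above, and nothing here comes from an existing benchmark.

```python
from typing import Dict

def update_examiner(current_examiner: str,
                    scores: Dict[str, float],
                    threshold: float = 1.0) -> str:
    # If a challenger reaches the threshold, it takes over question-setting
    # for this category; otherwise the incumbent keeps the job.
    challengers = {name: s for name, s in scores.items()
                   if name != current_examiner and s >= threshold}
    if challengers:
        return max(challengers, key=challengers.get)
    return current_examiner
```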
There has to be some negative weight if a model cannot answer the questions it set itself, and also some stopping condition for frequent oscillations, such as a pair of models continuously swapping places at the top.
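One way to express both of those ideas, the self-penalty and the oscillation stop, is sketched here. The penalty weight and the window size are arbitrary knobs chosen purely for illustration.

```python
from typing import Dict, List

def adjusted_score(name: str, examiner_name: str,
                   scores: Dict[str, float],
                   self_penalty: float = 0.5) -> float:
    # Negative weight for the examiner: if it cannot answer its own questions
    # perfectly, the shortfall is amplified by self_penalty (arbitrary value).
    score = scores[name]
    if name == examiner_name:
        score -= self_penalty * (1.0 - score)
    return score

def is_oscillating(topper_history: List[str], window: int = 6) -> bool:
    # Stopping condition: the last `window` rounds show exactly two models
    # strictly alternating at the top of the category leaderboard.
    recent = topper_history[-window:]
    if len(recent) < window or len(set(recent)) != 2:
        return False
    return all(recent[i] != recent[i + 1] for i in range(window - 1))
```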
Does anyone know if any research has been done along these lines? Are there any benchmarks that behave this way? I'd love to read about it.
If nothing like this exists, I will try to build such a benchmarking concept and share my findings in a future post.