At Wit's End On LLM Performance?

An article by Venkat Ramakrishnan. Image: Digital Content Writers India @ Unsplash

Nowadays, LLM performance is a daily topic! Like you, I am awestruck by those magical numbers whenever an article declares that such-and-such LLM has beaten every other LLM in the performance race and sits at the top of the list as of yesterday! It used to make my head spin for a while, but not anymore, because I have learnt to cut through the mumbo-jumbo, understand what those LLM performance indicators actually measure, and judge how much they matter for my application.

Let's look at that list, which typically goes as follows:

  • HumanEval: Measures an LLM's ability to generate correct code. This is not tested in the wild, but against a carefully curated dataset of programming problems that exercises different aspects of code generation: function signatures, docstrings, code bodies, and unit tests that the generated solution must pass.
  • MMLU (Massive Multitask Language Understanding): The LLM is checked against around 57 tasks through multiple-choice questions, covering subjects such as science, social studies, common-sense reasoning, and so on. The final score is the average accuracy across tasks, which provides a holistic view of the LLM's overall language understanding.
  • MGSM (Multilingual GSM8K): A test of around 250 grade-school arithmetic problems translated into multiple languages. The key to this test lies in the LLM's ability to understand the arithmetic problem in each language and EXPLAIN the reasoning behind its answer! Sounds interesting? Contrast this with standard GSM8K, which tests the same kind of arithmetic reasoning in English only.
  • MATH (Mathematics): LLMs are currently not well positioned for solving mathematical problems, because they are trained to generate the next word rather than to carry out a step-by-step mathematical problem-solving approach. That makes the ability to solve MATH benchmark problems an important factor in LLM performance, and there is active research on improving it. An interesting, yet-to-be-improved use case.
  • GPQA (Graduate-level Google-proof Q&A Benchmark): Designed to test scientific knowledge at a graduate level, with multiple-choice questions written by domain experts to be 'Google-proof' (the answers cannot be obtained by a simple Google search!). It is one of the toughest benchmarks for an LLM, with the best reported score at the time of writing only around 39%! The test rigorously measures an LLM's reasoning abilities on scientific topics.
  • DROP (Discrete Reasoning Over Paragraphs): Evaluates an LLM's ability to read a set of paragraphs, comprehend them, and perform discrete reasoning over them, such as counting, sorting, or arithmetic on the facts it has understood. This makes it valuable for assessing understanding beyond basic comprehension and closer to real-world applications.
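
To make one of these scores concrete, here is a minimal sketch (my own illustration in Python, not code from any official evaluation harness) of the unbiased pass@k estimator commonly reported alongside HumanEval: given n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples is correct.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c pass
    the unit tests, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures: every size-k draw contains a pass
    prob_all_fail = 1.0
    for i in range(k):  # product form avoids huge binomial coefficients
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

# 10 generations per problem, 3 of them pass: pass@1 is simply c/n
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

The benchmark's headline number is then the mean of pass@k over all problems in the dataset.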

So, there you go, the exciting world of LLM performance assessment tests! As a software testing professional, I'm sure you can't wait to get your hands on these benchmarks, test a thing or two along these lines, and see how your favorite LLM performs! Do let me know in the comments!

Apart from the tests above, there are other benchmarks such as SuperGLUE, SQuAD, the Cloze test, LAMBADA, and ANLG. Depending on your context, you can mix and match the benchmarks and choose your favorite LLM!
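
How might such mixing and matching look in practice? Here is a tiny sketch in Python; the model names, per-benchmark accuracies, and weights are all invented for illustration, not real leaderboard results. The idea is simply to weight each benchmark by how much it matters for your application.

```python
# Hypothetical per-benchmark accuracies for two imaginary models.
scores = {
    "model_a": {"HumanEval": 0.67, "MMLU": 0.82, "DROP": 0.78},
    "model_b": {"HumanEval": 0.72, "MMLU": 0.79, "DROP": 0.74},
}

# Weight benchmarks by relevance to *your* application; a code assistant,
# for example, might care most about HumanEval.
weights = {"HumanEval": 0.6, "MMLU": 0.2, "DROP": 0.2}

def weighted_score(per_benchmark: dict) -> float:
    """Weighted average of benchmark accuracies, normalized by total weight."""
    total = sum(weights.values())
    return sum(weights[b] * per_benchmark[b] for b in weights) / total

best = max(scores, key=lambda m: weighted_score(scores[m]))
print(best)  # model_b wins under these code-heavy weights
```

Change the weights to match a different use case (say, reading comprehension) and a different model may come out on top, which is exactly why a single leaderboard rank rarely tells the whole story.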

Guess which LLM is leading the race, as of yesterday, on the benchmarks discussed in detail above? GPT-4 Turbo. But it is still early in the race!

To discuss testing limited-scope LLMs and their benchmarks, please feel free to get in touch with me.
