Nowadays LLM performance is a daily topic! Like you, I used to go awestruck looking at those magical numbers whenever an article declared that so-and-so LLM had beaten every other LLM in the performance race and sat at the top of the list as of yesterday! It made my head spin for a while, but not anymore, because I have learnt to cut through the mumbo-jumbo, understand what those LLM performance indicators are, and see why they matter for my application.
Let's look at that list, which typically goes as follows:
- HumanEval: Measures an LLM's ability to generate correct code. This is not tested in the wild, but against a carefully curated set of programming problems, each built around a function signature, a docstring, a solution body, and unit tests; the model's generated code counts as correct only if it passes those tests. Results are usually reported as a pass@k score (see the first sketch after this list).
- MMLU (Massive Multitask Language Understanding): The LLM is tested on 57 tasks of multiple-choice questions, spanning subjects such as science, social studies, commonsense reasoning, and so on. The final score is derived by averaging accuracy across tasks (see the aggregation sketch after this list), which gives a holistic view of the LLM's overall language understanding.
- MGSM (Multilingual GSM8K): A test of around 250 grade-school arithmetic problems translated into multiple languages. The key lies in the LLM's ability to understand the arithmetic problem in each language and EXPLAIN the reasoning behind its answer! Sounds interesting? Contrast this with standard GSM8K, which tests the same kind of arithmetic reasoning in English only (see the illustrative item after this list).
- MATH (Mathematics): LLMs are currently not well positioned for solving competition-style mathematics problems, because they are trained to predict the next token rather than to follow a step-by-step problem-solving procedure. That makes performance on the MATH benchmark an important indicator, and it is an active area of research; grading typically compares only the model's final answer against the reference answer (see the answer-extraction sketch after this list). An interesting, yet still-to-be-improved, use case.
- GPQA (Graduate-level Google-proof Q&A Benchmark): Designed to test scientific knowledge at a graduate level. The questions are multiple-choice, written by domain experts to be 'Google-proof' (the answers cannot be obtained by a simple Google search!). It is one of the toughest benchmarks for an LLM, with the best reported score at only around 39% at the time of writing. The test rigorously measures an LLM's reasoning ability on scientific topics.
- DROP (Discrete Reasoning Over Paragraphs): Evaluates an LLM's ability to read a set of paragraphs, comprehend them, and reason over what it has read, typically by performing discrete operations such as counting, comparing, or arithmetic on facts spread through the text (see the toy example after this list). This makes it valuable for assessing understanding that goes beyond basic comprehension, closer to real-world applications.
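If you are curious how a HumanEval-style score is actually computed, here is a minimal Python sketch of the pass@k estimator commonly reported with that benchmark. The n, c, and k values in the example are made up purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples, drawn from n generated solutions of which c pass the unit
    tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 200 completions per problem, 37 of them pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))    # 0.185 -> pass@1
print(round(pass_at_k(n=200, c=37, k=10), 3))   # noticeably higher for pass@10
```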
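And here is a tiny sketch of how an MMLU-style score can be aggregated, assuming a simple unweighted average of per-task accuracies; the task names and numbers below are placeholders, not real results.

```python
# Placeholder per-task accuracies (MMLU has 57 tasks; only three shown here).
per_task_accuracy = {
    "high_school_physics": 0.62,
    "us_history": 0.71,
    "formal_logic": 0.48,
}

# One common aggregation: a plain unweighted average across tasks.
mmlu_style_score = sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"Aggregate score: {mmlu_style_score:.3f}")   # 0.603
```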
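To make the MGSM idea concrete, here is a made-up illustration of the same grade-school word problem posed in a few languages; the problem and translations are my own, not taken from the actual dataset.

```python
# MGSM items are GSM8K problems professionally translated into many
# languages, so treat this only as the shape of the data, not real content.
problem_by_language = {
    "en": "Lisa has 3 apples and buys 4 more. How many apples does she have now?",
    "de": "Lisa hat 3 Äpfel und kauft 4 weitere. Wie viele Äpfel hat sie jetzt?",
    "es": "Lisa tiene 3 manzanas y compra 4 más. ¿Cuántas manzanas tiene ahora?",
}
expected_answer = 7  # the correct answer is the same in every language

for lang, problem in problem_by_language.items():
    # Chain-of-thought style prompt: ask the model to show its reasoning first.
    prompt = f"{problem}\nLet's think step by step."
    print(lang, "->", prompt)
```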
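For MATH, grading usually comes down to comparing the model's final answer against the reference answer, which is conventionally wrapped in \boxed{...} in the solutions. Here is a simplified sketch of that extraction and comparison; real evaluation scripts do far more normalization.

```python
import re

def extract_boxed_answer(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a MATH-style solution.
    Simplified: the regex ignores nested braces, and real graders also
    normalize fractions, units, and LaTeX spacing before comparing."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

model_output = r"The total is $3 + 4 = 7$, so the answer is $\boxed{7}$."
reference_solution = r"... therefore the final answer is $\boxed{7}$."

print(extract_boxed_answer(model_output) == extract_boxed_answer(reference_solution))  # True
```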
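Finally, a toy (entirely made-up) DROP-style item, just to show the kind of discrete operation such questions demand on top of reading comprehension.

```python
# Answering this requires a discrete operation (subtraction) over facts
# spread through the passage, not just copying a text span.
passage = (
    "The home team scored touchdowns of 12, 45, and 3 yards in the first half, "
    "then added a 28-yard touchdown after the break."
)
question = "How many yards longer was the longest touchdown than the shortest one?"

touchdown_yards = [12, 45, 3, 28]   # the facts a careful reader extracts
answer = max(touchdown_yards) - min(touchdown_yards)
print(answer)                       # 42 -- a derived number, the kind DROP expects
```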
So, there you go, the exciting world of LLM performance assessment tests! As a Software Testing professional, I'm sure you can't wait to get hands-on, test a thing or two along these lines, and see how your favorite LLM performs! Do let me know in the comments!
Apart from the tests above, there are other benchmarks such as SuperGLUE, SQuAD, the Cloze test, LAMBADA, and ANLG. Depending on your context, you can mix and match the benchmarks and choose your favorite LLM!
Guess which LLM is leading the race, as of yesterday, on the benchmarks discussed above? GPT-4 Turbo. But it is still early in the race!
If you would like to discuss testing limited-scope LLMs and their benchmarks, please feel free to get in touch with me.