At Wit's End On LLM Performance?

An article by Venkat Ramakrishnan. Image: Digital Content Writers India @ Unsplash

Nowadays, LLM performance is a daily topic! Like you, I am awestruck by those magical numbers whenever an article declares that such-and-such LLM has beaten every other LLM in the performance race and sits at the top of the list as of yesterday! It used to make my head spin for a while, but not anymore, because I have learnt to cut through the mumbo-jumbo, understand what those LLM performance indicators actually measure, and judge how much they matter for my application.

Let's look at that list, which typically goes as follows:

  • HumanEval: Measures an LLM's ability to generate correct code. This is not tested in the wild, but against a carefully curated dataset of programming problems that exercises different aspects of code generation: function signatures, docstrings, code bodies, and unit tests that the generated solution must pass.
  • MMLU (Massive Multitask Language Understanding): The LLM is checked against around 57 tasks through multiple-choice questions, covering subjects such as science, social studies, common-sense reasoning, and so on. The final score is the average accuracy across tasks, which provides a holistic view of the LLM's overall language understanding.
  • MGSM (Multilingual GSM8K): A test of around 250 grade-school arithmetic problems translated into multiple languages. The key to this test lies in the LLM's ability to understand the arithmetic problem in each language and EXPLAIN the reasoning behind its answer! Sounds interesting? Contrast this with standard GSM8K, which tests the same kind of arithmetic reasoning in English only.
  • MATH (Mathematics): LLMs are currently not well positioned for solving mathematical problems, because they are trained to generate the next word rather than to carry out a step-by-step mathematical problem-solving approach. That makes the ability to solve MATH benchmark problems an important factor in LLM performance, and there is active research on improving it. An interesting, yet-to-be-improved use case.
  • GPQA (Graduate-level Google-proof Q&A Benchmark): Designed to test scientific knowledge at a graduate level, with multiple-choice questions written by domain experts to be 'Google-proof' (the answers cannot be obtained by a simple Google search!). It is one of the toughest benchmarks for an LLM, with the best reported score at the time of writing only around 39%! The test rigorously measures an LLM's reasoning abilities on scientific topics.
  • DROP (Discrete Reasoning Over Paragraphs): Evaluates an LLM's ability to read a set of paragraphs, comprehend them, and perform discrete reasoning over them, such as counting, sorting, or arithmetic on the facts it has understood. This makes it valuable for assessing understanding beyond basic comprehension and closer to real-world applications.
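
To make one of these scores concrete, here is a minimal sketch (my own illustration in Python, not code from any official evaluation harness) of the unbiased pass@k estimator commonly reported alongside HumanEval: given n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples is correct.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c pass
    the unit tests, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures: every size-k draw contains a pass
    prob_all_fail = 1.0
    for i in range(k):  # product form avoids huge binomial coefficients
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

# 10 generations per problem, 3 of them pass: pass@1 is simply c/n
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

The benchmark's headline number is then the mean of pass@k over all problems in the dataset.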

So, there you go, the exciting world of LLM performance assessment tests! As a software testing professional, I'm sure you can't wait to get your hands on these benchmarks, test a thing or two along these lines, and see how your favorite LLM performs! Do let me know in the comments!

Apart from the tests above, there are other benchmarks such as SuperGLUE, SQuAD, the Cloze test, LAMBADA, and ANLG. Depending on your context, you can mix and match the benchmarks and choose your favorite LLM!
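
How might such mixing and matching look in practice? Here is a tiny sketch in Python; the model names, per-benchmark accuracies, and weights are all invented for illustration, not real leaderboard results. The idea is simply to weight each benchmark by how much it matters for your application.

```python
# Hypothetical per-benchmark accuracies for two imaginary models.
scores = {
    "model_a": {"HumanEval": 0.67, "MMLU": 0.82, "DROP": 0.78},
    "model_b": {"HumanEval": 0.72, "MMLU": 0.79, "DROP": 0.74},
}

# Weight benchmarks by relevance to *your* application; a code assistant,
# for example, might care most about HumanEval.
weights = {"HumanEval": 0.6, "MMLU": 0.2, "DROP": 0.2}

def weighted_score(per_benchmark: dict) -> float:
    """Weighted average of benchmark accuracies, normalized by total weight."""
    total = sum(weights.values())
    return sum(weights[b] * per_benchmark[b] for b in weights) / total

best = max(scores, key=lambda m: weighted_score(scores[m]))
print(best)  # model_b wins under these code-heavy weights
```

Change the weights to match a different use case (say, reading comprehension) and a different model may come out on top, which is exactly why a single leaderboard rank rarely tells the whole story.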

Guess which LLM is leading the race, as of yesterday, on the benchmarks discussed in detail above? GPT-4 Turbo. But it is still early in the race!

To discuss testing limited-scope LLMs and their benchmarks, please feel free to get in touch with me.
