Nowadays LLM performance is a daily topic! Like you, I used to go awestruck looking at those magical numbers whenever an article declared that so-and-so LLM had beaten every other LLM in the performance race and sat at the top of the list as of yesterday! It made my head spin for a while, but not anymore, because I have learnt to cut through the mumbo-jumbo, understand what those LLM performance indicators are, and see why they matter for my application.
Let's look at that list, which typically goes as follows:
- HumanEval: Measures an LLM's ability to generate correct code. This is not tested in the wild, but against a carefully curated set of programming problems, each built around a function signature, a docstring, a solution body, and unit tests; the model's generated code counts as correct only if it passes those tests. Results are usually reported as a pass@k score (see the first sketch after this list).
- MMLU (Massive Multitask Language Understanding): The LLM is tested on 57 tasks of multiple-choice questions, spanning subjects such as science, social studies, commonsense reasoning, and so on. The final score is derived by averaging accuracy across tasks (see the aggregation sketch after this list), which gives a holistic view of the LLM's overall language understanding.
- MGSM (Multilingual GSM8K): A test of around 250 grade-school arithmetic problems translated into multiple languages. The key lies in the LLM's ability to understand the arithmetic problem in each language and EXPLAIN the reasoning behind its answer! Sounds interesting? Contrast this with standard GSM8K, which tests the same kind of arithmetic reasoning in English only (see the illustrative item after this list).
- MATH (Mathematics): LLMs are currently not well positioned for solving competition-style mathematics problems, because they are trained to predict the next token rather than to follow a step-by-step problem-solving procedure. That makes performance on the MATH benchmark an important indicator, and it is an active area of research; grading typically compares only the model's final answer against the reference answer (see the answer-extraction sketch after this list). An interesting, yet still-to-be-improved, use case.
- GPQA (Graduate-level Google-proof Q&A Benchmark): Designed to test scientific knowledge at a graduate level. The questions are multiple-choice, written by domain experts to be 'Google-proof' (the answers cannot be obtained by a simple Google search!). It is one of the toughest benchmarks for an LLM, with the best reported score at only around 39% at the time of writing. The test rigorously measures an LLM's reasoning ability on scientific topics.
- DROP (Discrete Reasoning Over Paragraphs): Evaluates an LLM's ability to read a set of paragraphs, comprehend them, and reason over what it has read, typically by performing discrete operations such as counting, comparing, or arithmetic on facts spread through the text (see the toy example after this list). This makes it valuable for assessing understanding that goes beyond basic comprehension, closer to real-world applications.
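If you are curious how a HumanEval-style score is actually computed, here is a minimal Python sketch of the pass@k estimator commonly reported with that benchmark. The n, c, and k values in the example are made up purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples, drawn from n generated solutions of which c pass the unit
    tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 200 completions per problem, 37 of them pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))    # 0.185 -> pass@1
print(round(pass_at_k(n=200, c=37, k=10), 3))   # noticeably higher for pass@10
```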
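And here is a tiny sketch of how an MMLU-style score can be aggregated, assuming a simple unweighted average of per-task accuracies; the task names and numbers below are placeholders, not real results.

```python
# Placeholder per-task accuracies (MMLU has 57 tasks; only three shown here).
per_task_accuracy = {
    "high_school_physics": 0.62,
    "us_history": 0.71,
    "formal_logic": 0.48,
}

# One common aggregation: a plain unweighted average across tasks.
mmlu_style_score = sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"Aggregate score: {mmlu_style_score:.3f}")   # 0.603
```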
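To make the MGSM idea concrete, here is a made-up illustration of the same grade-school word problem posed in a few languages; the problem and translations are my own, not taken from the actual dataset.

```python
# MGSM items are GSM8K problems professionally translated into many
# languages, so treat this only as the shape of the data, not real content.
problem_by_language = {
    "en": "Lisa has 3 apples and buys 4 more. How many apples does she have now?",
    "de": "Lisa hat 3 Äpfel und kauft 4 weitere. Wie viele Äpfel hat sie jetzt?",
    "es": "Lisa tiene 3 manzanas y compra 4 más. ¿Cuántas manzanas tiene ahora?",
}
expected_answer = 7  # the correct answer is the same in every language

for lang, problem in problem_by_language.items():
    # Chain-of-thought style prompt: ask the model to show its reasoning first.
    prompt = f"{problem}\nLet's think step by step."
    print(lang, "->", prompt)
```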
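For MATH, grading usually comes down to comparing the model's final answer against the reference answer, which is conventionally wrapped in \boxed{...} in the solutions. Here is a simplified sketch of that extraction and comparison; real evaluation scripts do far more normalization.

```python
import re

def extract_boxed_answer(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a MATH-style solution.
    Simplified: the regex ignores nested braces, and real graders also
    normalize fractions, units, and LaTeX spacing before comparing."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

model_output = r"The total is $3 + 4 = 7$, so the answer is $\boxed{7}$."
reference_solution = r"... therefore the final answer is $\boxed{7}$."

print(extract_boxed_answer(model_output) == extract_boxed_answer(reference_solution))  # True
```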
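Finally, a toy (entirely made-up) DROP-style item, just to show the kind of discrete operation such questions demand on top of reading comprehension.

```python
# Answering this requires a discrete operation (subtraction) over facts
# spread through the passage, not just copying a text span.
passage = (
    "The home team scored touchdowns of 12, 45, and 3 yards in the first half, "
    "then added a 28-yard touchdown after the break."
)
question = "How many yards longer was the longest touchdown than the shortest one?"

touchdown_yards = [12, 45, 3, 28]   # the facts a careful reader extracts
answer = max(touchdown_yards) - min(touchdown_yards)
print(answer)                       # 42 -- a derived number, the kind DROP expects
```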
So, there you go, the exciting world of LLM performance assessment tests! As a Software Testing professional, I'm sure you can't wait to get hands-on, test a thing or two along these lines, and see how your favorite LLM performs! Do let me know in the comments!
Apart from the tests above, there are other benchmarks such as SuperGLUE, SQuAD, the Cloze test, LAMBADA, and ANLG. Depending on your context, you can mix and match the benchmarks and choose your favorite LLM!
Guess which LLM is leading the race, as of yesterday, on the benchmarks discussed above? GPT-4 Turbo. But it is still early in the race!
If you would like to discuss testing limited-scope LLMs and their benchmarks, please feel free to get in touch with me.