7 Ways to Test LLMs

In a very short time, large language models (LLMs) have seen remarkably rapid adoption. Numerous businesses have reaped enormous benefits from AI solutions, which offer creative answers to everything from straightforward jobs to intricate data processing. Despite their apparent benefits, LLMs have drawbacks, particularly their propensity for hallucinations. Determining whether a tool like Slack is appropriate for a business is a simple task; evaluating something like an LLM is far more difficult. This is where LLM testing comes in.

Because LLMs are relatively new in their current form, there are many competing standards for measuring their efficacy. Below, we’ll look at seven methods and standards for testing LLMs. While this is not an exhaustive list and is subject to change as the industry evolves, it does act as a good primer for organizations just stepping into this question.

1. BERTScore

BERTScore was introduced in 2019 to gauge how well LLM output matches a reference phrase. It is built on BERT, the Bidirectional Encoder Representations from Transformers language model that Google released in October 2018. A model's output is scored by embedding the tokens of the reference sentence and the tokens of the candidate sentence and comparing them; more precisely, the BERTScore is derived from the cosine similarity between these token embeddings. Precision, recall, and an F1 measure are then computed from those similarities to produce the top-level score.
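In practice, this is straightforward to try with the open-source bert-score Python package. The sketch below assumes the package is installed (pip install bert-score); the candidate and reference sentences are invented purely for illustration:

```python
# Minimal sketch of scoring a candidate sentence against a reference
# with the open-source `bert-score` package (pip install bert-score).
from bert_score import score

candidates = ["The model produced a fluent summary of the report."]
references = ["The system generated a fluent summary of the report."]

# Returns per-sentence precision, recall, and F1 tensors derived from
# cosine similarity between BERT token embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"Precision: {P.mean().item():.3f}  "
      f"Recall: {R.mean().item():.3f}  "
      f"F1: {F1.mean().item():.3f}")
```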

While BERTScore is still in use, its shortcomings have prompted the development of alternative approaches. The primary one is that BERTScore is limited to the languages its underlying model covers, which restricts its applicability to certain use cases, even if it is sufficient for widely used languages. The model is also rather large, and the approach is commonly perceived as brute force: it compares references to generated content without much regard for novel interpretation or meta-contextual transformation. This flaw has spurred further advancements in the field, most notably BLEURT, which applies regression training to produce a measure that reflects the contextuality and comprehensibility of the generated content in addition to the original BERTScore criteria.

2. ROUGE

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a metric first proposed in 2004 as a methodology for comparing generated text, typically a summary or translation, against an original reference text. Notably, ROUGE is not a standalone metric and instead has several sub-metrics (a short scoring sketch follows the list):

  • ROUGE-N: This metric measures the n-grams matching between the reference text and the generated text. This roughly estimates the overall similarity of the text by measuring how many words, and in what order, match between the two sources.
  • ROUGE-1 and ROUGE-2: These metrics consider the matches at the level of unigrams and bigrams, adding granularity to the n-gram matching sequences computed in ROUGE-N.
  • ROUGE-L: This metric is based on the longest common subsequence (LCS) between the output candidate and the reference. Because the LCS rewards words that appear in the same order without requiring them to be contiguous, it captures sentence-level structure rather than just isolated n-gram overlap, and long shared subsequences make near-verbatim copying between output and reference easy to spot.
  • ROUGE-S: This metric is a skip-gram test, allowing the detection of n-grams that would match but are nonetheless separated by additional context or words. This allows for the detection of matches that occur with additional separation. For instance, ‘AI testing methods’ would be detected in this model even if the generated text said ‘AI and LLM testing methods,’ allowing for more fuzzy logic within the overall metric.
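As a quick illustration of these sub-metrics in practice, the sketch below uses Google's open-source rouge-score package (pip install rouge-score). Note that it implements the ROUGE-N variants and ROUGE-L rather than ROUGE-S, and the example strings here are invented:

```python
# Sketch of computing ROUGE variants with the `rouge-score` package
# (pip install rouge-score); example strings are invented.
from rouge_score import rouge_scorer

reference = "AI testing methods are evolving quickly"
candidate = "AI and LLM testing methods are evolving quickly"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    # Each result carries precision, recall, and F-measure for that variant.
    print(f"{name}: P={result.precision:.2f} "
          f"R={result.recall:.2f} F={result.fmeasure:.2f}")
```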

The main drawback of ROUGE is that it is based on syntactic overlap rather than semantic understanding. Because it only counts surface matches, it scores output on visual similarity rather than similarity in meaning. For this reason, it has been criticized as measuring recall of wording rather than the quality of the output. That said, this very focus has made it useful for detecting text that originates from LLMs or from direct copying.

3. BLEU

BLEU, or Bilingual Evaluation Understudy, is an older metric first published by researchers from IBM in 2002. Originally, BLEU was specifically concerned with machine translation, comparing the similarity of n-grams between a high-quality reference model and a specific segment from an output text. Scoring the relationship between 0 and 1, with 1 being perfect, BLEU was essentially comparing the quality of the translation to a “known-perfect” reference model.

The metric is machine-translation-focused, but it has a significant design flaw: it ranks output by correlation with its reference and assumes there is only one legitimate translation, or one fixed collection of translations. That is tolerable for tiny samples, such as comparing "konnichiwa" to "hello," but it ignores other appropriate translations, such as "good afternoon." Because BLEU is so widely used in natural language processing (NLP), this has encouraged a habit of scoring against an approved set of outputs without much justification for why the accepted set is not larger in the first place.

BLEU is also highly sensitive to tokenization, since the underlying n-gram counts are computed over tokens. As with other token-based metrics, BLEU numbers are frequently reported without noting the tokenization that produced them, which means scores can vary greatly even on a shared corpus and limits their comparability in most situations.
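One practical response to this tokenization problem is the sacrebleu package, which applies a standardized tokenizer so corpus-level scores are comparable across runs. A minimal sketch, assuming sacrebleu is installed and using invented example strings:

```python
# Sketch of corpus-level BLEU with `sacrebleu` (pip install sacrebleu),
# which applies a standard tokenizer so scores are comparable across runs.
import sacrebleu

hypotheses = ["good afternoon, how are you?"]
references = [["hello, how are you?"]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
# sacrebleu reports BLEU on a 0-100 scale rather than the 0-1 convention above.
print(bleu.score)
```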

4. MMLU and MMLU Pro

The MMLU, or Massive Multitask Language Understanding test, allows LLMs to be tested against tasks spanning specific domains of expertise. It was first proposed in September 2020 in a paper by a group of LLM and AI researchers as a novel test to measure accuracy across a wide set of domains. The test relies on question-and-answer pairs representing advanced knowledge across 57 topics, including mathematics, law, world history, and more. Because the full question set and answer key are publicly available, there are ongoing concerns about the benchmark leaking into training data and contaminating results, making this a useful but imperfect test.

Each MMLU item is multiple choice, so an LLM's response is scored simply on whether it selects the correct option. The results are then aggregated into a numerical accuracy score, typically reported overall and broken down by subject.
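To make the scoring mechanics concrete, here is an illustrative sketch of an MMLU-style accuracy loop. The query_model function is a hypothetical stand-in for whatever client calls your LLM and returns an answer letter; it is not part of the MMLU release itself:

```python
# Illustrative sketch of multiple-choice scoring in the MMLU style.
# `query_model` is a hypothetical stand-in for a real LLM client call.
from typing import Callable

def mmlu_accuracy(questions: list[dict], query_model: Callable[[str], str]) -> float:
    """Each question dict holds 'question', 'choices' (list of 4), and 'answer' (e.g. 'B')."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {choice}" for letter, choice in zip("ABCD", q["choices"])
        ) + "\nAnswer:"
        # Count the item as correct if the model's reply starts with the right letter.
        if query_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)
```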

The MMLU test has come under fire for question-answer pairs with accuracy problems and possible bias, as well as questions that are poorly worded or ambiguously structured. A team of researchers created MMLU Pro in June 2024 in an effort to address these problems. By fixing errors and semantic problems in the original dataset, the new test aimed to raise overall difficulty while reducing the answer variability caused by prompt variance. MMLU Pro is now used instead of, or in addition to, the original MMLU test, despite ongoing criticism over bias and accuracy in both benchmarks.

5. GLUE

GLUE, the General Language Understanding Evaluation, is a benchmark that is purposefully general and decoupled from specific tasks. Unlike BLEU or ROUGE, it is meant to be a holistic, general-purpose benchmark spanning nine NLP tasks, including sentiment analysis, question answering, sentence similarity (akin to ROUGE), and more.

The express design purpose behind GLUE is split into three core components:

  • A benchmark of nine sentence or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.
  • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language.
  • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.

Because GLUE is entirely model and platform-agnostic, it can compare LLMs across different formats, structures, and approaches. This removes the limitations of previous metrics, which were designed specifically for sub-tasks such as language translation, even if they were ultimately used for something more broadly applicable.
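For a sense of what running a single GLUE task looks like in practice, the sketch below loads the SST-2 sentiment task and its matching metric through the Hugging Face datasets and evaluate libraries (pip install datasets evaluate). The predictions are placeholders; a real run would come from the model under test:

```python
# Sketch of scoring one GLUE task (SST-2 sentiment) with the Hugging Face
# `datasets` and `evaluate` libraries (pip install datasets evaluate).
from datasets import load_dataset
import evaluate

sst2 = load_dataset("glue", "sst2", split="validation")
metric = evaluate.load("glue", "sst2")

# Placeholder predictions: a real evaluation would use model outputs here.
predictions = [1] * len(sst2)
result = metric.compute(predictions=predictions, references=sst2["label"])
print(result)  # e.g. {'accuracy': ...}
```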

The biggest criticism of GLUE is that it takes a "one size fits all" approach to benchmarking, reducing quality and performance to a single aggregate number. To rectify this, an additional benchmark, SuperGLUE, has been introduced, representing a "new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard."

6. G-Eval

G-Eval is very much a response to the issues inherent in previous testing metrics, and as such, it takes a different tack on testing altogether. In essence, G-Eval focuses on the context of the generated content rather than just its semantic similarity to a reference set. By using chain-of-thought (CoT) prompting, in which the evaluating LLM first generates a chain of reasoning steps and then applies it as part of the ongoing evaluation dialogue, the internal logic of the assessment becomes as important as the resultant text.
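The flow is easier to see as code. The sketch below is an illustrative G-Eval-style loop, not the official implementation: call_llm is a hypothetical wrapper around whichever LLM API serves as the evaluator, and the coherence criterion is just one example of the rubrics G-Eval scores against:

```python
# Illustrative sketch of a G-Eval-style flow: the evaluator model first
# drafts evaluation steps (chain of thought), then scores the output.
# `call_llm` is a hypothetical wrapper around whichever LLM API you use.

def g_eval_coherence(source_text: str, generated_summary: str, call_llm) -> str:
    criteria = "Coherence (1-5): the summary should be well-structured and logically ordered."

    # Step 1: ask the evaluator to produce its own evaluation steps.
    steps = call_llm(
        f"You will evaluate a summary for the following criterion:\n{criteria}\n"
        "Write the evaluation steps you will follow."
    )

    # Step 2: apply those steps to the source/summary pair and return a score.
    return call_llm(
        f"Criterion:\n{criteria}\n\nEvaluation steps:\n{steps}\n\n"
        f"Source text:\n{source_text}\n\nSummary:\n{generated_summary}\n\n"
        "Follow the steps and respond with a single score from 1 to 5."
    )
```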

Although this appears to be a minor change, in practice it is quite significant. In LLM systems, many hallucinations and other problems occur in follow-up generation rather than in the initial prompt or the first line of output. Simple syntax testing checks the accuracy of the output against a reference, but it never tests the reasoning behind the process. That is acceptable for simple output; for complex systems, however, a measure of logical consistency and correctness is crucial, particularly as LLM processing pipelines grow increasingly intricate.

Although this has led to significant advancements in accuracy measurement and LLM evaluation, G-Eval is still, in many respects, constrained by the usual problems of benchmarking. Notably, it remains reliant on the evaluator model and the data behind it, which can introduce bias and reward agreement with that model rather than alignment with human judgment of the output. Higher-quality datasets can help, but because G-Eval uses GPT-4 as its judge, its single source of truth is a model whose training material is largely undisclosed.

7. HELM

HELM, or the Holistic Evaluation of Language Models, is a unique metric. First published in October 2022, it focuses more on a comprehensive approach than on any particular attribute. Its model stretches across seven metrics, as noted in the original paper: “We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time).”

Notably, this means HELM is less concerned with a response's proximity to a reference sentence or prompt and more focused on the aim of the generation. To put it another way, HELM assesses quality in terms of an answer's logical coherence, any misinformation contextually associated with it, and the degree to which the output aligns with the intended purpose of its creation. It is crucial to remember that HELM still depends heavily on a number of datasets, including MMLU, LegalBench, MedQA, and OpenbookQA. Because of this, even though its algorithm and internal logic are distinct, it still relies in part on potentially skewed or inaccurate content.
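To illustrate the "holistic" idea without the HELM toolkit itself, the toy sketch below aggregates several metrics per scenario into a single comparison table; the scenario names and scores are invented placeholders:

```python
# Not the HELM toolkit itself: a toy illustration of holistic reporting,
# aggregating several metrics per scenario into one comparison table.
# Scenario names and scores are invented placeholders.

results = {
    "question_answering": {"accuracy": 0.71, "robustness": 0.64, "toxicity": 0.02},
    "summarization":      {"accuracy": 0.58, "robustness": 0.55, "toxicity": 0.01},
}

metrics = ["accuracy", "robustness", "toxicity"]
print(f"{'scenario':<20}" + "".join(f"{m:>12}" for m in metrics))
for scenario, scores in results.items():
    print(f"{scenario:<20}" + "".join(f"{scores[m]:>12.2f}" for m in metrics))
```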
