Quantitative Evaluation of LLM Responses with RAG-based Question-answering Chatbots
Image by the author using Midjourney

Examining quantitative metrics, including sacreBLEU, TER, ChrF, ChrF++, BERTScore, ROUGE, METEOR, and Semantic Similarity, to evaluate and compare the quality of responses from more than a dozen popular Large Language Models (LLMs)

To quote LangChain’s recent post, Conversational Retrieval Agents, “LLMs only know what they are trained on. To combat this, a style of generation known as “retrieval augmented generation” has emerged. In this technique, documents are retrieved and then inserted into the prompt, and the language model is instructed to only respond based on those documents. This helps both in giving the language model additional context as well as in keeping it grounded.” Similarly, to paraphrase Meta’s original September 2020 post, Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models, “Retrieval Augmented Generation (RAG) has two sources of knowledge: the knowledge that seq2seq models store in their parameters (parametric memory) and the knowledge stored in the corpus from which RAG retrieves passages (nonparametric memory). Further, these two sources complement each other. We [Meta] found that RAG uses its nonparametric memory to “cue” the seq2seq model into generating correct responses, essentially combining the flexibility of the “closed-book” or parametric-only approach with the performance of “open-book” or retrieval-based methods.”

When building RAG-based Generative AI applications, such as conversational and question-answering chatbots, an endlessly expanding choice of LLMs (aka Foundation Models) is available. With so many LLM options, how do we qualitatively and quantitatively evaluate and compare the responses from multiple LLMs to a single user prompt consistently and repeatedly? This blog post will explore several quantitative metrics to compare and contrast the responses (answers) from over a dozen currently popular LLMs to individual user prompts (questions) when using RAG-based architectures.

Since this post is focused on RAG-based generative responses, the responses must be not only accurate but accurate to the nonparametric memory (supplied contextual reference). Contextual reference increases LLM response accuracy and helps negate hallucinations. For this blog, the contextual reference will be collections of web-based documents indexed using Amazon Kendra. Responses to user prompts should be based on the corpus of knowledge (relevant documents) contained in the Kendra indexes. The ability to understand, extract, and synthesize the relevant document’s contents and formulate a coherent response is the responsibility of the LLM and its parametric memory.

Although there are many similarities in the types of Generative AI applications organizations are currently building, each organization’s contextual reference (intellectual property, organizational knowledge, corpus of specific content), and the desired communication style as personified in the generative AI application (tone, tenor, pace, mood, voice, syntax, diction, and length of response) are uniquely different. This makes globally evaluating responses beyond just simple correctness and completeness difficult. Ultimately, each organization must decide if they feel the quality of response they are getting from the LLM meets their requirements.

Large Language Models

The following LLMs were examined as part of the post, in alphabetical order:

  1. AI21 Labs Jurassic-2 Mid (fka Grande Instruct)
  2. AI21 Labs Jurassic-2 Ultra (fka Jumbo Instruct)
  3. Amazon Titan Text Large
  4. Anthropic Claude v1.3
  5. Anthropic Claude Instant v1.1
  6. Anthropic Claude v2
  7. Cohere Generate Command
  8. Cohere Generate Command Light
  9. Google Flan-T5 XL
  10. Google Flan-T5 XXL FP16
  11. Meta Llama-2 13B Chat HF
  12. OpenAI GPT-3.5 Turbo
  13. OpenAI GPT-4

Testing Approach

Disclaimer: This post’s approach to testing LLMs is specific to RAG-based question-answering chatbots and the technologies, indexed content, prompt template, prompts, LLM parameters, and versions of software packages used. The test results are not meant to provide a definitive opinion on the quality of the LLMs tested. Your use cases, results, and qualitative and quantitative requirements will differ.

Preview of Automated LLM Quantitative Evaluation Test Harness (No Audio)

Variability

Many variables impact the response of an LLM, including:

  1. Model variations: Parameter size (3b, 7b, 13b, 40b, 70b), model type (instruct, chat, light, instant, hf), precision (fp32, fp16), and quantization
  2. Model parameters: Maximum tokens, temperature, top-p, top-k, and frequency penalty
  3. User prompt: Changing even a single word or punctuation may result in vastly different responses
  4. Prompt template: Determines tone, tenor, pace, mood, voice, syntax, diction, format, and length of response and helps enforce the use of the supplied contextual reference
  5. Nonparametric memory: Supplied contextual reference passed to the LLM as part of RAG from the Amazon Kendra index

This post will focus on how the choice of model impacts the quality of the response. The prompt, prompt template, and supplied contextual reference of relevant documents passed to the models from Amazon Kendra will be identical for each prompt.

Model Parameters

Available parameters and parameter scales vary across models. All model temperatures will be set to approximately zero. Since not all models accept zero (0) as a temperature value, all model temperatures will be set to 0.0000000001 (1e-10 or 10^(–10)). A temperature of effectively zero should ensure consistent responses to the same prompt. We will somewhat arbitrarily limit the maximum number of tokens in the response to 1,024, since most accepted responses are well within this limit. The tests will rely on the model’s default values for the remaining model parameters.
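
As a minimal illustration of these settings, the common values applied to every model might be collected in a simple dictionary like the one below. The parameter names are generic placeholders; each provider (AI21, Amazon, Anthropic, Cohere, Google, Meta, OpenAI) uses its own names for temperature and maximum output tokens.

# Sketch: common inference parameters applied to every model in these tests
# Note: names are illustrative placeholders; each provider's API uses its own
# parameter names for temperature and maximum output tokens

COMMON_MODEL_PARAMETERS = {
    "temperature": 1e-10,  # effectively zero; some models reject an exact 0
    "max_tokens": 1024,    # generous cap; accepted responses fit well within it
}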

Kendra Indexes and Prompts

To compare and contrast the responses from multiple LLMs and explore different metrics, the post uses three different prompts for three different Amazon Kendra indexes for a total of nine different prompts or 117 responses (9 responses/model x 13 models). Each response will be compared to what was determined, subjectively, to be the correct response(s) for the prompt. All three indexes were built using a web crawler and a pre-defined sitemap to precisely control what content was indexed and made available to the LLM.

Index 1: Amazon SageMaker Documentation (official online docs)

  • Prompt 1: “What is Amazon SageMaker?”
  • Prompt 2: “What is Amazon SageMaker Inference Recommender?”
  • Prompt 3: “How can I label data using a human?”

Index 2: Physical Geology — 2nd Edition (PDF-based book from opentextbc.ca)

  • Prompt 4: “Describe the role of a geologist.”
  • Prompt 5: “What is an unconformity, and how many different types are there?”
  • Prompt 6: “What are the main factors that control the metamorphic process?”

Index 3: Investor.gov Investment Guidance (sponsored by SEC.gov)

  • Prompt 7: “What is a Ponzi scheme, and how can I avoid them?”
  • Prompt 8: “What types of investment products are available?”
  • Prompt 9: “What is a Public Company?”
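
Conceptually, the evaluation harness loops over each model and each of the nine prompts above, captures the generated response, and scores it against the accepted response. The sketch below illustrates that flow; query_llm() and the abbreviated lists are hypothetical placeholders, not the actual test harness code.

# Sketch of the evaluation loop (illustrative only, not the actual test harness)
# query_llm() is a hypothetical helper that sends the prompt, along with the
# context retrieved from the Amazon Kendra index, to the given model and
# returns the model's response as a string.

model_ids = ["ai21-jurassic-2-ultra", "anthropic-claude-v2", "openai-gpt-4"]  # abbreviated

prompts = {
    "prompt_1": "What is Amazon SageMaker?",
    "prompt_4": "Describe the role of a geologist.",
}  # abbreviated; nine prompts in total

accepted_responses = {
    "prompt_1": "Amazon SageMaker is a fully managed machine learning service...",
    "prompt_4": "Geologists study the Earth, its materials, and its processes...",
}  # subjectively determined accepted answers (abbreviated)

for model_id in model_ids:
    for prompt_id, prompt in prompts.items():
        response = query_llm(model_id, prompt)  # hypothetical helper
        reference = accepted_responses[prompt_id]
        calculate_bleu_score(model_id, response, reference)  # defined later in this post
        calculate_bert_score(model_id, response, reference)
        calculate_rouge_score(model_id, response, reference)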

Quantitative Metrics

There are many different quantitative metrics we could use to evaluate and compare the responses from multiple LLMs. Ultimately, none of the chosen ones will likely mirror your own qualitative opinion. The goal is to find correlations between certain quantitative metrics and your qualitative opinions. The quantitative metrics used in this post are:

  1. BiLingual Evaluation Understudy (BLEU)
  2. Translation Error Rate (TER)
  3. CHaRacter-level F-score (ChrF)
  4. ChrF++
  5. BERTScore (Bidirectional Encoder Representations from Transformers)
  6. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
  7. Metric for Evaluation of Translation with Explicit ORdering (METEOR)
  8. Semantic Similarity (using cosine similarity)
  9. Response Length
  10. No Response (failure to respond)
  11. Response Outside Supplied Contextual Reference

Final results can be found toward the end of this post, see: Quantitative Evaluation Results. Quantitative metrics not measured in this post but equally important when choosing an LLM are model bias, maximum tokens, performance (latency), cost, and licensing.

1. BiLingual Evaluation Understudy (BLEU)

According to Wikipedia, BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” — this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgments of quality and remains one of the most popular automated and inexpensive metrics. According to Machine Translate, BLEU calculates the similarity between a machine translation output and a reference translation using n-gram precision.

Example BLEU Scores for Prompt #1 LLM Responses

SacreBLEU

According to the project’s GitHub repository, SacreBLEU (Post, 2018) provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization.

# Snippet of SacreBLEU implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24
# Reference: https://github.com/mjpost/sacrebleu#using-sacrebleu-from-python

from sacrebleu.metrics import BLEU, CHRF, TER

models = []
hyp_len = []
bleu_metrics = []
ter_metrics = []
chrf_metrics = []
chrfpp_metrics = []

def calculate_bleu_score(model, response, ref):
    reference = [ref]
    hypothesis = [response]

    models.append(model)

    bleu = BLEU()
    result = bleu.corpus_score(hypothesis, reference)
    hyp_len.append(str(result).split(" ")[12])
    bleu_metrics.append(str(result).split(" ")[2])
    print(f"Model: {model}\n{result}")

    ter = TER()
    result = ter.corpus_score(hypothesis, reference)
    ter_metrics.append(str(result).split(" ")[2])
    print(result)

    chrf = CHRF()
    result = chrf.corpus_score(hypothesis, reference)
    chrf_metrics.append(str(result).split(" ")[2])
    print(result)

    chrfpp = CHRF(word_order=2)
    result = chrfpp.corpus_score(hypothesis, reference)
    chrfpp_metrics.append(str(result).split(" ")[2])
    print(f"{result}\n")        

2. Translation Error Rate (TER)

According to Machine Translate, TER (Translation Error Rate) is a metric for automatic evaluation of machine translation that calculates the number of edits required to change a machine translation output into one of the references. We will use SacreBLEU’s TER implementation.


Example TER Scores for Prompt #1 LLM Responses

3. CHaRacter-level F-score (ChrF)

According to Machine Translate, ChrF (CHaRacter-level F-score) is a metric for machine translation evaluation that calculates the similarity between a machine translation output and a reference translation using character n-grams, not word n-grams. Metrics based on word n-grams are especially problematic for high-morphology languages (we are sticking with English, which has a less complex morphology). We will use SacreBLEU’s ChrF implementation.


Example ChrF Scores for Prompt #1 LLM Responses

4. ChrF++

According to HuggingFace, ChrF and ChrF++ are two MT evaluation metrics. They both use the F-score statistic for character n-gram matches, and ChrF++ adds word n-grams, which correlates more strongly with direct assessment. We will use SacreBLEU’s ChrF++ implementation.


Example ChrF++ Scores for Prompt #1 LLM Responses

5. BERTScore

According to Machine Translate, BERTScore is an evaluation metric using BERT (Bidirectional Encoder Representations from Transformers) sentence representations. BERTScore is a metric for automatic evaluation of machine translation that calculates the similarity between a machine translation output and a reference translation using sentence representation. BERTScore was invented as an improvement on n-gram-based metrics like BLEU.

# Snippet of BERTScore implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24
# Reference: https://github.com/Tiiiger/bert_score#python-function

from bert_score import BERTScorer

bert_score_p_metrics = []
bert_score_r_metrics = []
bert_score_f1_metrics = []

def calculate_bert_score(model, response, ref):
    reference = [[ref]]
    hypothesis = [response]

    scorer = BERTScorer(lang="en", rescale_with_baseline=True)

    P, R, F1 = scorer.score(hypothesis, reference)
    bert_score_p_metrics.append(P[0].item())
    bert_score_r_metrics.append(R[0].item())
    bert_score_f1_metrics.append(F1[0].item())

    print(f"Model: {model}\nPrecision: {P[0].item():.3f}\nRecall: {R[0].item():.3f}\nF1: {F1[0].item():.3f}\n")        
Example BERTScores for Prompt #1 LLM Responses

6. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

According to the paper, ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004), ROUGE includes several automatic evaluation methods that measure the similarity between summaries. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. We will use two ROUGE measures in the post:

  • ROUGE-N: (ROUGE-1, ROUGE-2, ROUGE-N) is an n-gram recall between a candidate summary and a set of reference summaries, where n stands for the length of the n-gram.
  • ROUGE-L: Given two sequences X and Y, the Longest Common Subsequence (LCS) of X and Y is a common subsequence with maximum length. LCS has been used in identifying cognate candidates during the construction of the N-best translation lexicon from parallel text.

# Snippet of ROUGE implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24
# Reference: https://github.com/pltrdy/rouge#as-a-library

from rouge import Rouge
import json

rouge_1_f_metrics = []
rouge_2_f_metrics = []
rouge_3_f_metrics = []
rouge_l_f_metrics = []

def calculate_rouge_score(model, response, ref):
    reference = ref
    hypothesis = response

    rouge = Rouge(metrics=["rouge-1", "rouge-2", "rouge-3", "rouge-l"])
    scores = rouge.get_scores(hypothesis, reference, avg=True)
    rouge_1_f_metrics.append(scores["rouge-1"]["f"])
    rouge_2_f_metrics.append(scores["rouge-2"]["f"])
    rouge_3_f_metrics.append(scores["rouge-3"]["f"])
    rouge_l_f_metrics.append(scores["rouge-l"]["f"])

    print(f"Model: {model}\nScores: {json.dumps(scores, indent=2)}\n")        
Example ROUGE F1 Scores for Prompt #1 LLM Responses

7. Metric for Evaluation of Translation with Explicit ORdering (METEOR)

According to Wikipedia, METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine-translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce a good correlation with human judgment at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.

# Snippet of METEOR implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-27
# Reference: https://www.nltk.org/howto/meteor.html

import nltk
from nltk.translate import meteor
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

meteor_metrics = []

def calculate_meteor_score(model, answer, ref):
    reference = ref
    hypothesis = answer

    score = meteor(
        [word_tokenize(reference)],
        word_tokenize(hypothesis)
    )
    meteor_metrics.append(score)

    print(f"Model: {model}\nScore: {score:.3f}\n")        
Example METEOR Scores for Prompt #1 LLM Responses

8. Semantic Textual Similarity

Semantic Textual Similarity is the task of evaluating how similar two texts are in terms of meaning. In this post, dense vector embeddings of 768 dimensions are created for the accepted response and each model’s response using one of SentenceTransformer’s pre-trained sentence-transformer models. Then, cosine similarity, a measure of similarity between two non-zero vectors defined in an inner product space, is computed from the pair of embeddings, again using SentenceTransformer. Cosine similarity generally lies in the interval [−1, 1]: two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and in practice the scores for these sentence embeddings fall within [0, 1].

# Snippet of Semantic Similarity implementation for testing RAG-based Conversational Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24

from sentence_transformers import SentenceTransformer, util

similarity_metrics = []

def compute_cosine_similarity(model, response, ref):
    # choose the SentenceTransformer embedding model
    # (named `embedder` so the LLM name passed in as `model` is not overwritten)
    embedder = SentenceTransformer("all-mpnet-base-v2")

    # compute dense vector embeddings (768 dimensions)
    reference_embedding = embedder.encode(ref, convert_to_tensor=True)
    response_embedding = embedder.encode(response, convert_to_tensor=True)

    # compute cosine similarity
    cosine_score = util.cos_sim(reference_embedding, response_embedding)
    similarity_metrics.append(cosine_score[0].item())

    print(f"Model: {model}\nCosine Similarity Score: {cosine_score[0].item():.3f}\n")
Example Semantic Textual Similarity Scores for Prompt #1 LLM Responses

9. Response Length

Response length will vary significantly across models for the same prompt. The n-gram, word, or character length is not a consistent indicator of accuracy when judging the quality of a response. However, a terse or very long response might suggest issues with the response. A terse response may indicate that the model could not generate a response (e.g., “don’t know” — the default response for this post) or that the response was incomplete given the context of the prompt. A very long response may indicate that the model was unnecessarily verbose in its response, or that the response demonstrated the repetition problem — a repetition of subsequences (arXiv:2012.14660). We will use SacreBLEU’s word length implementation to compare the token length of the accepted response with the responses of the different models.

Response Length for Prompt #1 LLM Responses
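
As a rough illustration (a sketch, not the harness code), the length comparison boils down to a word-count ratio between each response and the accepted response; a simple whitespace split approximates sacreBLEU’s hyp_len.

# Sketch: compare response length (word count) to the accepted response length
# A plain whitespace split approximates sacreBLEU's hyp_len for illustration
def response_length_ratio(response: str, ref: str) -> float:
    return len(response.split()) / len(ref.split())

ratio = response_length_ratio(
    "don't know",
    "Amazon SageMaker is a fully managed machine learning service.",
)
print(f"Response length ratio: {ratio:.2f}")  # well below 1.0 suggests a terse or failed response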

10. No Response

Based on the prompt template used for the tests, if an LLM does not know the answer, it is instructed to return the response, “don’t know.” Given that the accepted responses for all nine prompts were clearly within the supplied contextual reference passed to the LLM, we would not expect any response failures (“don’t know”). We will measure how many times the LLMs failed to respond with an answer.
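
Counting failures is straightforward since the prompt template pins the fallback answer; a minimal sketch (assuming responses are collected as plain strings) is shown below.

# Sketch: count responses where the model fell back to the default "don't know"
def count_response_failures(responses: list[str]) -> int:
    return sum(1 for r in responses if r.strip().strip('."').lower() == "don't know")

sample = ["don't know", "Amazon SageMaker is a fully managed machine learning service."]
print(count_response_failures(sample))  # 1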

11. Response Outside Supplied Contextual Reference

Using RAG, the assumption is that the response should be based on the supplied contextual reference passed to the LLM. The LLM should not attempt to respond to the prompt based on its parametric memory, the broad corpus of textual data the LLM was trained on. Why is generating responses outside the supplied contextual reference potentially harmful? Imagine creating a chatbot to provide financial education based on the specific content you have developed. However, to questions like “How should I invest one thousand dollars for the quickest returns?” your chatbot starts giving poor, biased, or potentially illicit financial advice. We will determine if any models will attempt to respond to a non-contextual prompt using the model’s parametric memory instead of the supplied contextual reference (nonparametric memory).

Qualitative Human Evaluation Results

Upon cursory analysis, all responses were accurate to the contextual references provided with each prompt. However, some anomalies were observed in the responses:

  1. Incomplete Responses: The Flan-T5 series models’ responses were consistently shorter than those of the other models in the evaluation. In some cases, the terseness resulted in incomplete responses based on the prompt. For example, to the prompt, “What is an unconformity, and how many different types are there?” Google Flan-T5 XL responded, “Nonconformity A boundary between non-sedimentary rocks (below) and sedimentary rocks (above) Angular unconformity A boundary between two sequences.” There is no mention of how many different types of unconformities there are. Most other models returned complete responses to the two-part prompt.
  2. No Response: On five occasions, AI21 Labs Jurassic-2 Mid and Amazon Titan Text Large failed to respond to prompts when the correct response was clearly in the supplied contextual references.
  3. Repetition Problem: The Cohere Command Light model demonstrated the repetition problem, a repetition of subsequences discussed earlier, 22% of the time. For example, “SageMaker also provides a set of tools for analyzing and evaluating the performance of the trained models for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps…”
  4. Overly Verbose Responses: Infrequently, some models, such as Meta Llama-2 13B Chat HF and Anthropic Claude v1, provided overly verbose answers, which sometimes went well beyond the context of the original prompt. The additional information was sometimes incomplete, which lowered my opinion of the response’s quality. For example, after responding, the models concluded with: “Please note that the document does not mention the following roles, so I will answer “don’t know” for them:…[then listed roles it wasn’t asked about]”, “I hope this helps provide a detailed overview of what constitutes a public company. Please let me know if you have any other questions!”, “I don’t know the specific UI that human workers would use to provide the labels and annotations. The documents do not seem to provide that level of detail. Let me know if you have any other questions!”, “The document does not specify what “don’t know” refers to.”, or “Hello! I’m here to help answer your questions. Based on the provided documents, I can provide information on unconformities.”
  5. Including Chain-of-thought in Response: Infrequently, some models, such as Anthropic Claude v1.3, included their “chain-of-thought” in their responses. For example, “Instruction: What do geologists study? Answer “don’t know” if not present in the document. Solution: Geologists study the Earth and its processes. According to the documents:…”
  6. Provided Adequate Responses, but then Stated it Didn’t Know: Infrequently, some models, such as Meta Llama-2 13B Chat HF, OpenAI GPT-3.5 Turbo, OpenAI GPT-4, and Anthropic Claude Instant v1, provided adequate responses to the prompts, but then claimed they were unable to respond. For example, after delivering responses, the models concluded with “don’t know” or “…the specific details about Amazon SageMaker are not mentioned in the document excerpt. Therefore, I don’t have specific information about what Amazon SageMaker is based on the provided documents.”

Image by the author using Midjourney

Quantitative Evaluation Results

1. Median BLEU Scores

Recall that BLEU calculates the similarity between a machine translation output and a reference translation using n-gram precision. A higher score is considered better within a range of 0–100. Based on the median BLEU score of all nine prompt responses, OpenAI GPT-4 was in the top position, closely followed by OpenAI GPT-3.5 Turbo. They are followed by Meta Llama-2 13B Chat HF, Cohere Command, and Cohere Command Light. Median was used instead of mean due to some clear outliers in the results.

Median BLEU Scores for all Responses
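
For reference, the per-model aggregation can be sketched with pandas, assuming the per-response scores were collected into a DataFrame with model and bleu columns (the values below are illustrative, not actual results).

import pandas as pd

# Sketch: aggregate per-response BLEU scores by model, using the median to dampen outliers
scores = pd.DataFrame({
    "model": ["openai-gpt-4", "openai-gpt-4", "cohere-command-light", "cohere-command-light"],
    "bleu": [38.2, 41.7, 30.1, 2.4],  # illustrative values only
})

median_bleu = scores.groupby("model")["bleu"].median().sort_values(ascending=False)
print(median_bleu)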

2. Median TER Scores

Recall that TER is a metric for automatic evaluation of machine translation that calculates the number of edits required to change a machine translation output into one of the references. A lower score is considered better, meaning fewer edits are needed. Taking the median TER scores of all nine prompt responses, AI21 Labs Jurassic-2 Ultra, Cohere Command, OpenAI GPT-3.5 Turbo, Google Flan-T5 XXL FP16, and OpenAI GPT-4 all finished in the top five.

Median TER Scores for all Responses

Note the high score of the Cohere Command Light model. The model demonstrated a repetition problem, discussed earlier, in a few instances. For example (abridged): “SageMaker also provides a set of tools for analyzing and evaluating the performance of the trained models for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps…” This caused the response length and TER score to balloon.

3. Mean ChrF Scores

Recall that ChrF measures the similarity between a machine translation output and a reference translation using character n-grams, not word n-grams. A higher score is considered better within a range of 0–100. Taking the mean of the ChrF scores of all nine prompt responses, OpenAI GPT-4, OpenAI GPT-3.5 Turbo, AI21 Labs Jurassic-2 Ultra, Meta Llama-2 13B Chat HF, and Cohere Command were in the top five.

Mean ChrF Scores for all Responses

4. Mean ChrF++ Scores

Recall that both metrics use the F-score statistic for character n-gram matches, while ChrF++ adds word n-grams, which correlates more strongly with direct assessment. We see nearly identical results with both metrics based on the mean of the scores.

Mean ChrF++ Scores for all Responses

5. Mean BERTScore F1 Scores

Recall that BERTScore calculates the similarity between a machine translation output and a reference translation using sentence representation. The post’s BERTScore function returns Precision, Recall, and the F1 score. We will look at the mean of the F1 score. According to Wikipedia, the F1 score is the harmonic mean of precision and recall. It thus symmetrically represents both precision and recall in one metric. A higher score is considered better within a range of 0–1. Taking the mean F1 BERTScore for all nine prompt responses, AI21 Labs Jurassic-2 Ultra was at the top of the list, well ahead of Cohere Command in second position. They were followed by Google Flan-T5 XXL FP16, OpenAI GPT-3.5 Turbo, and OpenAI GPT-4.

Mean F1 BERTScore for all Responses

6. Mean ROUGE-1, ROUGE-2, and ROUGE-L F1?Scores

Recall that ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries, where n is the n-gram length. ROUGE-L calculates the Longest Common Subsequence (LCS) from two sequences, X and Y. Within a range of 0–1, a higher score is considered better. Taking the mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores of all nine prompt responses and sorting by ROUGE-1, we see AI21 Labs Jurassic-2 Ultra at the top of the list again, followed by OpenAI GPT-4, Google Flan-T5 XXL FP16, OpenAI GPT-3.5 Turbo, and Cohere Command.

Mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, sorted by ROUGE-1, for all Responses

Taking the mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores of all nine prompt responses, and this time sorting by ROUGE-L, we see the same five models, in a slightly different order: AI21 Labs Jurassic-2 Ultra, followed by Google Flan-T5 XXL FP16, OpenAI GPT-4, Cohere Command, and OpenAI GPT-3.5 Turbo.

Mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, sorted by ROUGE-L, for all Responses

7. Mean METEOR Scores

Recall that METEOR is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. Within the range of 0–1, a higher score is considered better. We see OpenAI GPT-4 and OpenAI GPT-3.5 Turbo at the top of the list, followed by Cohere Command, AI21 Labs Jurassic-2 Ultra, Meta Llama-2 13B Chat HF, and Anthropic Claude v2.

Mean METEOR Scores for all Responses

8. Mean Semantic Similarity (using Cosine Similarity)

Recall that semantic similarity evaluates how similar two texts are in terms of meaning. After dense vector embeddings are created, the cosine similarity is computed. Two proportional vectors have a cosine similarity of 1, and two orthogonal vectors have a similarity of 0. Within the range of 0–1, a higher score is considered better. We see AI21 Labs Jurassic-2 Ultra at the top of the list again, followed by Anthropic Claude v2, OpenAI GPT-4, Cohere Command, and OpenAI GPT-3.5 Turbo.

Mean Semantic Similarity (Cosine Similarity) Scores for all Responses

9. Mean Response Length

Taking the mean of SacreBLEU’s response length (hyp_len), which measures the number of words in all nine prompt responses, it is clear that some models are consistently terse, like Google Flan-T5 XL. Similarly, we can observe that other models are consistently more verbose, such as Anthropic Claude v1 and Cohere Command Light. Responses ranged in length from 33% to 300% of the Baseline Accepted Responses. Cohere Command, Anthropic Claude Instant v1, and OpenAI GPT-3.5 Turbo were closest to the mean of the accepted responses.

Mean Response Lengths for all Responses

The results do not account for occasions when the model did not respond, giving a default “don’t know” (hyp_len = 2). Failing to respond substantially decreased the mean response length for the AI21 Labs Jurassic-2 Mid and Amazon Titan Text Large models. Similarly, as noted earlier, the Cohere Command Light model demonstrated the repetition problem in at least two responses, substantially increasing its mean response length.

10. Sum of Response Failures

Out of a total of 117 responses (3 prompts x 3 Kendra indexes x 13 models), there were only five occasions, or 4.2% of the time, when a model failed to return a response (defaulting to “don’t know”). Only two of the 13 models tested, AI21 Labs Jurassic-2 Mid and Amazon Titan Text Large, failed to return responses during testing.

Sum of Response Failures

11. Sum of Responses Outside Supplied Contextual References

To determine if any of the selected models would attempt to respond to a prompt for which the response fell outside of the supplied contextual reference, the following questions were asked of each LLM, based on Index 1: Amazon SageMaker Documentation (official online docs):

  • What is the best recipe for chocolate cake?
  • Who won the FIFA World Cup?
  • What is the name of Han Solo’s ship?
  • How long does it take the earth to do one full rotation of the sun?
  • What woman discovered radium and polonium?
  • What is the name of Dorothy’s dog in The Wizard of Oz?
  • How should I invest one thousand dollars for the quickest returns?
  • What is the capital of France?
  • Is the moon made of cheese?

The prompt template used in this post is fairly standard, with most of the content coming from LangChain. It clearly states that answers should only come from the provided context (documents).

prompt_template = """
The following is a friendly conversation between a human and an AI.
The AI is talkative and provides lots of specific details from its
context. If the AI does not know the answer to a question, it
truthfully says it does not know.
{context}
Instruction: Based on the above documents, provide a detailed answer
for, {question}. 
Answer "don't know" if not present in the documents. 
Solution:"""        

While the nine questions are entirely arbitrary, the results, surprisingly, show that more than 50% of the models (7 of 13) provided answers outside the supplied contextual reference (~23% of the time). While two of the seven models exhibited this behavior only once, other models, like the Google Flan-T5 and Cohere Command model series, did it much more frequently. On a positive note, three families of models, Amazon, OpenAI, and Anthropic, did not respond when the response fell outside the contextual reference. Based on your use case, you will need to decide if you can tolerate the risk.

Sum of Responses Outside Supplied Contextual References

Conclusion

Again, this post’s approach to testing LLMs is specific to RAG-based question-answering chatbot architecture and the technologies, indexed content, prompt template, prompts, LLM parameters, and versions of software packages used. The test results are not meant to provide a definitive opinion on the quality of the LLMs tested. Your Generative AI use-case(s), the results you get, and your qualitative and quantitative requirements will differ.

That being said, based on my qualitative and quantitative evaluations, I would rank the following models as performing best within the cohort of 13 models tested for RAG-based question-answering chatbots:

* Caveat: this model may respond to non-contextual questions (see ’11. Sum of Responses Outside Supplied Contextual References’)


This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
