Quantitative Evaluation of LLM Responses with RAG-based Question-answering Chatbots
Gary Stafford
Principal Solutions Architect @AWS | Data Analytics and Generative AI Specialist | Experienced Technology Leader, Consultant, CTO, COO, President | 10x AWS Certified
Examining quantitative metrics, including sacreBLEU, TER, ChrF, ChrF++, BERTScore, ROUGE, METEOR, and Semantic Similarity, to evaluate and compare the quality of responses from more than a dozen popular Large Language Models (LLMs)
To quote LangChain’s recent post, Conversational Retrieval Agents, “LLMs only know what they are trained on. To combat this, a style of generation known as “retrieval augmented generation” has emerged. In this technique, documents are retrieved and then inserted into the prompt, and the language model is instructed to only respond based on those documents. This helps both in giving the language model additional context as well as in keeping it grounded.” Similarly, to paraphrase Meta’s original September 2020 post, Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models, “Retrieval Augmented Generation (RAG) has two sources of knowledge: the knowledge that seq2seq models store in their parameters (parametric memory) and the knowledge stored in the corpus from which RAG retrieves passages (nonparametric memory). Further, these two sources complement each other. We [Meta] found that RAG uses its nonparametric memory to “cue” the seq2seq model into generating correct responses, essentially combining the flexibility of the “closed-book” or parametric-only approach with the performance of “open-book” or retrieval-based methods.”
When building RAG-based Generative AI applications, such as conversational and question-answering chatbots, an endlessly expanding choice of LLMs (aka Foundation Models) is available. With so many LLM options, how do we qualitatively and quantitatively evaluate and compare the responses from multiple LLMs to a single user prompt consistently and repeatedly? This blog post will explore several quantitative metrics to compare and contrast the responses (answers) from over a dozen currently popular LLMs to individual user prompts (questions) when using RAG-based architectures.
Since this post is focused on RAG-based generative responses, the responses must be not only accurate but also faithful to the nonparametric memory (the supplied contextual reference). Contextual reference increases LLM response accuracy and helps reduce hallucinations. For this blog, the contextual reference will be collections of web-based documents indexed using Amazon Kendra. Responses to user prompts should be based on the corpus of knowledge (relevant documents) contained in the Kendra indexes. The ability to understand, extract, and synthesize the relevant documents’ contents and formulate a coherent response is the responsibility of the LLM and its parametric memory.
Although there are many similarities in the types of Generative AI applications organizations are currently building, each organization’s contextual reference (intellectual property, organizational knowledge, corpus of specific content) and desired communication style as personified in the generative AI application (tone, tenor, pace, mood, voice, syntax, diction, and length of response) are unique. This makes globally evaluating responses beyond simple correctness and completeness difficult. Ultimately, each organization must decide whether the quality of the responses it gets from an LLM meets its requirements.
Large Language Models
The following 13 LLMs were examined as part of the post, in alphabetical order: AI21 Labs Jurassic-2 Mid, AI21 Labs Jurassic-2 Ultra, Amazon Titan Text Large, Anthropic Claude Instant v1, Anthropic Claude v1, Anthropic Claude v2, Cohere Command, Cohere Command Light, Google Flan-T5 XL, Google Flan-T5 XXL FP16, Meta Llama-2 13B Chat HF, OpenAI GPT-3.5 Turbo, and OpenAI GPT-4.
Testing Approach
Disclaimer: This post’s approach to testing LLMs is specific to RAG-based question-answering chatbots and the technologies, indexed content, prompt template, prompts, LLM parameters, and versions of software packages used. The test results are not meant to provide a definitive opinion on the quality of the LLMs tested. Your use cases, results, and qualitative and quantitative requirements will differ.
Preview of Automated LLM Quantitative Evaluation Test Harness (No Audio)
Variability
Many variables impact the response of an LLM, including the choice of model, the model parameters (e.g., temperature and maximum response tokens), the prompt, the prompt template, and the supplied contextual reference of relevant documents.
This post will focus on how the choice of model impacts response quality. The prompt, prompt template, and supplied contextual reference of relevant documents passed to the models from Amazon Kendra will be identical for each prompt.
Model Parameters
Available parameters and parameter scales vary across models. All model temperatures will be set to approximately zero. Since not all models accept zero (0) as a temperature value, all model temperatures will be set to 0.0000000001 (1e-10, or 10^-10). A near-zero temperature ensures consistent responses to the same prompt. We will somewhat arbitrarily cap the maximum number of tokens in the response at 1,024, since most accepted responses are well within this limit. The tests will rely on each model’s default values for the remaining model parameters.
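As a minimal sketch of how these two parameters might be set, shown here with LangChain’s ChatOpenAI wrapper (parameter names and scales differ across providers, and this is not the post’s actual test-harness code):
# Minimal sketch (illustrative only): near-zero temperature and a 1,024-token cap.
# Parameter names and ranges differ across model providers.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",  # one of the thirteen models under test
    temperature=1e-10,           # effectively zero; some providers reject exactly 0
    max_tokens=1024,             # cap on the generated response length
)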
Kendra Indexes and Prompts
To compare and contrast the responses from multiple LLMs and explore different metrics, the post uses three different prompts for three different Amazon Kendra indexes for a total of nine different prompts or 117 responses (9 responses/model x 13 models). Each response will be compared to what was determined, subjectively, to be the correct response(s) for the prompt. All three indexes were built using a web crawler and a pre-defined sitemap to precisely control what content was indexed and made available to the LLM.
Index 1: Amazon SageMaker Documentation (official online docs)
Index 2: Physical Geology — 2nd Edition (PDF-based book from opentextbc.ca)
Index 3: Investor.gov Investment Guidance (sponsored by SEC.gov)
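A rough sketch of the evaluation loop is shown below. The prompts, model_ids, and get_llm_response() helper are hypothetical placeholders; the metric functions it calls are defined in the code snippets later in this post.
# Rough sketch of the evaluation loop (9 prompts x 13 models = 117 responses).
# `prompts`, `model_ids`, and get_llm_response() are hypothetical placeholders.
for prompt, accepted_answer in prompts:      # nine (prompt, accepted answer) pairs
    for model_id in model_ids:               # thirteen models under test
        # RAG chain: retrieve relevant documents from Kendra, then ask the LLM
        response = get_llm_response(model_id, prompt)
        calculate_bleu_score(model_id, response, accepted_answer)  # also TER, ChrF, ChrF++
        calculate_bert_score(model_id, response, accepted_answer)
        calculate_rouge_score(model_id, response, accepted_answer)
        calculate_meteor_score(model_id, response, accepted_answer)
        compute_cosine_similarity(model_id, response, accepted_answer)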
Quantitative Metrics
There are many different quantitative metrics we could use to evaluate and compare the responses from multiple LLMs. Ultimately, none of the chosen metrics will likely mirror your own qualitative opinion. The goal is to find correlations between certain quantitative metrics and your qualitative opinions. The quantitative metrics used in this post are: BLEU (via sacreBLEU), TER, ChrF, ChrF++, BERTScore, ROUGE, METEOR, Semantic Textual Similarity, response length, response failures (“don’t know”), and responses outside the supplied contextual reference.
Final results can be found toward the end of this post, see: Quantitative Evaluation Results. Quantitative metrics not measured in this post but equally important when choosing an LLM are model bias, maximum tokens, performance (latency), cost, and licensing.
1. BiLingual Evaluation Understudy (BLEU)
According to Wikipedia, BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” — this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgments of quality and remains one of the most popular automated and inexpensive metrics. According to Machine Translate, BLEU calculates the similarity between a machine translation output and a reference translation using n-gram precision.
SacreBLEU
According to the project’s GitHub repository, SacreBLEU (Post, 2018) provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization.
# Snippet of SacreBLEU implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24
# Reference: https://github.com/mjpost/sacrebleu#using-sacrebleu-from-python

from sacrebleu.metrics import BLEU, CHRF, TER

models = []
hyp_len = []
bleu_metrics = []
ter_metrics = []
chrf_metrics = []
chrfpp_metrics = []


def calculate_bleu_score(model, response, ref):
    # sacrebleu expects a list of hypotheses and a list of reference streams
    references = [[ref]]
    hypothesis = [response]
    models.append(model)

    # BLEU (n-gram precision); token [2] of the printed result is the BLEU score,
    # token [12] is hyp_len (the hypothesis word count)
    bleu = BLEU()
    result = bleu.corpus_score(hypothesis, references)
    hyp_len.append(str(result).split(" ")[12])
    bleu_metrics.append(str(result).split(" ")[2])
    print(f"Model: {model}\n{result}")

    # TER (edit distance from the hypothesis to the reference)
    ter = TER()
    result = ter.corpus_score(hypothesis, references)
    ter_metrics.append(str(result).split(" ")[2])
    print(result)

    # ChrF (character n-gram F-score)
    chrf = CHRF()
    result = chrf.corpus_score(hypothesis, references)
    chrf_metrics.append(str(result).split(" ")[2])
    print(result)

    # ChrF++ (adds word unigrams and bigrams)
    chrfpp = CHRF(word_order=2)
    result = chrfpp.corpus_score(hypothesis, references)
    chrfpp_metrics.append(str(result).split(" ")[2])
    print(f"{result}\n")
2. Translation Error Rate (TER)
According to Machine Translate, TER (Translation Error Rate) is a metric for automatic evaluation of machine translation that calculates the number of edits required to change a machine translation output into one of the references. We will use SacreBLEU’s TER implementation.
3. CHaRacter-level F-score (ChrF)
According to Machine Translate, ChrF (CHaRacter-level F-score) is a metric for machine translation evaluation that calculates the similarity between a machine translation output and a reference translation using character n-grams, not word n-grams. Metrics based on word n-grams are especially problematic for high-morphology languages (we are sticking with English, which has a less complex morphology). We will use SacreBLEU’s ChrF implementation.
4. ChrF++
According to HuggingFace, ChrF and ChrF++ are two MT evaluation metrics. They both use the F-score statistic for character n-gram matches, and ChrF++ adds word n-grams, which correlates more strongly with direct assessment. We will use SacreBLEU’s ChrF++ implementation.
5. BERTScore
According to Machine Translate, BERTScore is an evaluation metric using BERT (Bidirectional Encoder Representations from Transformers) sentence representations. BERTScore is a metric for automatic evaluation of machine translation that calculates the similarity between a machine translation output and a reference translation using sentence representation. BERTScore was invented as an improvement on n-gram-based metrics like BLEU.
# Snippet of BERTScore implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24
# Reference: https://github.com/Tiiiger/bert_score#python-function

from bert_score import BERTScorer

bert_score_p_metrics = []
bert_score_r_metrics = []
bert_score_f1_metrics = []


def calculate_bert_score(model, response, ref):
    reference = [[ref]]
    hypothesis = [response]

    # rescale_with_baseline spreads the scores over a wider, more readable range
    scorer = BERTScorer(lang="en", rescale_with_baseline=True)
    P, R, F1 = scorer.score(hypothesis, reference)

    bert_score_p_metrics.append(P[0].item())
    bert_score_r_metrics.append(R[0].item())
    bert_score_f1_metrics.append(F1[0].item())

    print(f"Model: {model}\nPrecision: {P[0].item():.3f}\nRecall: {R[0].item():.3f}\nF1: {F1[0].item():.3f}\n")
6. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
According to the paper, ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004), ROUGE includes several automatic evaluation methods that measure the similarity between summaries. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. We will use two of these measures in the post: ROUGE-N, an n-gram recall between a candidate summary and a set of reference summaries (here, ROUGE-1, ROUGE-2, and ROUGE-3), and ROUGE-L, which is based on the Longest Common Subsequence (LCS) of the candidate and the reference.
# Snippet of ROUGE implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24
# Reference: https://github.com/pltrdy/rouge#as-a-library

from rouge import Rouge
import json

rouge_1_f_metrics = []
rouge_2_f_metrics = []
rouge_3_f_metrics = []
rouge_l_f_metrics = []


def calculate_rouge_score(model, response, ref):
    reference = ref
    hypothesis = response

    rouge = Rouge(metrics=["rouge-1", "rouge-2", "rouge-3", "rouge-l"])
    scores = rouge.get_scores(hypothesis, reference, avg=True)

    rouge_1_f_metrics.append(scores["rouge-1"]["f"])
    rouge_2_f_metrics.append(scores["rouge-2"]["f"])
    rouge_3_f_metrics.append(scores["rouge-3"]["f"])
    rouge_l_f_metrics.append(scores["rouge-l"]["f"])

    print(f"Model: {model}\nScores: {json.dumps(scores, indent=2)}\n")
7. Metric for Evaluation of Translation with Explicit ORdering (METEOR)
According to Wikipedia, METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine-translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce a good correlation with human judgment at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
# Snippet of METEOR implementation for testing RAG-based Q/A Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-27
# Reference: https://www.nltk.org/howto/meteor.html

import nltk
from nltk.translate import meteor
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

meteor_metrics = []


def calculate_meteor_score(model, answer, ref):
    reference = ref
    hypothesis = answer

    score = meteor(
        [word_tokenize(reference)],
        word_tokenize(hypothesis)
    )

    meteor_metrics.append(score)
    print(f"Model: {model}\nScore: {score:.3f}\n")
8. Semantic Textual Similarity
Semantic Textual Similarity is the task of evaluating how similar two texts are in terms of meaning. In this post, dense vector embeddings of 768 dimensions are created for the accepted response and each model’s response using one of SentenceTransformer’s pre-trained sentence-transformer models. Then, cosine similarity, a measure of similarity between two non-zero vectors defined in an inner product space, is computed from the pair of embeddings, again using SentenceTransformer. Cosine similarity lies in the interval [−1, 1]: two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and in practice, scores for these sentence embeddings fall between 0 and 1, where higher means more similar.
# Snippet of Semantic Similarity implementation for testing RAG-based Conversational Chatbot
# Author: Gary A. Stafford
# Date: 2023-08-24

from sentence_transformers import SentenceTransformer, util

similarity_metrics = []


def compute_cosine_similarity(model, response, ref):
    # choose a pre-trained SentenceTransformer model
    # (named st_model so it does not shadow the LLM name passed in as `model`)
    st_model = SentenceTransformer("all-mpnet-base-v2")

    # compute dense vector embeddings (768 dimensions)
    reference_embedding = st_model.encode(ref, convert_to_tensor=True)
    response_embedding = st_model.encode(response, convert_to_tensor=True)

    # compute cosine similarity between the two embeddings
    cosine_score = util.cos_sim(reference_embedding, response_embedding)

    similarity_metrics.append(cosine_score[0].item())
    print(f"Model: {model}\nCosine Similarity Score: {cosine_score[0].item():.3f}\n")
9. Response Length
Response length will vary significantly across models for the same prompt. The n-gram, word, or character length is not, by itself, a consistent indicator of the quality of a response. However, a terse or very long response might suggest issues with the response. A terse response may indicate that the model could not generate a response (e.g., “don’t know” — the default response for this post) or that the response was incomplete given the context of the prompt. A very long response may indicate that the model was unnecessarily verbose, or that the response demonstrated the repetition problem — a repetition of subsequences (arXiv:2012.14660). We will use SacreBLEU’s hypothesis length (hyp_len), a word count, to compare the length of each model’s response with the length of the accepted response.
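As a simple illustration (not the post’s code), comparing lengths amounts to expressing each model’s mean hyp_len as a percentage of the accepted response’s length:
# Illustrative only: response length as a percentage of the accepted (baseline) response.
baseline_words = 85      # hypothetical mean word count of the accepted responses
response_words = 112     # hypothetical mean word count for one model's responses
print(f"{100 * response_words / baseline_words:.0f}% of baseline")  # -> 132% of baseline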
10. No Response
Based on the prompt template used for the tests, if an LLM does not know the answer, it is instructed to return the response, “don’t know.” Given that the accepted responses for all nine prompts were clearly within the supplied contextual reference passed to the LLM, we would not expect any response failures (“don’t know”). We will measure how many times the LLMs failed to respond with an answer.
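A sketch of how the failure count might be tallied, assuming the collected responses are stored in a hypothetical per-model dictionary:
# Sketch, assuming a hypothetical dict of collected responses: {model_id: [nine strings]}.
response_failures = {
    model_id: sum(1 for r in responses if r.strip().lower().startswith("don't know"))
    for model_id, responses in all_responses.items()
}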
11. Response Outside Supplied Contextual Reference
Using RAG, the assumption is that the response should be based on the supplied contextual reference passed to the LLM. The LLM should not attempt to respond to the prompt based on its parametric memory, the broad corpus of textual data the LLM was trained on. Why is generating responses outside the supplied contextual reference potentially harmful? Imagine creating a chatbot to provide financial education based on the specific content you have developed. However, to questions like “How should I invest one thousand dollars for the quickest returns?” your chatbot starts giving poor, biased, or potentially illicit financial advice. We will determine if any models will attempt to respond to a non-contextual prompt using the model’s parametric memory instead of the supplied contextual reference (nonparametric memory).
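A sketch of this check, reusing the hypothetical get_llm_response() helper from the evaluation-loop sketch above: probe each model with a question whose answer is not in the index and flag any model that answers anyway instead of returning “don’t know.”
# Sketch: probe with a question whose answer is NOT in the Kendra index and flag
# any model that answers from parametric memory instead of returning "don't know".
probe = "How should I invest one thousand dollars for the quickest returns?"
leaky_models = [
    model_id for model_id in model_ids
    if not get_llm_response(model_id, probe).strip().lower().startswith("don't know")
]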
Qualitative Human Evaluation Results
Upon cursory analysis, all responses were accurate to the contextual references provided with each prompt. However, some anomalies were observed in the responses:
Quantitative Evaluation Results
1. Median BLEU Scores
Recall that BLEU calculates the similarity between a machine translation output and a reference translation using n-gram precision. A higher score is considered better within a range of 0–100. Based on the median BLEU score of all nine prompt responses, OpenAI GPT-4 was in the top position, closely followed by OpenAI GPT-3.5 Turbo. They are followed by Meta Llama-2 13B Chat HF, Cohere Command, and Cohere Command Light. Median was used instead of mean due to some clear outliers in the results.
2. Median TER Scores
Recall that TER is a metric for automatic evaluation of machine translation that calculates the number of edits required to change a machine translation output into one of the references. A lower score is considered better, meaning fewer edits are needed. Taking the median TER scores of all nine prompt responses, AI21 Labs Jurassic-2 Ultra, Cohere Command, OpenAI GPT-3.5 Turbo, Google Flan-T5 XXL FP16, and OpenAI GPT-4 all finished in the top five.
Note the high score of the Cohere Command Light model. The model demonstrated a repetition problem, discussed earlier, in a few instances. For example (abridged): “SageMaker also provides a set of tools for analyzing and evaluating the performance of the trained models for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps for MLOps…” This caused the response length and TER score to balloon.
3. Mean ChrF Scores
Recall that ChrF measures the similarity between a machine translation output and a reference translation using character n-grams, not word n-grams. A higher score is considered better within a range of 0–100. Taking the mean of the ChrF scores of all nine prompt responses, OpenAI GPT-4, OpenAI GPT-3.5 Turbo, AI21 Labs Jurassic-2 Ultra, Meta Llama-2 13B Chat HF, and Cohere Command were in the top five.
4. Mean ChrF++ Scores
Recall that both metrics use the F-score statistic for character n-gram matches, while ChrF++ adds word n-grams, which correlates more strongly with direct assessment. We see nearly identical results with both metrics based on the mean of the scores.
5. Mean BERTScore F1 Scores
Recall that BERTScore calculates the similarity between a machine translation output and a reference translation using sentence representation. The post’s BERTScore function returns Precision, Recall, and the F1 score. We will look at the mean of the F1 score. According to Wikipedia, the F1 score is the harmonic mean of precision and recall. It thus symmetrically represents both precision and recall in one metric. A higher score is considered better within a range of 0–1. Taking the mean F1 BERTScore for all nine prompt responses, AI21 Labs Jurassic-2 Ultra was at the top of the list, well ahead of Cohere Command in second position. They were followed by Google Flan-T5 XXL FP16, OpenAI GPT-3.5 Turbo, and OpenAI GPT-4.
6. Mean ROUGE-1, ROUGE-2, and ROUGE-L F1 Scores
Recall that ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries, where n is the n-gram length. ROUGE-L calculates the Longest Common Subsequence (LCS) from two sequences, X and Y. Within a range of 0–1, a higher score is considered better. Taking the mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores of all nine prompt responses and sorting by ROUGE-1, we see AI21 Labs Jurassic-2 Ultra at the top of the list again, followed by OpenAI GPT-4, Google Flan-T5 XXL FP16, OpenAI GPT-3.5 Turbo, and Cohere Command.
Taking the mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores of all nine prompt responses, and this time sorting by ROUGE-L, we see the same five models, in a slightly different order: AI21 Labs Jurassic-2 Ultra, followed by Google Flan-T5 XXL FP16, OpenAI GPT-4, Cohere Command, and OpenAI GPT-3.5 Turbo.
7. Mean METEOR Scores
Recall that METEOR is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. Within the range of 0–1, a higher score is considered better. We see OpenAI GPT-4 and OpenAI GPT-3.5 Turbo at the top of the list, followed by Cohere Command, AI21 Labs Jurassic-2 Ultra, Meta Llama-2 13B Chat HF, and Anthropic Claude v2.
8. Mean Semantic Similarity (using Cosine Similarity)
Recall that semantic similarity evaluates how similar two texts are in terms of meaning. After dense vector embeddings are created, the cosine similarity is computed. Two proportional vectors have a cosine similarity of 1, and two orthogonal vectors have a similarity of 0. Within the range of 0–1, a higher score is considered better. We see AI21 Labs Jurassic-2 Ultra at the top of the list again, followed by Anthropic Claude v2, OpenAI GPT-4, Cohere Command, and OpenAI GPT-3.5 Turbo.
9. Mean Response Length
Taking the mean of SacreBLEU’s response length (hyp_len), which measures the number of words in all nine prompt responses, it is clear that some models are consistently terse, like Google Flan-T5 XL. Similarly, we can observe that other models are consistently more verbose, such as Anthropic Claude v1 and Cohere Command Light. Responses ranged in length from 33% to 300% of the Baseline Accepted Responses. Cohere Command, Anthropic Claude Instant v1, and OpenAI GPT-3.5 Turbo were closest to the mean of the accepted responses.
The results do not account for occasions when the model did not respond, giving a default “don’t know” (hyp_len = 2). Failing to respond substantially decreased the mean response length for the AI21 Labs Jurassic-2 Mid and Amazon Titan Text Large models. Similarly, as noted earlier, the Cohere Command Light model demonstrated the repetition problem in at least two responses, substantially increasing its mean response length.
10. Sum of Response Failures
Out of a total of 117 responses (3 prompts x 3 Kendra indexes x 13 models), there were only five occasions, or 4.2% of the time, when a model failed to return a response (defaulting to “don’t know”). Only two of the 13 models tested, AI21 Labs Jurassic-2 Mid and Amazon Titan Text Large, failed to return responses during testing.
11. Sum of Responses Outside Supplied Contextual References
To determine if any of the selected models would attempt to respond to a prompt for which the response fell outside of the supplied contextual reference, the following questions were asked of each LLM, based on Index 1: Amazon SageMaker Documentation (official online docs):
The prompt template used in this post is fairly standard, with most of the content coming from LangChain. It clearly states that answers should only come from the provided context (documents).
prompt_template = """
The following is a friendly conversation between a human and an AI.
The AI is talkative and provides lots of specific details from its
context. If the AI does not know the answer to a question, it
truthfully says it does not know.
{context}
Instruction: Based on the above documents, provide a detailed answer
for, {question}.
Answer "don't know" if not present in the documents.
Solution:"""
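For reference, a minimal sketch of how such a template is typically wired up with LangChain; the exact chain construction in the post’s harness may differ:
# Minimal sketch (assumed wiring, not necessarily the post's exact code):
# the retrieved Kendra documents fill {context}; the user's question fills {question}.
from langchain.prompts import PromptTemplate

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)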
While nine questions is an admittedly small and arbitrary sample, surprisingly, the results show that more than half of the models (7 of 13) provided answers outside the supplied contextual reference (~23% of the time). While two of the seven models only exhibited this behavior once, other models, like the Google Flan-T5 and Cohere Command model series, did it much more frequently. On a positive note, three families of models did not respond when the answer fell outside the contextual reference: the Amazon, OpenAI, and Anthropic models all exhibited this quality. Based on your use case, you will need to decide whether you can tolerate the risk.
Conclusion
Again, this post’s approach to testing LLMs is specific to RAG-based question-answering chatbot architecture and the technologies, indexed content, prompt template, prompts, LLM parameters, and versions of software packages used. The test results are not meant to provide a definitive opinion on the quality of the LLMs tested. Your Generative AI use-case(s), the results you get, and your qualitative and quantitative requirements will differ.
That being said, based on my qualitative and quantitative evaluations, I would rank the following models as performing best within the cohort of 13 models tested for RAG-based question-answering chatbots:
* Caveat: this model may respond to non-contextual questions (see ‘11. Sum of Responses Outside Supplied Contextual References’)
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.