Perplexity and its friends - a quick tour of language model evaluation metrics
Julian Kaljuvee
Agentic AI / ML Engineering @Microsoft, Ex-quant (Goldman, JPMorgan, LSEG, UBS)│ Alternative Data and Gen AI
In Natural Language Processing (NLP), understanding and evaluating the performance of language models is essential for building robust and reliable applications. This guide explores several key concepts and metrics, each offering a different view of a model's behavior and output quality, which matters whether you are building a model from scratch or fine-tuning one for your particular purpose.
In particular, we cover perplexity and its friends - other model evaluation metrics - with short code examples and some intuition on how to interpret each of them.
Let's get started!
1. Log Probabilities
Intuition: Log probabilities give a more granular view of the model's predictions, indicating the likelihood of each token in the sequence. Negative log probabilities closer to zero indicate higher confidence.
Interpretation:
A log probability close to zero (less negative) means the model assigned a high probability to that token, i.e. it was confident; a strongly negative value means the token was unexpected. Comparing log probabilities across candidate tokens shows which continuation the model considers most likely.
Code Example
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def get_log_probs(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels also makes the model return its cross-entropy loss,
        # which comes in handy for the perplexity example below.
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Shape: (batch, sequence_length, vocab_size) - a full log-probability
    # distribution over the vocabulary at every position in the prompt.
    log_probs = outputs.logits.log_softmax(dim=-1)
    return log_probs

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    print(log_probs)

if __name__ == "__main__":
    main()
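The tensor returned above is a full log-probability distribution over the vocabulary at every position. To get the granular per-token view described in the intuition, you can pick out the log probability the model assigned to each token that actually appears in the prompt. A minimal sketch, reusing the model, tokenizer, prompt and get_log_probs from the example above:

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]
log_probs = get_log_probs(model, tokenizer, prompt)

# The distribution at position t predicts the token at position t + 1,
# so align the predictions with the shifted input ids.
token_log_probs = log_probs[0, :-1].gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
for token_id, lp in zip(input_ids[0, 1:].tolist(), token_log_probs.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {lp:.3f}")

Tokens printed with values close to zero are the ones the model found unsurprising given the preceding context.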
2. Perplexity
Intuition: Perplexity measures how well a language model predicts a sample. It is calculated as the exponential of the average negative log likelihood of a sequence, so lower perplexity indicates better performance. In NLP it is one of the most common ways to evaluate language models.
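Before wiring this up to a model, a quick toy calculation makes the formula concrete. Suppose a model assigned the (made-up) log probabilities below to the three tokens of a short sequence:

import math

# Toy, made-up per-token log probabilities for a 3-token sequence.
token_log_probs = [-1.2, -0.8, -2.3]
avg_nll = -sum(token_log_probs) / len(token_log_probs)  # average negative log likelihood, ~1.43
print(math.exp(avg_nll))  # perplexity, ~4.19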
Interpretation:
Lower perplexity means the model predicts the text better, and it is the standard quantity to track during training and when comparing models on the same evaluation corpus. Loosely, a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k tokens at each step.
Code Example
import math

def calculate_perplexity(log_probs, input_ids):
    # The distribution at position t predicts the token at position t + 1,
    # so gather the log probability assigned to each actual next token.
    target_log_probs = log_probs[0, :-1].gather(
        1, input_ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Perplexity is the exponential of the average negative log likelihood.
    return math.exp(-target_log_probs.mean().item())

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    log_probs = get_log_probs(model, tokenizer, prompt)
    perplexity = calculate_perplexity(log_probs, input_ids)
    print(f"Perplexity: {perplexity:.2f}")

if __name__ == "__main__":
    main()
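An equivalent and often simpler route is to use the loss that Hugging Face causal language models return when labels are supplied: it is already the average negative log likelihood per predicted token, so its exponential is the perplexity. A minimal sketch (the helper name is just illustrative); it should give essentially the same number as the gather-based version above:

import math
import torch

def perplexity_from_loss(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # outputs.loss is the mean cross-entropy over the predicted tokens.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())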
3. Ranked Predictions (Top-k or Top-n Predictions)
Intuition: Top-k predictions show the most likely next words or tokens according to the model's output. By looking at the top k predictions, you can get a sense of the model's confidence and diversity in its possible continuations.
Interpretation:
If most of the probability mass sits on one or two candidates, the model is confident about the continuation; a flatter spread across the top k indicates several plausible continuations and more inherent diversity in the text.
Code Example
def get_top_k_predictions(log_probs, tokenizer, k=5):
    # Use the distribution at the last prompt position: these are the
    # model's candidates for the token that follows the full prompt.
    next_token_log_probs = log_probs[0, -1]
    top_k = torch.topk(next_token_log_probs, k)
    predictions = [tokenizer.decode([idx]) for idx in top_k.indices.tolist()]
    probabilities = top_k.values.exp().tolist()
    return list(zip(predictions, probabilities))

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    top_k_predictions = get_top_k_predictions(log_probs, tokenizer, k=5)
    print(top_k_predictions)

if __name__ == "__main__":
    main()
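One way to quantify the confidence-versus-diversity reading is to check how much probability mass the top k candidates capture: a sum close to 1 means the model is effectively choosing among those few tokens, while a small sum means the distribution is spread over many alternatives. A short sketch reusing the function and variables above:

top_k_predictions = get_top_k_predictions(log_probs, tokenizer, k=5)
# Total probability the model places on its five favourite next tokens.
top_k_mass = sum(prob for _, prob in top_k_predictions)
print(f"Probability mass in the top 5 candidates: {top_k_mass:.2%}")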
4. Confidence Scores
Intuition: Confidence scores represent how certain the model is about its predictions. These scores are derived from the softmax probabilities of the model’s logits.
Interpretation:
High confidence scores indicate the model is certain about a position; persistently low scores point to ambiguity in the input or gaps in the training data, and can be used to flag outputs that deserve closer review.
Code Example
def get_confidence_scores(log_probs):
    # Convert log probabilities back to probabilities and take the highest
    # probability at each position as that position's confidence score.
    probs = log_probs.exp()
    confidence_scores = torch.max(probs, dim=-1).values
    return confidence_scores

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    confidence_scores = get_confidence_scores(log_probs)
    print(confidence_scores)

if __name__ == "__main__":
    main()
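In practice these scores are often used to flag positions where the model is unsure. A minimal sketch reusing the confidence_scores computed above, with an arbitrary, purely illustrative threshold of 0.3:

threshold = 0.3  # illustrative cut-off, not a standard value
# Positions where the model's best guess had probability below the threshold.
low_confidence_positions = (confidence_scores[0] < threshold).nonzero(as_tuple=True)[0]
print(f"Positions below {threshold}: {low_confidence_positions.tolist()}")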
5. Sampling Techniques (Temperature Sampling)
Intuition: Temperature sampling controls the randomness of the predictions. Lower temperatures make the model's output more deterministic (conservative), while higher temperatures introduce more randomness (creativity).
Interpretation:
Temperatures below 1 sharpen the distribution toward the most likely tokens, giving more predictable, focused output; temperatures above 1 flatten it, producing more varied but potentially less coherent text. A temperature of 1 leaves the model's original distribution unchanged.
Code Example
def sample_with_temperature(log_probs, temperature):
    # Scale the last position's distribution and renormalise: temperatures
    # below 1 sharpen it, temperatures above 1 flatten it.
    scaled = log_probs[0, -1] / temperature
    probs = torch.softmax(scaled, dim=-1)
    sampled_index = torch.multinomial(probs, num_samples=1)
    return sampled_index.item()

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    temperature = 1.2
    sampled_index = sample_with_temperature(log_probs, temperature)
    sampled_token = tokenizer.decode([sampled_index])
    print(sampled_token)

if __name__ == "__main__":
    main()
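For generating longer continuations, the same idea is exposed through Hugging Face's generate method, which applies temperature (together with the other sampling settings) at every decoding step. A brief sketch reusing the model, tokenizer and prompt above:

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True enables stochastic decoding; temperature rescales the logits before sampling.
generated = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.2,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))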
6. Alternative Metrics (BLEU, ROUGE, METEOR)
Intuition: These metrics are used to evaluate the quality of text generation, particularly for tasks like translation and summarization. They compare the generated text to reference texts to measure similarity.
Interpretation:
Scores range from 0 to 1, with higher values indicating closer overlap with the reference. BLEU emphasises n-gram precision, ROUGE emphasises recall, and METEOR adds stemming and synonym matching, so it is common to report more than one of them.
Code Example (BLEU):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu_score(reference, candidate):
    # sentence_bleu expects a list of tokenised references and a tokenised candidate.
    reference = [reference.split()]
    candidate = candidate.split()
    # Smoothing avoids a zero score when higher-order n-grams have no matches,
    # which is common for short sentences.
    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    return score

def main():
    reference_text = "The quick brown fox jumps over the lazy dog"
    generated_text = "The fast brown fox leaps over the sleepy dog"
    bleu_score = calculate_bleu_score(reference_text, generated_text)
    print(f"BLEU Score: {bleu_score:.3f}")

if __name__ == "__main__":
    main()
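ROUGE can be computed in a similar spirit. The sketch below assumes the rouge-score package (pip install rouge-score) is installed; ROUGE-1 compares unigram overlap and ROUGE-L the longest common subsequence:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The quick brown fox jumps over the lazy dog",   # reference
    "The fast brown fox leaps over the sleepy dog",  # candidate
)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f}, recall={score.recall:.2f}, f1={score.fmeasure:.2f}")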
Summary
To recap, these are the metrics covered above and how to read them.
1. Log Probabilities
Log probabilities provide a detailed look at the likelihood of each token in a sequence. By examining these values, developers can gauge the model's confidence in its predictions. High (less negative) log probabilities indicate greater confidence, while low (more negative) values suggest uncertainty. The ability to compare log probabilities helps in choosing the most probable token, thereby enhancing the model's reliability.
2. Perplexity
Perplexity is a fundamental metric for evaluating language models, reflecting how well the model predicts a sample. Lower perplexity values signify better predictive performance, making it a crucial indicator during model training and evaluation. By calculating perplexity, developers can identify areas where the model may need improvement.
3. Ranked Predictions (Top-k or Top-n Predictions)
Examining the top-k predictions provides insight into the model's confidence and the diversity of its possible continuations. A heavy concentration of probability on one or two predictions suggests strong confidence, while a broader spread indicates multiple plausible continuations. This view helps in understanding the range of predictions the model considers most likely.
4. Confidence Scores
Confidence scores derived from softmax probabilities reflect how certain the model is about its predictions. High confidence scores indicate strong certainty, while low scores reveal potential ambiguities or the need for more training data. These scores are essential for assessing the model's reliability in various scenarios.
5. Sampling Techniques (Temperature Sampling)
Temperature sampling introduces a mechanism to control the randomness of the model's output. By adjusting the temperature parameter, developers can balance between deterministic and creative outputs. Lower temperatures result in more predictable and focused text, whereas higher temperatures foster diversity and creativity, making this technique valuable for generating varied and engaging content.
6. Alternative Metrics (BLEU, ROUGE, METEOR)
Evaluating the quality of generated text, especially in tasks like translation and summarization, requires robust metrics. BLEU, ROUGE, and METEOR scores offer different perspectives on the similarity between generated and reference texts. These metrics help in fine-tuning models to produce more accurate and meaningful outputs.
When building or fine-tuning models, there are plenty of metrics to choose from!