Perplexity and its friends - a quick tour of language model evaluation metrics
Julian Kaljuvee
Agentic AI / ML Engineering @Microsoft, Ex-quant (Goldman, JPMorgan, LSEG, UBS)│ Alternative Data and Gen AI
In Natural Language Processing (NLP), understanding and evaluating the performance of language models is essential for building robust and reliable applications. This guide explores several key concepts and metrics, each offering a different view of a model's behavior and output quality, which matters whether you are building a model from scratch or fine-tuning one for your particular purpose.
In particular, we cover perplexity and its friends - other model evaluation metrics - with short code examples and some intuition on how to interpret each of them.
Let's get started!
1. Log Probabilities
Intuition: Log probabilities give a more granular view of the model's predictions, indicating the likelihood of each token in the sequence. Negative log probabilities closer to zero indicate higher confidence.
Interpretation:
A log probability close to zero (less negative) means the model assigned a high probability to that token, i.e. it was confident; a strongly negative value means the token was unexpected. Comparing log probabilities across candidate tokens shows which continuation the model considers most likely.
Code Example
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def get_log_probs(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels also makes the model return its cross-entropy loss,
        # which comes in handy for the perplexity example below.
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Shape: (batch, sequence_length, vocab_size) - a full log-probability
    # distribution over the vocabulary at every position in the prompt.
    log_probs = outputs.logits.log_softmax(dim=-1)
    return log_probs

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    print(log_probs)

if __name__ == "__main__":
    main()
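The tensor returned above is a full log-probability distribution over the vocabulary at every position. To get the granular per-token view described in the intuition, you can pick out the log probability the model assigned to each token that actually appears in the prompt. A minimal sketch, reusing the model, tokenizer, prompt and get_log_probs from the example above:

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]
log_probs = get_log_probs(model, tokenizer, prompt)

# The distribution at position t predicts the token at position t + 1,
# so align the predictions with the shifted input ids.
token_log_probs = log_probs[0, :-1].gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
for token_id, lp in zip(input_ids[0, 1:].tolist(), token_log_probs.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {lp:.3f}")

Tokens printed with values close to zero are the ones the model found unsurprising given the preceding context.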
2. Perplexity
Intuition: Perplexity measures how well a language model predicts a sample. It is calculated as the exponential of the average negative log likelihood of a sequence, so lower perplexity indicates better performance. In NLP it is one of the most common ways to evaluate language models.
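Before wiring this up to a model, a quick toy calculation makes the formula concrete. Suppose a model assigned the (made-up) log probabilities below to the three tokens of a short sequence:

import math

# Toy, made-up per-token log probabilities for a 3-token sequence.
token_log_probs = [-1.2, -0.8, -2.3]
avg_nll = -sum(token_log_probs) / len(token_log_probs)  # average negative log likelihood, ~1.43
print(math.exp(avg_nll))  # perplexity, ~4.19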
Interpretation:
Lower perplexity means the model predicts the text better, and it is the standard quantity to track during training and when comparing models on the same evaluation corpus. Loosely, a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k tokens at each step.
Code Example
import math

def calculate_perplexity(log_probs, input_ids):
    # The distribution at position t predicts the token at position t + 1,
    # so gather the log probability assigned to each actual next token.
    target_log_probs = log_probs[0, :-1].gather(
        1, input_ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Perplexity is the exponential of the average negative log likelihood.
    return math.exp(-target_log_probs.mean().item())

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    log_probs = get_log_probs(model, tokenizer, prompt)
    perplexity = calculate_perplexity(log_probs, input_ids)
    print(f"Perplexity: {perplexity:.2f}")

if __name__ == "__main__":
    main()
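An equivalent and often simpler route is to use the loss that Hugging Face causal language models return when labels are supplied: it is already the average negative log likelihood per predicted token, so its exponential is the perplexity. A minimal sketch (the helper name is just illustrative); it should give essentially the same number as the gather-based version above:

import math
import torch

def perplexity_from_loss(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # outputs.loss is the mean cross-entropy over the predicted tokens.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())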
3. Ranked Predictions (Top-k or Top-n Predictions)
Intuition: Top-k predictions show the most likely next words or tokens according to the model's output. By looking at the top k predictions, you can get a sense of the model's confidence and diversity in its possible continuations.
Interpretation:
If most of the probability mass sits on one or two candidates, the model is confident about the continuation; a flatter spread across the top k indicates several plausible continuations and more inherent diversity in the text.
Code Example
def get_top_k_predictions(log_probs, tokenizer, k=5):
    # Use the distribution at the last prompt position: these are the
    # model's candidates for the token that follows the full prompt.
    next_token_log_probs = log_probs[0, -1]
    top_k = torch.topk(next_token_log_probs, k)
    predictions = [tokenizer.decode([idx]) for idx in top_k.indices.tolist()]
    probabilities = top_k.values.exp().tolist()
    return list(zip(predictions, probabilities))

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    top_k_predictions = get_top_k_predictions(log_probs, tokenizer, k=5)
    print(top_k_predictions)

if __name__ == "__main__":
    main()
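One way to quantify the confidence-versus-diversity reading is to check how much probability mass the top k candidates capture: a sum close to 1 means the model is effectively choosing among those few tokens, while a small sum means the distribution is spread over many alternatives. A short sketch reusing the function and variables above:

top_k_predictions = get_top_k_predictions(log_probs, tokenizer, k=5)
# Total probability the model places on its five favourite next tokens.
top_k_mass = sum(prob for _, prob in top_k_predictions)
print(f"Probability mass in the top 5 candidates: {top_k_mass:.2%}")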
4. Confidence Scores
Intuition: Confidence scores represent how certain the model is about its predictions. These scores are derived from the softmax probabilities of the model’s logits.
Interpretation:
High confidence scores indicate the model is certain about a position; persistently low scores point to ambiguity in the input or gaps in the training data, and can be used to flag outputs that deserve closer review.
Code Example
def get_confidence_scores(log_probs):
    # Convert log probabilities back to probabilities and take the highest
    # probability at each position as that position's confidence score.
    probs = log_probs.exp()
    confidence_scores = torch.max(probs, dim=-1).values
    return confidence_scores

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    confidence_scores = get_confidence_scores(log_probs)
    print(confidence_scores)

if __name__ == "__main__":
    main()
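In practice these scores are often used to flag positions where the model is unsure. A minimal sketch reusing the confidence_scores computed above, with an arbitrary, purely illustrative threshold of 0.3:

threshold = 0.3  # illustrative cut-off, not a standard value
# Positions where the model's best guess had probability below the threshold.
low_confidence_positions = (confidence_scores[0] < threshold).nonzero(as_tuple=True)[0]
print(f"Positions below {threshold}: {low_confidence_positions.tolist()}")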
5. Sampling Techniques (Temperature Sampling)
Intuition: Temperature sampling controls the randomness of the predictions. Lower temperatures make the model's output more deterministic (conservative), while higher temperatures introduce more randomness (creativity).
Interpretation:
Temperatures below 1 sharpen the distribution toward the most likely tokens, giving more predictable, focused output; temperatures above 1 flatten it, producing more varied but potentially less coherent text. A temperature of 1 leaves the model's original distribution unchanged.
Code Example
def sample_with_temperature(log_probs, temperature):
    # Scale the last position's distribution and renormalise: temperatures
    # below 1 sharpen it, temperatures above 1 flatten it.
    scaled = log_probs[0, -1] / temperature
    probs = torch.softmax(scaled, dim=-1)
    sampled_index = torch.multinomial(probs, num_samples=1)
    return sampled_index.item()

def main():
    model_name = "gpt2"  # Replace with your model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "Once upon a time"
    log_probs = get_log_probs(model, tokenizer, prompt)
    temperature = 1.2
    sampled_index = sample_with_temperature(log_probs, temperature)
    sampled_token = tokenizer.decode([sampled_index])
    print(sampled_token)

if __name__ == "__main__":
    main()
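For generating longer continuations, the same idea is exposed through Hugging Face's generate method, which applies temperature (together with the other sampling settings) at every decoding step. A brief sketch reusing the model, tokenizer and prompt above:

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True enables stochastic decoding; temperature rescales the logits before sampling.
generated = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.2,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))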
6. Alternative Metrics (BLEU, ROUGE, METEOR)
Intuition: These metrics are used to evaluate the quality of text generation, particularly for tasks like translation and summarization. They compare the generated text to reference texts to measure similarity.
Interpretation:
Scores range from 0 to 1, with higher values indicating closer overlap with the reference. BLEU emphasises n-gram precision, ROUGE emphasises recall, and METEOR adds stemming and synonym matching, so it is common to report more than one of them.
Code Example (BLEU):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu_score(reference, candidate):
    # sentence_bleu expects a list of tokenised references and a tokenised candidate.
    reference = [reference.split()]
    candidate = candidate.split()
    # Smoothing avoids a zero score when higher-order n-grams have no matches,
    # which is common for short sentences.
    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    return score

def main():
    reference_text = "The quick brown fox jumps over the lazy dog"
    generated_text = "The fast brown fox leaps over the sleepy dog"
    bleu_score = calculate_bleu_score(reference_text, generated_text)
    print(f"BLEU Score: {bleu_score:.3f}")

if __name__ == "__main__":
    main()
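ROUGE can be computed in a similar spirit. The sketch below assumes the rouge-score package (pip install rouge-score) is installed; ROUGE-1 compares unigram overlap and ROUGE-L the longest common subsequence:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The quick brown fox jumps over the lazy dog",   # reference
    "The fast brown fox leaps over the sleepy dog",  # candidate
)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f}, recall={score.recall:.2f}, f1={score.fmeasure:.2f}")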
Summary
To recap, these are the metrics covered above and how to read them.
1. Log Probabilities
Log probabilities provide a detailed look at the likelihood of each token in a sequence. By examining these values, developers can gauge the model's confidence in its predictions. High (less negative) log probabilities indicate greater confidence, while low (more negative) values suggest uncertainty. The ability to compare log probabilities helps in choosing the most probable token, thereby enhancing the model's reliability.
2. Perplexity
Perplexity is a fundamental metric for evaluating language models, reflecting how well the model predicts a sample. Lower perplexity values signify better predictive performance, making it a crucial indicator during model training and evaluation. By calculating perplexity, developers can identify areas where the model may need improvement.
3. Ranked Predictions (Top-k or Top-n Predictions)
Examining the top-k predictions provides insight into the model's confidence and the diversity of its possible continuations. A heavy concentration of probability on one or two predictions suggests strong confidence, while a broader spread indicates multiple plausible continuations. This view helps in understanding the range of predictions the model considers most likely.
4. Confidence Scores
Confidence scores derived from softmax probabilities reflect how certain the model is about its predictions. High confidence scores indicate strong certainty, while low scores reveal potential ambiguities or the need for more training data. These scores are essential for assessing the model's reliability in various scenarios.
5. Sampling Techniques (Temperature Sampling)
Temperature sampling introduces a mechanism to control the randomness of the model's output. By adjusting the temperature parameter, developers can balance between deterministic and creative outputs. Lower temperatures result in more predictable and focused text, whereas higher temperatures foster diversity and creativity, making this technique valuable for generating varied and engaging content.
6. Alternative Metrics (BLEU, ROUGE, METEOR)
Evaluating the quality of generated text, especially in tasks like translation and summarization, requires robust metrics. BLEU, ROUGE, and METEOR scores offer different perspectives on the similarity between generated and reference texts. These metrics help in fine-tuning models to produce more accurate and meaningful outputs.
When building or fine-tuning models, there are plenty of metrics to choose from!