Evaluating Large Language Models (LLMs)

Evaluating an LLM is crucial to understanding its performance, capabilities, and limitations. Here are the main ways LLMs are assessed:

1. Accuracy: LLMs are tested for how well they generate correct and relevant responses:
- Task Performance: Accuracy on specific tasks like summarization, translation, or answering questions.
- Context Understanding: Ability to grasp the meaning of complex or ambiguous inputs.

2. Fluency: How natural and human-like are the responses? Fluency is evaluated by:
- Grammar and sentence structure.
- Coherence in longer conversations or texts.

3. Relevance: LLMs are scored on whether their outputs are on-topic and appropriate for the input prompt:
- Avoiding irrelevant or nonsensical replies.
- Staying aligned with user intent.

4. Creativity: For tasks like storytelling or content generation, LLMs are assessed on their ability to produce imaginative and engaging outputs.

5. Safety: An important aspect of evaluation is ensuring that LLMs:
- Avoid generating harmful, biased, or inappropriate content.
- Respond ethically to sensitive or controversial topics.

6. Speed and Scalability: Performance in real-world scenarios depends on how quickly an LLM can generate responses, especially when scaled to handle millions of users simultaneously.

7. Benchmarking with Standard Datasets: LLMs are evaluated against standard benchmarks (a scoring sketch follows below) like:
- GLUE (General Language Understanding Evaluation): Measures language comprehension tasks.
- SQuAD (Stanford Question Answering Dataset): Tests question-answering accuracy.
- BIG-bench: A benchmark for assessing diverse and complex tasks.

8. Human Feedback: Human evaluators rate responses for quality, helping to refine models and ensure outputs align with expectations.

Why Evaluation Matters

Careful evaluation ensures LLMs meet the needs of users while minimizing risks. It helps improve models over time, ensuring they remain useful, reliable, and ethical.
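As a rough illustration of benchmark-style scoring (point 7 above), here is a minimal sketch that computes SQuAD-style exact match and token-level F1 over a few made-up predictions. The predictions and references are purely illustrative; real evaluations run over the full dataset, typically via a library such as Hugging Face's evaluate.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1, the overlap-based score used in SQuAD-style evaluation."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model outputs vs. gold answers (illustrative only).
examples = [
    ("Paris", "Paris"),
    ("the city of Paris", "Paris"),
    ("1889", "in 1889"),
]

em = sum(exact_match(p, r) for p, r in examples) / len(examples)
f1 = sum(token_f1(p, r) for p, r in examples) / len(examples)
print(f"Exact match: {em:.2f}, F1: {f1:.2f}")
```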
-
Understanding Self-Attention

Self-attention is a crucial mechanism in large language models (LLMs) that allows the model to capture the relationships between words in a sentence. Here's a brief breakdown of how this mechanism works:

Input Embeddings (X)
Each word in the input sequence is converted into a vector representation. These vectors capture the meaning of each word based on its context. For example, the word "journey" might be represented as a vector like [0.4, 0.1, 0.8].

Query, Key, and Value Matrices (Q, K, V)
The input vectors are transformed into three distinct matrices:
- Query (Q): Represents the "focus" of each word.
- Key (K): Helps determine the relevance of other words.
- Value (V): Contains the information to be passed along.
These transformations enable each word to attend to other words with different levels of focus. For example, the query vector for "journey" might become [0.4, 1.4], reflecting its specific focus.

Attention Scores
Attention scores are computed by comparing the query of each word with the keys of all other words. This creates an attention matrix that indicates how much focus each word should have on every other word. Higher scores mean stronger attention, allowing the model to dynamically adjust word relationships.

Context Vectors (Z)
Each word's context vector is formed by weighting the value vectors according to the attention scores. This step ensures that each word gathers relevant information from other words in the sequence. For instance, the context vector for "journey" might be [0.3, 0.8], reflecting its relationships with other words in the sentence.

In LLMs, self-attention enables each token (word) to understand its context and dependencies across the entire sequence. This mechanism is fundamental for models like transformers to process and interpret long sequences, making it essential for understanding meaning, context, and long-range dependencies in language. A minimal numerical sketch of these steps is included below.
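To make the steps concrete, here is a minimal single-head self-attention sketch in NumPy. The embeddings and projection matrices are random stand-ins (not values from any trained model), and the scaling by sqrt(d_k) follows the standard scaled dot-product formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs: 4 tokens, each with a 3-dimensional embedding (X).
X = rng.normal(size=(4, 3))

# Learned projection matrices in a real model; random stand-ins here.
d_k = 2
W_q = rng.normal(size=(3, d_k))
W_k = rng.normal(size=(3, d_k))
W_v = rng.normal(size=(3, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values

# Attention scores: compare each query with every key, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)

# Softmax turns scores into attention weights that sum to 1 per row.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Context vectors (Z): attention-weighted combination of the values.
Z = weights @ V
print(Z.shape)  # (4, 2) -- one context vector per token
```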
-
Hallucinations in Large Language Models

Hallucinations are when a large language model generates content that is not grounded in reality. There are two types of hallucinations: in-context and extrinsic. In-context hallucinations are when the model's output is not consistent with the source content. Extrinsic hallucinations are when the model's output is not grounded in the pre-training dataset.

There are several methods for detecting hallucinations, including retrieval-augmented evaluation, sampling-based detection, and calibration of unknown knowledge. Retrieval-augmented evaluation involves retrieving relevant documents and then evaluating the model's output for consistency with those documents. Sampling-based detection involves generating multiple samples from the model and looking for inconsistencies between the samples (see the sketch below). Calibration of unknown knowledge involves training the model to identify when it is making claims about things it does not know for sure.

RAG (Retrieval-Augmented Generation) retrieves relevant documents and then generates text with those documents as extra context. RARR (Retroactively Adding References for Reasoning) is a framework that allows large language models to attribute their outputs to external evidence. FAVA (Fact-VAlidated Attentive Verification) is another method that retrieves relevant documents and then edits the model output to avoid hallucination errors.

What approaches have you tried to counter hallucinations? Let me know in the comments.
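As one way to picture sampling-based detection, the sketch below compares several sampled answers to the same prompt and flags low mutual agreement as a possible hallucination. The generate_answer parameter is a hypothetical stand-in for whatever LLM call you use, and the token-overlap similarity is only an illustration; real systems typically use stronger consistency checks (e.g., NLI-based scoring).

```python
from itertools import combinations
from typing import Callable, List

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two answers (a crude consistency proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(samples: List[str]) -> float:
    """Average pairwise similarity across sampled answers."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def looks_hallucinated(prompt: str,
                       generate_answer: Callable[[str], str],  # hypothetical LLM call
                       n_samples: int = 5,
                       threshold: float = 0.5) -> bool:
    """Sample the model several times; low agreement suggests a possible hallucination."""
    samples = [generate_answer(prompt) for _ in range(n_samples)]
    return consistency_score(samples) < threshold

# Usage with a stubbed generator (replace with a real, temperature-sampled LLM call):
fake_llm = lambda p: "The Eiffel Tower was completed in 1889."
print(looks_hallucinated("When was the Eiffel Tower completed?", fake_llm))
```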
-
I've read a bunch of #Coling2025 interpretability papers so you don't have to (I always wanted to say this!). Since I've chaired an Interpretability and Explainability oral session at #Coling2025, I thought I'd do a small recap.

- *Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection* - A method for constructing CoT prompts with examples that would be genuinely hard for the LLM to solve. Automatic (and most importantly, useful!) exemplar generation without manual intervention, pretty neat. https://lnkd.in/ddyYtwg3

- *Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models* - Very nice paper that shows that LLMs cluster in terms of how they deal internally with "linguistic minimal pairs" (i.e., sentence pairs where there's one specific perturbation). Shows that in Chinese, LLM similarities follow a bimodal distribution, with 2 clear clusters (mono vs bi/multilingual LLMs), and that linguistic similarity sort of correlates with semantic similarity, which means linguistic phenomena are context-dependent. https://lnkd.in/dKSX48jc

- *Part-Of-Speech Sensitivity of Routers in Mixture of Experts Models* - In MoEs, punctuation marks are often routed to a few highly specialized experts; this higher specialization tends to happen when the ratio of active vs available experts is smaller. https://lnkd.in/dtXRnyAr

- *Positive Text Reframing under Multi-strategy Optimization* - Looks at positive reframing (keeping the same intent, but with a positive spin, e.g. from "a bad movie" to "a movie that could be improved"). Proposes a combined loss for classification accuracy, content preservation and language modeling, then reranking for the optimal generation. Nice and easy! https://lnkd.in/d6iJDbYz

- *Explanation Regularisation through the Lens of Attributions* - A truly mysterious paper; shows that when a model is trained to focus on certain interpretable tokens for a given task (e.g. "the movie was **worth it**", in sentiment analysis), it often "hacks" the training, and actually doesn't focus on these tokens at all if you guide the attention in the last layer only. The model ends up optimizing the classification loss and just flips some attention weights in the last layer. In fact, if you do force the model to attend to those tokens globally (in all layers), it consistently does worse! https://lnkd.in/dMU-fkRz

- *Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models* - They find "language-specific neurons" which, if deactivated, hinder the LLM's performance not only in that language but in others, sometimes not necessarily typologically close. It also seems to be the case that with more training OR bigger models, the tendency is to represent semantic information similarly and not in language-specific regions, so a sort of lingua franca seems to emerge as we scale up. https://lnkd.in/dQFMgDqG
-
Exploring LLM Evaluation Methods: Finding the Right Fit

When it comes to evaluating Large Language Models (LLMs), there's no one-size-fits-all approach. Each method has unique strengths but also trade-offs, especially across dimensions like cost, scalability, versatility, and reliability. Here's a breakdown of some popular LLM evaluation methods and what they bring to the table:

- LLM-Assisted Evaluation: Leveraging other language models to evaluate outputs is a novel technique that allows flexible scoring across various dimensions. While promising, it needs careful calibration for reliability. This method is still evolving but holds potential for flexible, scalable assessments.

- Human Evaluation: Humans bring nuance, context, and adaptability, offering insights that automated methods often miss. However, this approach is resource-intensive and difficult to scale. Human evaluators are invaluable for complex or nuanced tasks, even with the high cost.

- Metrics-Based Evaluation: Traditional metrics, like BLEU for translation or ROUGE for summarization, offer structured and consistent insights. While metrics are reliable for specific tasks, they may miss more subjective qualities like tone or creativity. They work best for measurable, well-defined outputs.

- Match-Based Evaluation: Using regular expressions or schema validation, match-based evaluations provide precise feedback on specific criteria. This is useful for narrow applications but lacks versatility for broader tasks. Match-based methods are best for straightforward, rules-based checks (see the sketch after this list).

- Benchmarking: Benchmarks are industry standards that allow us to compare models on specific properties, setting a baseline for performance. However, benchmarks only measure what they're designed for, potentially overlooking unique strengths. They're reliable but not always comprehensive.

Each method shines in different scenarios, but no single one covers all the bases. Balancing cost, scalability, versatility, and reliability is key to finding the right mix for evaluating any LLM application. Selecting the right evaluation method is a journey that's shaping the future of GenAI!
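As an illustration of the match-based approach, here is a minimal sketch that checks an LLM output against a regular expression and a simple schema-style structure check. The expected fields and patterns are made up for the example.

```python
import json
import re

def matches_pattern(output: str, pattern: str) -> bool:
    """Regex check, e.g. that the answer contains a year in the 1800s."""
    return re.search(pattern, output) is not None

def matches_schema(output: str, required_fields: dict) -> bool:
    """Check that the output is JSON containing the required fields with the right types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in required_fields.items())

# Hypothetical model output for a structured extraction task.
output = '{"name": "Ada Lovelace", "birth_year": 1815}'

print(matches_pattern(output, r"\b18\d{2}\b"))                    # True
print(matches_schema(output, {"name": str, "birth_year": int}))   # True
```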
-
Some Tips on LLM Evaluation Metrics

Evaluating the outputs of Large Language Models (LLMs) is crucial for developing robust LLM applications. This process is complex and essential for fine-tuning a model's accuracy or enhancing a Retrieval-Augmented Generation (RAG) system's contextual relevance. Understanding and selecting the appropriate evaluation metrics is key to building a reliable LLM evaluation pipeline.

LLM evaluation metrics score an LLM system's output based on criteria important to the user, helping to quantify performance. Common metrics include:

- Answer relevancy: Assesses if the output addresses the input effectively and concisely.
- Correctness: Checks if the output is factually accurate based on a ground truth.
- Hallucination: Determines if the output contains fake or fabricated information.
- Contextual relevancy: Evaluates if a retriever in a RAG-based system provides the most relevant information for context.
- Responsible metrics: Includes bias and toxicity metrics to ensure the output is free from harmful or offensive content.
- Task-specific metrics: Custom metrics tailored to specific use cases, such as summarization, may have unique criteria.

While generic metrics are necessary, they often aren't sufficient for specific use cases, making custom task-specific metrics crucial for a production-ready LLM evaluation pipeline.

Characteristics of Great Evaluation Metrics:

- Quantitative: Metrics should compute a score, enabling the setting of a minimum passing threshold and tracking score changes over time (see the sketch below).
- Reliable: Metrics should be consistent, as LLM outputs can be unpredictable. LLM-Evals, such as G-Eval, are often more accurate than traditional methods but can be inconsistent.
- Accurate: Metrics should accurately represent the LLM application's performance, aligning closely with human expectations.

By focusing on these principles, you can develop an effective LLM evaluation pipeline that ensures your models are reliable and accurate, meeting the necessary standards before deployment. We are diving into these topics in Week 4 of the LLM Zoomcamp by DataTalksClub.
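A minimal sketch of the "quantitative with a passing threshold" idea: a tiny metric wrapper that computes a score, records it over time, and reports pass/fail against a minimum threshold. The keyword-coverage scorer is a made-up placeholder; in practice you would plug in an actual metric (ROUGE, G-Eval, a custom task-specific scorer, etc.).

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ThresholdedMetric:
    """Wraps any scoring function with a minimum passing threshold and a score history."""
    name: str
    score_fn: Callable[[str, str], float]   # (output, reference) -> score in [0, 1]
    threshold: float
    history: List[float] = field(default_factory=list)

    def evaluate(self, output: str, reference: str) -> bool:
        score = self.score_fn(output, reference)
        self.history.append(score)           # track score changes over time
        return score >= self.threshold

# Placeholder scorer: fraction of reference keywords present in the output.
def keyword_coverage(output: str, reference: str) -> float:
    keywords = set(reference.lower().split())
    return sum(k in output.lower() for k in keywords) / len(keywords)

metric = ThresholdedMetric("keyword_coverage", keyword_coverage, threshold=0.7)
print(metric.evaluate("The cat sat on a warm mat", "cat warm mat"))  # True (score 1.0)
print(metric.history)
```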
-
How to Measure the Brilliance of Large Language Models (LLMs)

Why Do We Need Evaluation Metrics for LLMs?
Imagine you're teaching someone to write essays. How do you know if they're improving? You check for things like grammar, coherence, and relevance. Similarly, to judge the quality of an LLM, we need to evaluate how well it generates, understands, and interacts with text. That's where evaluation metrics come in.

Key Metrics to Evaluate LLMs

1. Perplexity
- Measures how "surprised" the model is by the text it encounters.
- Lower perplexity means the model understands the language better.
Example:
Sentence: "The cat sat on the mat." Perplexity score: low (common sentence).
Sentence: "Quantum cat defenestrated the mat!" Perplexity score: high (rare sentence).
Tool: Libraries like Hugging Face compute perplexity scores.

2. BLEU (Bilingual Evaluation Understudy)
- Measures how closely the model's output matches a reference text.
- Commonly used for tasks like translation and summarization.
Example:
Reference: "The quick brown fox jumps over the lazy dog."
Model output: "The fast brown fox leaps over the sleepy dog."
BLEU score: high (close match).
How it works: Compares overlapping words or phrases.

3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Evaluates how much of the key information in the reference text is captured by the model.
- Used in summarization tasks.
Example:
Reference: "The cat sat on the mat because it was warm."
Summary: "The cat sat on a warm mat."
ROUGE score: high (captures the main idea).

4. F1 Score
- Measures the balance between precision (relevance of results) and recall (completeness of results).
- Commonly used for classification tasks.
Example:
Precision: How many of the model's answers are correct?
Recall: How many correct answers did the model miss?
Perfect F1 score: 1.0 (the model gets everything right and misses nothing).

5. Human Evaluation
- Sometimes, no metric beats human judgement.
- Human evaluators check for:
Fluency: Does the output read naturally?
Coherence: Is it logical and on-topic?
Usefulness: Does it answer the query or fulfil the task?
Example: ChatGPT responses are often fine-tuned based on feedback from human evaluators.

How These Metrics Are Used in Real Life
- #Chatbots and virtual assistants evaluate how helpful, clear, and relevant responses are using F1 score and human feedback.
- #Translation tools: Google Translate and DeepL use BLEU to assess translation quality.
- #Summarization tools: ROUGE scores help tools like Hugging Face evaluate the quality of summaries.

Evaluation metrics are the tools we use to judge how "intelligent" an AI model is. Whether it's perplexity for understanding, BLEU for matching, or human evaluation for quality, these metrics ensure LLMs meet our expectations. A small sketch of how perplexity and BLEU-style scores can be computed is included below.

What metrics do you think best capture the intelligence of an AI model? And how can we improve the way we evaluate AI?
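To make two of these metrics concrete, here is a small self-contained sketch: perplexity computed from made-up token probabilities, and a BLEU-1-style unigram precision computed by hand. Real evaluations use actual model log-probabilities and full BLEU (n-gram precisions plus a brevity penalty, e.g. via nltk or sacrebleu); the numbers below are illustrative only.

```python
import math
from collections import Counter

# Perplexity: exp of the average negative log-probability the model assigns to
# each token. These per-token probabilities are made up for illustration.
token_probs_common = [0.20, 0.35, 0.30, 0.25, 0.40, 0.30]   # "The cat sat on the mat."
token_probs_rare   = [0.02, 0.01, 0.005, 0.03, 0.01, 0.02]  # "Quantum cat defenestrated the mat!"

def perplexity(probs):
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(f"common sentence perplexity: {perplexity(token_probs_common):.1f}")  # low
print(f"rare sentence perplexity:   {perplexity(token_probs_rare):.1f}")    # high

# BLEU-1-style unigram precision: fraction of candidate words found in the reference.
reference = "the quick brown fox jumps over the lazy dog".split()
candidate = "the fast brown fox leaps over the sleepy dog".split()

overlap = Counter(candidate) & Counter(reference)   # clipped word counts
precision = sum(overlap.values()) / len(candidate)
print(f"unigram precision: {precision:.2f}")        # high-ish: close but not exact match
```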
-
Can Language Models Ever Really Be Controlled?

LLMs are impressive. They excel at open-ended generation and complex reasoning across diverse language tasks. But one core challenge remains: reliably controlling their output to meet specific logical criteria.

This is no trivial matter. LLMs are being deployed in the real world, from writing aids to chatbots to decision-support systems. Fine-grained, dependable control over what they generate is essential for safe and responsible use. An LLM offering medical advice cannot contradict doctors or current medical knowledge. A legal writing assistant must rigorously follow provided templates and avoid statements with unintended legal consequences.

A recent paper by Zhang et al. [1] presents a promising approach called Ctrl-G. By innovatively merging the power of LLMs with techniques from tractable probabilistic modeling, Ctrl-G enables reliable, flexible control of LLM outputs via logical constraints. This could pave the way for LLMs that we can finally trust to stay on track.

Is Ctrl-G the key to taming large language models? The results are promising. Ctrl-G combines LLMs' linguistic knowledge with the reasoning power of discrete constraint automata (a toy automaton sketch follows below). The result? A principled, general-purpose framework for fine-grained control over generated language. The authors show this approach already bears fruit. It enhances performance on language benchmarks. Aids reasoning in knowledge-intensive domains. And empowers end-users with semantic guardrails.

But challenges remain. Complex logical constraints with many clauses or conditions can make Ctrl-G's automata representations excessively large, impacting run-time efficiency. More fundamentally, not all desirable properties of generated language reduce to DFA-expressible logical constraints. We may want LLMs to adhere to abstract properties like truthfulness, safety, social norms, or ethics. Extending Ctrl-G to handle such complex, fuzzy constraints is an important future direction.

https://lnkd.in/efSfygkw
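To give a flavor of what a DFA-expressible constraint looks like (an illustration of the general idea, not Ctrl-G's actual machinery), the sketch below defines a tiny word-level automaton that accepts only outputs containing the phrase "consult a doctor" and uses it to filter candidate generations. The phrase and candidates are made up for the example.

```python
# A tiny word-level DFA that accepts any token sequence containing the phrase
# "consult a doctor". States 0..2 track how much of the phrase has been seen;
# state 3 is accepting.
PHRASE = ["consult", "a", "doctor"]

def dfa_accepts(tokens):
    state = 0
    for tok in tokens:
        if state < len(PHRASE) and tok == PHRASE[state]:
            state += 1                                  # advance along the phrase
        elif state < len(PHRASE):
            state = 1 if tok == PHRASE[0] else 0        # restart (or re-enter at "consult")
    return state == len(PHRASE)

# Hypothetical candidate generations; only constraint-satisfying ones are kept.
candidates = [
    "for persistent symptoms please consult a doctor promptly",
    "this usually resolves on its own within a week",
]
for text in candidates:
    print(dfa_accepts(text.lower().split()), "-", text)
```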
-
A relatively older study systematically analyzed how pre-training sequence composition strategies influence language models. Most frameworks concatenate multiple documents into fixed-length sequences and use causal masking to predict each token based on the preceding context. However, how this strategy impacts a model's abilities was an open question, until Facebook revealed this was a key part of training Llama3.

The experiments compared causal masking models using different packing strategies with intra-document causal masking models, where each token is only conditioned on preceding tokens from the same document (a mask-construction sketch follows below).

Key findings:

1. Causal masking can include distracting information from irrelevant documents, negatively impacting performance. IntraDoc models that eliminate this distraction significantly improved performance.

2. Increasing the relatedness of documents in pre-training chunks, as with UniChunk and BM25Chunk, can reduce distractions and improve performance.

3. BM25Chunk improved learning accuracy, knowledge memorization, and context utilization over MixChunk without sacrificing efficiency.

4. Analysis revealed IntraDoc and BM25Chunk models were more robust to irrelevant contexts and better identified relevant information.

This work highlights the importance of pre-training sequence composition and introduces an efficient retrieval-based packing method to improve language models. Packing related documents and limiting conditioning to relevant context enables models to more effectively learn and utilize information during pre-training.
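As a concrete picture of intra-document causal masking, the sketch below builds an attention mask for a packed sequence of three documents: each position may attend only to earlier positions from the same document. The document lengths are made up for the example.

```python
import numpy as np

def intra_document_causal_mask(doc_lengths):
    """Boolean mask (True = may attend) that is causal *and* block-diagonal per document."""
    total = sum(doc_lengths)
    doc_id = np.repeat(np.arange(len(doc_lengths)), doc_lengths)   # document index of each token
    causal = np.tril(np.ones((total, total), dtype=bool))          # standard causal mask
    same_doc = doc_id[:, None] == doc_id[None, :]                  # block-diagonal document mask
    return causal & same_doc

# Three documents of length 3, 2 and 4 packed into one 9-token sequence.
mask = intra_document_causal_mask([3, 2, 4])
print(mask.astype(int))   # 1s only below the diagonal and within each document block
```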
-
How Do Large Language Models Acquire Factual Knowledge During Pretraining? (KAIST, June 2024) Paper: https://lnkd.in/g4EWMkHg Abstract: "Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus."