Expanding on 'Patterns for Building LLM-based Systems & Products'

Large language models (LLMs) are powerful tools for natural language processing and generation. They can perform a variety of tasks, such as answering questions, summarizing texts, writing essays, and more. However, building systems and products powered by LLMs is not a trivial task. It requires careful design, engineering, and evaluation to ensure high quality, reliability, and user satisfaction.

In a comprehensive post, Eugene Yan covered seven key patterns for building LLM-based systems and products. These patterns are:

  • Evaluations (Evals): To measure performance
  • Retrieval-Augmented Generation (RAG): To add recent, external knowledge
  • Fine-Tuning: To get better at specific tasks
  • Caching: To reduce latency and cost
  • Guardrails: To ensure output quality
  • Defensive UX: To anticipate and manage errors gracefully
  • Collecting User Feedback: To build a data flywheel

LLM patterns: From data to user, from defensive to offensive. Image source: https://eugeneyan.com/writing/llm-patterns/


In this article, we will dive deeper into a few of these patterns, highlighting how they can be applied when developing LLM-based solutions.

Evaluations (Evals)

Evaluations are a set of measurements used to assess a model’s performance on a task. They include benchmark data and metrics. Evaluations enable us to measure how well our system or product is doing and detect any regressions.

One of the challenges of evaluating LLMs is that they often produce open-ended outputs that are difficult to compare with a single reference or ground truth. For example, how do we measure the quality of a summary or a dialogue generated by an LLM? One possible solution is to use human evaluations, where we ask human annotators to rate the outputs based on some criteria. However, human evaluations can be noisy, subjective, expensive, and slow.

An alternative solution is to use GPT-4 as an automated evaluator. GPT-4, OpenAI's most capable LLM at the time of writing, has shown high correlation with human judgments on open-ended generation tasks like summarization and dialogue (Liu et al., 2023). Compared to human evaluations, GPT-4 provides faster, cheaper, and more consistent scoring, although it has biases of its own (for example, a tendency to favor the first option presented). With prompt engineering, we can frame the evaluation as a comparison task, asking GPT-4 to choose between two outputs based on some criteria. For example, we can ask GPT-4 to compare two summaries on informativeness, conciseness, fluency, or coherence. This reduces the noise and ambiguity of the evaluation task.
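
To make this concrete, here is a minimal sketch of such a pairwise comparison, assuming the official openai Python client; the article, candidate summaries, and exact prompt wording are placeholders rather than a prescribed rubric.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_PROMPT = """You are evaluating two summaries of the same article.

Article:
{article}

Summary A:
{summary_a}

Summary B:
{summary_b}

Which summary is more informative, concise, fluent, and coherent?
Answer with exactly "A" or "B"."""

def compare_summaries(article: str, summary_a: str, summary_b: str) -> str:
    """Ask GPT-4 to pick the better of two candidate summaries."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic scoring for repeatable evals
        messages=[{"role": "user", "content": EVAL_PROMPT.format(
            article=article, summary_a=summary_a, summary_b=summary_b)}],
    )
    return response.choices[0].message.content.strip()
```

A simple way to control for position bias is to run the comparison twice with the order of A and B swapped and only count consistent verdicts.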

Overall, GPT-4 is emerging as a cost-effective option for automated LLM evaluations. It can provide fast and consistent feedback for LLM developers and users.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that combines LLMs with document retrieval to generate outputs that incorporate recent and external knowledge. RAG works by first retrieving relevant documents from a large corpus based on the input query or context, then feeding the retrieved documents as additional context to the LLM for generation.

One of the challenges of RAG is that document retrieval can be noisy and incomplete. For example, keyword-based search may miss documents that do not contain the exact keywords but are semantically related to the query. On the other hand, embedding-based semantic search may retrieve documents that are too general or irrelevant to the query.

One possible solution is to use feature-extraction based search, which analyzes documents and extracts informative features to create a search index. These features can capture various aspects of the documents, such as semantics, topics, entities, sentiment, tone, style, etc. For example, we can use named entity recognition (NER) to extract entities like people, places, organizations, etc., from the documents. We can also use topic modeling to extract topics like sports, politics, entertainment, etc., from the documents. Then we can use these features as additional filters or boosters for document retrieval.

For LLM systems, we can extract features that are relevant to the downstream task or domain. For example, for a summarization task, we can extract features like length, novelty, diversity, etc., from the documents. For a medical domain, we can extract features like symptoms, diagnosis, treatment, etc., from the documents. By using feature extraction based search, we can enhance document retrieval beyond keywords and embeddings alone.
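
As a rough illustration, the sketch below tags documents with named entities using spaCy (assuming the en_core_web_sm model is installed) and uses entity overlap with the query as a retrieval filter; the two-document corpus is hypothetical, and in practice this filter would sit alongside keyword and embedding retrieval rather than replace them.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def extract_entities(text: str) -> set[str]:
    """Extract named entities (people, organizations, places, ...) as lowercase strings."""
    return {ent.text.lower() for ent in nlp(text).ents}

# Hypothetical corpus: index each document together with its entity tags.
corpus = [
    "Acme Corp reported record revenue in Q2, CEO Jane Doe said.",
    "The city council of Springfield approved a new transit plan.",
]
index = [{"text": doc, "entities": extract_entities(doc)} for doc in corpus]

def filtered_search(query: str) -> list[str]:
    """Keep only documents that share at least one entity with the query."""
    query_entities = extract_entities(query)
    return [d["text"] for d in index if d["entities"] & query_entities]

print(filtered_search("What did Acme Corp announce?"))
```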

Overall, feature extraction further strengthens hybrid retrieval by providing more diverse and relevant context to LLMs.

Fine-Tuning

Fine-tuning is a technique that adapts an LLM to a specific task or domain using task-specific demonstrations and data. Fine-tuning provides several benefits for LLM systems and products.

First, fine-tuning provides control over model behavior. LLMs are primarily trained on next word prediction using large amounts of text data from various sources. This means that they may not be well-suited for certain tasks or domains that require specific knowledge, skills, or styles. For example, an LLM may not be able to generate accurate medical reports or legal documents without fine-tuning on domain-specific data. Fine-tuning allows us to customize the LLM’s behavior to match the task or domain requirements.

Second, fine-tuning enables building differentiated products. LLMs are becoming more accessible and widely used by various developers and users. This means that there is more competition and less differentiation among LLM-based products. For example, there are many LLM-based chatbots that can answer general questions or have casual conversations. Fine-tuning allows us to create unique and specialized products that can cater to specific user needs or preferences. For example, we can fine-tune an LLM to create a chatbot that can provide personalized recommendations or advice based on user profiles or preferences.

Third, fine-tuning improves performance. LLMs are powerful but not perfect. They may still make errors or generate outputs that are not satisfactory for the task or domain. Fine-tuning allows us to improve the LLM’s performance by using task-specific demonstrations and data. For example, we can fine-tune an LLM to generate better summaries by using summaries written by human experts as demonstrations. We can also fine-tune an LLM to generate better dialogues by using dialogues collected from real users as data.
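
As a rough sketch, task-specific demonstrations are typically packaged as prompt-completion pairs. The snippet below writes hypothetical expert-written summaries to a JSONL file in a chat-message format similar to what common fine-tuning APIs accept; field names and formats vary by provider, so treat this as illustrative only.

```python
import json

# Hypothetical demonstrations: (source document, expert-written summary) pairs.
demonstrations = [
    ("Full text of an earnings report ...",
     "Revenue grew 12% year over year, driven by cloud services."),
    ("Full text of a clinical trial writeup ...",
     "The treatment showed a modest but significant effect versus placebo."),
]

with open("finetune_data.jsonl", "w") as f:
    for document, expert_summary in demonstrations:
        record = {
            "messages": [
                {"role": "system", "content": "Summarize the document in one or two sentences."},
                {"role": "user", "content": document},
                {"role": "assistant", "content": expert_summary},
            ]
        }
        f.write(json.dumps(record) + "\n")
```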

Overall, fine-tuning unlocks an LLM’s potential for narrow tasks and domains.

Caching

Caching is a technique that stores previously generated outputs in a memory for faster retrieval later. Caching provides several benefits for LLM systems and products.

First, caching reduces latency. LLM generation can be slow and expensive, especially for large models and complex tasks. This can affect user experience and satisfaction. Caching shifts latency from LLM generation (seconds) to cache lookup (milliseconds). This means that we can serve cached outputs faster than generating new outputs.

Second, caching reduces cost. LLM generation can be costly, especially for cloud-based models and services. This can affect profitability and scalability. Caching shifts cost from LLM generation (dollars) to cache storage (cents). This means that we can save money by serving cached outputs rather than generating new outputs.
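
A minimal sketch of this look-up-before-generate pattern is shown below, using an in-memory dictionary keyed by a hash of the normalized prompt and a placeholder generate() function standing in for the real LLM call; a production system would typically use a shared store such as Redis and might match on embeddings rather than exact strings.

```python
import hashlib

cache: dict[str, str] = {}  # in-memory cache; swap for Redis or similar in production

def cache_key(prompt: str) -> str:
    """Normalize and hash the prompt so trivially different strings share a key."""
    return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

def generate(prompt: str) -> str:
    """Placeholder for a real (slow, costly) LLM call."""
    return f"<llm output for: {prompt}>"

def cached_generate(prompt: str) -> str:
    key = cache_key(prompt)
    if key in cache:               # cache hit: milliseconds, no API cost
        return cache[key]
    output = generate(prompt)      # cache miss: pay the latency and cost once
    cache[key] = output
    return output
```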

One of the challenges of caching is that cached outputs may become stale or irrelevant over time. For example, a cached answer to a question may become outdated or incorrect due to new information or events. One possible solution is to use pre-computing for caching.

Pre-computing is a technique that generates outputs offline or asynchronously before serving them online. Pre-computing allows us to update the cache periodically with fresh and relevant outputs. For example, we can pre-compute answers to frequently asked questions or common queries using the latest data and information available. We can also pre-compute outputs for anticipated queries based on user behavior or trends.

Pre-computing also provides another benefit: batch efficiency. Batch efficiency is the ability to generate multiple outputs at once rather than one at a time. Batch efficiency reduces the per-output cost and latency of generation by leveraging parallelism and amortization. Pre-computing enables batch efficiency by allowing us to generate outputs in batches offline or asynchronously rather than in real-time online.
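
As an illustration, here is a sketch of an offline pre-compute job that refreshes cached answers for a hypothetical list of frequently asked questions in one batch; batched_generate() is a stand-in for whatever batch-inference path is available.

```python
# Offline pre-compute job: refresh cached answers for anticipated queries in one batch.
faq_cache: dict[str, str] = {}

def batched_generate(prompts: list[str]) -> list[str]:
    """Placeholder for a batch LLM call that amortizes latency and cost across prompts."""
    return [f"<llm output for: {p}>" for p in prompts]

def precompute_faqs(questions: list[str]) -> None:
    """Run periodically (e.g. nightly) so the online path only does cache lookups."""
    for question, answer in zip(questions, batched_generate(questions)):
        faq_cache[question] = answer

precompute_faqs([
    "What is your refund policy?",
    "How do I reset my password?",
])
```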

Overall, pre-computing and caching shift the compute burden offline while enabling fast and cheap serving online.

Guardrails

Guardrails are a set of rules or constraints that ensure output quality and reliability. Guardrails act as validators that check the output before serving it to the user. Guardrails provide several benefits for LLM systems and products.

First, guardrails prevent errors and failures. LLMs are not perfect and may generate outputs that are incorrect, irrelevant, inappropriate, or harmful. These outputs can affect user trust and satisfaction, as well as cause legal or ethical issues. Guardrails prevent these outputs from reaching the user by filtering them out or correcting them.

Second, guardrails enforce standards and expectations. LLMs may generate outputs that are inconsistent, incomplete, or incomprehensible. These outputs can affect user understanding and engagement, as well as cause confusion or frustration. Guardrails enforce standards and expectations for output quality and format by ensuring they meet certain criteria or specifications.

There are many types of guardrails that can be applied to LLM systems and products, depending on the task, domain, and user requirements. Here are some examples of guardrails:

  • Structural guardrails ensure output conforms to an expected JSON schema (see the sketch after this list).
  • Syntactic guardrails check for valid code syntax.
  • Content safety guardrails screen for inappropriate language.
  • Semantic guardrails validate factual accuracy and relevance to context.
  • Input guardrails limit inputs to mitigate harmful responses.

Overall, guardrails act as validators to ensure high quality and reliable LLM outputs.

Defensive UX

Defensive UX is a design principle that anticipates errors and gracefully handles imperfect AI interactions. Defensive UX provides several benefits for LLM systems and products.

First, defensive UX sets proper user expectations. LLMs are not perfect and may generate outputs that are inaccurate, irrelevant, inappropriate, or incomprehensible. These outputs can affect user trust and satisfaction, as well as cause frustration or disappointment. Defensive UX sets proper user expectations by communicating the capabilities and limitations of the LLM system or product, as well as providing feedback and guidance on how to use it effectively.

Second, defensive UX enables efficient dismissal. LLMs may generate outputs that are undesired, unwanted, or unhelpful. These outputs can affect user experience and engagement, as well as cause annoyance or distraction. Defensive UX enables efficient dismissal by allowing the user to easily reject or ignore the LLM output, as well as providing alternative options or suggestions.

Third, defensive UX enhances user trust. LLMs may generate outputs that are uncertain, ambiguous, or incomplete. These outputs can affect user confidence and understanding, as well as cause confusion or doubt. Defensive UX enhances user trust by providing context and attribution for the LLM output, as well as explaining the reasoning or logic behind it.

There are many guidelines and best practices for designing defensive UX for LLM systems and products. Here are some key similarities in human-AI interaction guidelines from Microsoft, Google, and Apple:

  • Set proper user expectations about capabilities: Communicate what the LLM can and cannot do, how accurate or reliable it is, and what kind of inputs or outputs it expects.
  • Enable efficient dismissal of undesired AI behaviors: Allow the user to easily undo, cancel, or skip the LLM output, as well as provide alternative options or suggestions.
  • Provide context and attribution for AI outputs: Indicate where the LLM output comes from, how it was generated, and what sources or data it used.
  • Anchor new AI features on familiar UX: Use existing UI elements, patterns, or metaphors to introduce new LLM features or functionalities.
  • Anticipate errors and gracefully handle imperfect AI interactions: Detect and prevent potential errors or failures, provide clear and helpful error messages, and offer recovery or mitigation strategies.

Overall, their guidelines highlight anticipating errors and gracefully handling imperfect AI interactions.

Implicit User Feedback

Implicit user feedback is a type of feedback that is inferred from user behavior rather than explicitly provided by the user. Implicit user feedback provides several benefits for LLM systems and products.

First, implicit user feedback provides diverse behavioral data. LLMs can learn from various types of implicit user feedback, such as usage frequency, duration, intensity, pattern, etc. These types of feedback can capture different aspects of user behavior, such as satisfaction, engagement, effort, preference, etc. For example, usage over time indicates product stickiness. Conversation length can signal engagement or effort needed.

Second, implicit user feedback enables continuous improvement. LLMs can use implicit user feedback to optimize their performance and behavior for user needs. For example, copilot-style assistants collect rich signal on which suggestions users accept or ignore. Chatbots learn from query patterns and conversation flows.
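
As a rough sketch, capturing these signals can be as simple as appending structured events to a log that later feeds evaluation or fine-tuning; the event fields and file path below are illustrative rather than a standard schema.

```python
import json
import time

def log_suggestion_event(path: str, session_id: str, suggestion: str, accepted: bool) -> None:
    """Append one implicit-feedback event (suggestion shown, then accepted or dismissed)."""
    event = {
        "timestamp": time.time(),
        "session_id": session_id,
        "suggestion": suggestion,
        "accepted": accepted,  # acceptance reflects usefulness or relevance
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: the user accepted one completion and dismissed another.
log_suggestion_event("feedback.jsonl", "sess-42", "def parse_date(s):", accepted=True)
log_suggestion_event("feedback.jsonl", "sess-42", "import pandas as pd", accepted=False)
```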

Third, implicit user feedback reduces user burden. LLMs can collect implicit user feedback without requiring any additional input or action from the user. This reduces user burden and increases user convenience.

There are many ways to collect implicit user feedback for LLM systems and products, depending on the task, domain, and user requirements. Here are some examples of implicit user feedback for LLMs:

  • Usage over time: Indicates product stickiness
  • Conversation length: Signals engagement or effort needed
  • Suggestion acceptance: Reflects usefulness or relevance
  • Query reformulation: Implies dissatisfaction or confusion
  • Dialogue branching: Reveals interest or preference

In essence, implicit user feedback provides diverse behavioral data to optimize LLMs for user needs.

Conclusion

In this article, we have expanded on some of the key patterns for building LLM-based systems and products covered in a comprehensive post by Eugene Yan. These patterns are:

  • Evaluations (Evals): To measure performance using GPT-4 as an automated evaluator
  • Retrieval-Augmented Generation (RAG): To add recent, external knowledge using feature-extraction based search
  • Fine-Tuning: To get better at specific tasks using task-specific demonstrations and data
  • Caching: To reduce latency and cost using pre-computing and batch efficiency
  • Guardrails: To ensure output quality using various types of validators
  • Defensive UX: To anticipate and manage errors gracefully using human-AI interaction guidelines
  • Collecting User Feedback: To build a data flywheel using implicit user feedback

We hope this article has provided some useful insights and tips for developing LLM-based solutions.

References:

Liu, Yang, et al. "G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment." arXiv preprint arXiv:2303.16634 (2023).

Yan, Ziyou. (Jul 2023). Patterns for Building LLM-based Systems & Products. eugeneyan.com. https://eugeneyan.com/writing/llm-patterns/.

Google. People + AI Guidebook (2023).

Apple. Human Interface Guidelines for Machine Learning (2023).

Guidance (2023).
