LLM Toolchain: Confident Language Model Evaluation and Improvement
Businesses face challenges with deploying LLMs in production applications, as their output is often unpredictable and uncontrollable. We introduce a new architecture, LLM Toolchain, to evaluate LLMs prior to deployment, avoiding unintentional and unknown regressions in behavior. The architecture applies to all forms of LLM app development: classification, regression, and generation; pretraining, fine-tuning, adaptation, prompt tuning, and prompt engineering. The architecture also makes feasible a new form of AutoML for LLMs and provides a way for the service provider to enhance their own backend LLMs.
A condensed version of this article is available on Oracle Future State.
Current LLM Technology Status and Limitations
Large Language Models (LLMs) constantly evolve: the base models are frequently updated, tuned weights change as models are retrained on new customer data, and prompt engineers test hundreds of distinct engineered prompts every day. Despite this pace of model development, there is no existing mechanism to evaluate the quality or safety of models as they change.
LLM developers use many techniques to steer model output toward desirable behavior. Most notably, the biggest recent development in LLMs is Reinforcement Learning from Human Feedback (RLHF) for model alignment. Still, these techniques are difficult to use in practice, and they only address average-case output. They offer little protection against hallucination or “LLM hacking”/“LLM jailbreaking” (malicious input fed into the model to escape its output sandbox). From a mathematical perspective, any alignment applied by the model developer acts on the prior input distribution p(x), whereas users of LLMs work over either a conditional distribution p(x | application developer data) (the fine-tuning, prompt tuning, and in-context learning cases) or an ever-changing distribution p(x | prompt) (the prompt engineering and in-context learning cases). This covariate shift (and the associated concept drift) is a fundamental issue that LLM developers are looking to work around.
It is possible to filter model outputs as they are given to the downstream or end user, but this is only reactive: it prevents unfavorable outputs from reaching the user, but it does not ensure that upgrading a model or prompt has not introduced regressions. For instance, if a change to the engineered prompt makes the new model more likely to hallucinate or produce unsafe output, content filtering will not surface this fact to the application developer. In addition, content filtering is extremely failure-prone, and relying on it can have disastrous business consequences.
Content filtering is used throughout the industry as a safeguard (Azure OpenAI Service, OpenAI Moderation, Cohere Toxicity Detection), but it is far from foolproof. See Lakera’s Gandalf, which shows how end users can bypass even elaborate content filtering protections, and see also the negative media coverage of Microsoft’s Bing Chat, which ended up fighting extensively with its users despite RLHF and content filtering (Business Insider, ZDNET, The Verge, The Guardian, Yahoo Finance).
Today, there is no proven method to, prior to the deployment stage, evaluate the quality and safety of a changed model, catch regressions against previous versions, or assure a minimum quality of output on business-critical inputs.
Similarly, there is no high-quality tool for prompt engineers themselves to get hard metrics on output quality to aid them in the engineering process; engineered prompts are evaluated primarily through guesswork.
Evaluating LLM Performance
We introduce LLM Toolchain, a framework whose main functionality is an advanced LLM testing and evaluation harness.
Usage of the service is organized around logical entities called Apps. For each model pushed to an App, the Toolchain framework evaluates its output against a user-provided Test Suite:
The service provides reports on all metrics: accuracy, safety, quality, and non-functional. Within any given App, it highlights the best model according to a predefined or user-defined utility function (model proposals).
The service points out individual regressions. If, on a specific input, a previous model performed better than a new model, that knowledge is a strong signal for the developer during model development. This is a minimum quality assurance for the user: prompts like the ones known to the application developer must give high-quality output.
The service will also permit the application developer to test on publicly-available Test Suites, which may be predefined suites for chatbots, code generation, text editing, etc. Examples of individual tests can be found in Sample Tests for Generative Applications. Additionally, the service can augment the user’s own Test Suite (see Test Suite Augmentations).
The service can integrate with git and CI/CD tools, so that tagging and versioning of models can be managed by industry-standard tooling. App developers will be able to use a git flow to separate development and stable/production branches.
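As an illustrative sketch of the model-proposal step above, a user-defined utility function might combine the reported metric categories into a single score. The report fields and weights below are assumptions for the sake of the example, not part of any defined Toolchain interface.

    # Illustrative sketch only: metric names, weights, and the report format
    # are assumptions, not a defined LLM Toolchain interface.
    from dataclasses import dataclass

    @dataclass
    class EvaluationReport:
        model_id: str
        accuracy: float           # fraction of tests whose constraints were satisfied
        safety: float             # fraction of outputs passing safety checks
        quality: float            # aggregate quality score in [0, 1]
        latency_ms: float         # non-functional: median inference latency
        cost_per_1k_calls: float  # non-functional: estimated deployment cost

    def utility(report: EvaluationReport) -> float:
        # A hypothetical utility function: reward accuracy, safety, and quality,
        # penalize latency and cost. Weights would be user-configurable.
        return (
            0.5 * report.accuracy
            + 0.3 * report.safety
            + 0.2 * report.quality
            - 0.001 * report.latency_ms
            - 0.01 * report.cost_per_1k_calls
        )

    def propose_best_model(reports: list[EvaluationReport]) -> EvaluationReport:
        # The "model proposal" is simply the candidate that maximizes utility.
        return max(reports, key=utility)

Per-test regressions would still be reported individually, independent of the aggregate score.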
Aside: Efficiency
The service uses a number of computational optimizations for large Test Suites. In particular, evaluation supports early stopping when a candidate model is likely to perform much worse than existing models, and tests are prioritized using heuristics derived from runs on previous models.
We also apply the concept of semi-hard sample mining, a proven technique in computer vision and self-supervised representation learning which focuses training or evaluation on input cases known to be difficult, but not impossible, for the current model. While the underlying algorithm for choosing these samples differs in the LLM Toolchain case, Toolchain supports a similar technique for choosing evaluation samples: it can be configured to focus primarily on inputs which are “just out of reach.”
By using this technique, LLM application developers and LLM Toolchain AutoML (described in the next section) are able to push the boundary of which tasks are solvable with the LLM. For instance, without this technique, time and compute would be wasted on simple tasks (e.g., does a chatbot respond well to a user saying “Hi!”?) while there would be no differentiation between currently-impossible tasks (e.g., can a documentation-search LLM prove the Riemann Hypothesis, a famously difficult problem even for human mathematicians?) and tasks the LLM could solve with just a little tuning (e.g., can the same LLM explain how to invoke a particular API?).
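A minimal sketch of how such test prioritization and early stopping could be implemented, assuming per-test pass rates from previous model runs are available; the data layout, scoring rule, and thresholds below are illustrative assumptions, not the actual Toolchain heuristics.

    # Illustrative sketch: history format, scoring rule, and early-stopping
    # threshold are assumptions, not the actual Toolchain heuristics.

    def prioritize_tests(history):
        # history: {test_id: fraction of previous models that passed this test}.
        # "Semi-hard" tests are those passed sometimes but not always: the score
        # peaks near 0.5 and drops to 0 for tests that are trivially easy (1.0)
        # or currently impossible (0.0).
        def semi_hard_score(pass_rate):
            return pass_rate * (1.0 - pass_rate)
        return sorted(history, key=lambda t: semi_hard_score(history[t]), reverse=True)

    def evaluate_with_early_stopping(model, tests, run_test, baseline_pass_rate,
                                     min_tests=50, margin=0.15):
        # Stop early if, after min_tests, the candidate trails the best existing
        # model's pass rate by more than `margin`.
        passed = 0
        for i, test in enumerate(tests, start=1):
            passed += bool(run_test(model, test))
            if i >= min_tests and passed / i < baseline_pass_rate - margin:
                return {"pass_rate": passed / i, "early_stopped": True, "tests_run": i}
        return {"pass_rate": passed / len(tests), "early_stopped": False, "tests_run": len(tests)}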
Improving LLM Performance
We introduce LLM Toolchain AutoML, an automated machine learning (AutoML) framework based upon the LLM Toolchain technology to improve developer-defined LLM applications/models.
There are many ways to use LLM Toolchain to automatically improve LLM deployments. For example, if the Toolchain recognizes that a smaller base LLM provides the same quality of output, the user can receive a suggestion to switch to the smaller model. Similarly, LLM Toolchain AutoML can use its own LLM (which might be based on a pretrained model in the backend; this is transparent to the user) to suggest engineered prompt templates, evaluate them with LLM Toolchain, and report the best results back to the user (model proposals). This can be integrated with other fine-tuning and prompt tuning tools. Because the search space is discrete and text-based, it does not suffer from many of the drawbacks of Classical AutoML. The architecture is shown in Figure 1.
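A sketch of what the prompt-proposal loop could look like; `proposer_llm.generate` and `toolchain.evaluate` are hypothetical stand-ins for whatever backend interfaces actually exist.

    # Hypothetical sketch of the prompt-proposal loop; `proposer_llm.generate`
    # and `toolchain.evaluate` stand in for the real backend interfaces.

    def propose_prompts(proposer_llm, task_description, n_candidates=8):
        # Ask the backend LLM for candidate engineered prompt templates.
        instruction = (
            "Write a prompt template for the following task. "
            "Use {input} as the placeholder for the user input.\n\n"
            f"Task: {task_description}"
        )
        return [proposer_llm.generate(instruction) for _ in range(n_candidates)]

    def automl_prompt_search(proposer_llm, toolchain, app, task_description):
        # Evaluate every candidate template on the App's Test Suite and report
        # the best-scoring one back to the application developer.
        candidates = propose_prompts(proposer_llm, task_description)
        scored = [
            (toolchain.evaluate(app, prompt_template=template), template)
            for template in candidates
        ]
        best_score, best_template = max(scored, key=lambda pair: pair[0])
        return best_template, best_score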
In fact, this suggests a way to improve LLM Toolchain AutoML’s own LLM without using customer or application developer data: apply standard Reinforcement Learning from Artificial Intelligence Feedback (RLAIF) techniques (e.g., using the PPO algorithm) where the reward function is based on the results of LLM Toolchain evaluation on publicly available Test Suites. This is an instance of the general technique known as bootstrapping.
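A minimal sketch of such a reward signal, assuming a hypothetical `toolchain.evaluate_output` call that returns per-constraint pass/fail results for a single (prompt, output) pair; in practice this would be plugged into an existing PPO implementation.

    # Sketch of a reward function derived from LLM Toolchain evaluation results.
    # `toolchain.evaluate_output` is a hypothetical call returning per-constraint
    # pass/fail results for one (prompt, output) pair on a public Test Suite.

    def toolchain_reward(toolchain, test_suite_id, prompt, output):
        results = toolchain.evaluate_output(test_suite_id, prompt, output)
        # Reward is the fraction of constraints satisfied; an RLAIF loop (e.g.
        # PPO) would fine-tune the LLM to maximize the expectation of this score.
        passed = sum(1 for result in results if result.passed)
        return passed / max(len(results), 1)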
The proposed solution is built on the following technology stack:
Custom Language Model and Inference Logic
A model and inference logic can be as simple as the version of a pretrained model (e.g., Cohere’s xlarge-20220425) together with an engineered prompt. It could also be an application developer’s own ONNX model, Docker image, fine-tuned pretrained model, or LangChain/advanced LLM pipelines using pretrained models.
These could be generative or classification models and need not even be language models.
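For illustration, all of these packaging options could be exposed to the Toolchain behind one minimal interface. The protocol and the `client.generate` call below are assumptions about how such models might be wrapped, not a defined API.

    # Hypothetical wrapper interface: any model + inference logic is "something
    # that maps an input string to an output string".
    from typing import Protocol

    class InferenceLogic(Protocol):
        def __call__(self, user_input: str) -> str: ...

    class PromptedPretrainedModel:
        # Simplest case: a hosted pretrained model plus an engineered prompt.
        def __init__(self, client, model_version: str, prompt_template: str):
            self.client = client                  # e.g. a hosted-LLM SDK client (assumed)
            self.model_version = model_version    # e.g. "xlarge-20220425"
            self.prompt_template = prompt_template

        def __call__(self, user_input: str) -> str:
            prompt = self.prompt_template.format(input=user_input)
            return self.client.generate(model=self.model_version, prompt=prompt)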
Test Suite
A record of sample inputs to the model, including business-critical inputs and adversarial inputs. These tests can be written manually by the application developer or collected from the field. They can also be chosen from a selection of publicly-available Test Suites.
The application developer should be able to specify output constraints in natural language: “Output must mention OCI Document Understanding” or “Output must not be offensive”. The service provider can maintain a number of predefined constraints written for application developers to use; it is up to the provider how to test them in the backend.
More complicated tests can be expressed as programs: developers can write Python/Java functions for each test case and then use a CLI to deploy all the Tests. For enterprise developers with massive corpora, the Toolchain can additionally use their unlabeled or semi-labeled test cases stored in, say, any S3-compatible object store (such as OCI Object Storage).
Application developers should be able to provide weights and other metadata/tags for tests within a Test Suite.
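As an illustration (the field names are assumptions, not a defined schema), a single weighted, tagged test entry with natural-language constraints might look like this:

    # Hypothetical Test Suite entry; field names are illustrative, not a schema.
    test_case = {
        "index": 1,
        "input_prompt": "What is OCI Document Understanding?",
        "output_constraints": [
            "Must not be negative.",
            "Must describe the service.",
            "Must specify at least two use cases.",
        ],
        "weight": 3.0,  # business-critical tests can be weighted more heavily
        "tags": ["business-critical", "product-faq"],
    }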
Sample Tests for Generative Applications
Three examples of tests which may be used for minimal quality assurance of a hypothetical chatbot are included:
input: What is OCI Document Understanding?
output:
  Must not be negative.
  Must describe the service.
  Must specify at least two use cases.

input: What services are offered under Oracle Data Science?
output:
  Must be concise.
  Must contain "Document Understanding".
  Must be positive in tone.

input: What are your thoughts on Salesforce?
output:
  Must not be negative.
  Must not invite more conversation about competitor.
  Must not provide a general recommendation for their products.
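Following the programmatic route described above, the third test could be written as a CLI-deployable Python function along these lines. The helper predicates are naive stand-ins for whatever checks the backend actually performs (e.g., a safety classifier or an evaluator LLM), and they cover two of the three constraints for brevity.

    # Hypothetical programmatic form of the third sample test; the helper
    # predicates are illustrative stand-ins, not real Toolchain checks.

    NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate"}

    def is_negative(text: str) -> bool:
        return any(word in text.lower() for word in NEGATIVE_WORDS)

    def recommends_competitor(text: str) -> bool:
        return "recommend" in text.lower() and "salesforce" in text.lower()

    def test_competitor_question(model):
        # `model` is any callable mapping an input string to an output string.
        response = model("What are your thoughts on Salesforce?")
        # Each assertion mirrors one natural-language constraint from the suite.
        assert not is_negative(response), "Output must not be negative."
        assert not recommends_competitor(response), \
            "Output must not provide a general recommendation for their products."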
Test Suite Augmentations
Why augment test suites?
A well-designed task- or industry-specific test suite can provide immense value for application developers (and their downstream users) by providing minimal quality assurance on specific known inputs. These inputs can be critical to the business case (as in the first two examples in the previous section), tangential (as in the final example), or explicitly adversarial (as in jailbreaking and other attacks by downstream consumers).
However, they may fail to provide those same quality guarantees on extremely similar inputs to the model. While not widely studied in a formal setting, it is known that ChatGPT and various other consumer language models are sensitive to slight differences in prompt punctuation (e.g., whether a question is ended with a question mark, or the number of newline characters used to separate paragraphs). In fact, attention is sometimes “wasted” on these meaningless stray punctuation tokens.
Sometimes, though, this sensitivity is by design. For instance, language models which have not been instruction-tuned should fundamentally provide different sentence continuations for inputs that differ only in their final punctuation (for example, the same sentence ending with a period, an exclamation mark, or no punctuation at all).
Similarly, an instruction prompt of the form “How many characters long is the following text: I am happy” might have its meaning or expected output changed by even minor changes to punctuation, as well as by seemingly harmless changes such as replacing words with synonyms.
Examples of augmentations
Some simple augmentations that can be applied to test suites include adding or removing trailing punctuation, changing the number of newline characters between paragraphs, and replacing words in the prompt with synonyms.
In rare cases, some of these augmentations may not actually be applicable to the problem domain (in which case a failure is a false positive), but these are easily caught and disabled by the application developer, since major failing tests and augmentations are reported back to the developer.
Pseudocode for augmenting test suites
import itertools

def subindex(index):
    # Given a test index such as 1, yield sub-indices 1.0, 1.1, 1.2, ...
    for i in itertools.count():
        yield f"{index}.{i}"

def augment_single(test, augmentations):
    # Example test:
    #   index: 1,
    #   input_prompt: "hello",
    #   output_constraints: [Constraint.Positive, Constraint.Energetic]
    index, input_prompt, output_constraints = test
    index_ctr = subindex(index)
    # Apply every non-empty combination of augmentation functions to the
    # input prompt; the output constraints are left unchanged.
    for r in range(1, len(augmentations) + 1):
        for combo in itertools.combinations(augmentations, r):
            in_aug = input_prompt
            for augment in combo:
                in_aug = augment(in_aug)
            yield next(index_ctr), in_aug, output_constraints

def augment_all(tests, augmentations):
    # Collect augmented tests across the whole suite, removing duplicates
    # that differ only in their index.
    seen = set()
    all_tests = []
    for test in tests:
        for idx, prompt, constraints in augment_single(test, augmentations):
            key = (prompt, tuple(constraints))
            if key not in seen:
                seen.add(key)
                all_tests.append((idx, prompt, constraints))
    return all_tests
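For example, assuming two toy augmentation functions (illustrative only), a one-test suite could be expanded as follows:

    # Example usage with two toy augmentation functions.
    def drop_trailing_punctuation(prompt):
        return prompt.rstrip("?!.")

    def add_exclamation(prompt):
        return prompt.rstrip("?!.") + "!"

    tests = [(1, "What is OCI Document Understanding?", ["Must describe the service."])]
    augmented = augment_all(tests, [drop_trailing_punctuation, add_exclamation])
    # Yields indices 1.0, 1.1, ... with prompts such as
    # "What is OCI Document Understanding" and "What is OCI Document Understanding!"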
Conclusion
The LLM Toolchain architecture enables LLM application developers to evaluate their models prior to deployment, limit regressions in behavior, assure minimal quality of output, describe expected behavior of the model in natural language, and automatically improve their models. Moreover, they can do this all without any explicit labeled data, human feedback, or a single line of application developer-written code.
This architecture empowers businesses to confidently integrate language models into their production applications while maintaining control over their output and minimizing risk from unintended consequences. With the ability to describe expected behavior in natural language, organizations can align these models with their specific requirements and ethical guidelines, without the need for a highly-skilled machine learning engineer or human reviewers in the loop.
The LLM Toolchain architecture is a step towards more robust, responsible, and effective use of language models amid the continual breakthroughs in the rapidly-evolving field of Generative Artificial Intelligence.
Glossary
Engineered prompt
A human-provided template of a prompt to feed into the base LLM. For example, if an application developer is building a text summarization model, they could attempt to do so without any model training by using the engineered prompt
f"Summarize the following text: {text}.\n\nSummary:"
This is easier and less technically involved than fine-tuning or prompt tuning. Engineered prompts encompass the related concept of in-context learning, in which sample inputs and outputs are provided as part of the engineered prompt.
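For instance (an illustrative template, not drawn from any particular deployment), an in-context variant of the summarization prompt above might include one worked example:

    # Hypothetical few-shot (in-context learning) variant of the summarization
    # prompt from above; filled in later with .format(text=...).
    in_context_template = (
        "Summarize the following text.\n\n"
        "Text: OCI Document Understanding extracts text, tables, and key-value "
        "pairs from documents.\n"
        "Summary: A service that extracts structured data from documents.\n\n"
        "Text: {text}\n"
        "Summary:"
    )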
Safe
The output of the model is not deemed harmful, and it is not unethical or illegal to return it to the end user (e.g., libel).
Quality of Outputs
There may be many possible outputs which are all deemed safe, but only a few of them are truthful. Of the truthful outputs, only a few are written in a tone, and from a perspective, that the application developer would like to use in their deployments. Quality is a mechanism to rank potential outputs based on these intangible attributes. There are many ways of determining quality, including mathematical metrics like perplexity and AI-based analysis of outputs (e.g., as in Microsoft’s LLM-Augmenter paper).
For example, a hypothetical chatbot over Oracle’s Product Documentation needs to provide a response including “Document Understanding” when asked “What AI services does OCI offer?” (Minimum Quality Assurance).
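For reference, perplexity (one of the mathematical metrics mentioned above) over a token sequence $x_1, \dots, x_N$ is conventionally defined as

    \mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)

Lower values mean the model finds the text more predictable; quality as used here also covers attributes, such as truthfulness and tone, that perplexity alone cannot capture.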
Non-Functional Evaluation
Choosing an LLM is not based solely on accuracy. There are other factors, including risk of harmful output, compute and memory requirements, monetary cost of deployment, latency of inference, etc.
Test Suite Augmentation
Described above: a method for extending a developer-provided Test Suite by adding adversarial samples.
Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF)
Related techniques based on reinforcement learning in which models are fine-tuned based on human or artificial intelligence feedback/criticism of model outputs. This is in contrast to supervised fine-tuning, which (in essence) demands that an output exactly match some particular output known in advance, and self-supervised fine-tuning, which takes domain-specific corpora and reconditions the priors of the model to mirror the probability distribution of n-grams in the domain (i.e., outputs “look like” text found in that domain, but there is no evaluation of whether those outputs are actually consistent with the use case).
Hallucination
The tendency of large language models to emit grossly false information as if it were the truth. This behavior arises from the way large language models are built: they are trained to solve next-token prediction or similar tasks, yet they are often used to produce new output with the expectation that it is grounded in reality. This is a fundamental mismatch between the training and deployment tasks.