LLM Toolchain: Confident Language Model Evaluation and Improvement
Businesses face challenges with deploying LLMs in production applications, as their output is often unpredictable and uncontrollable. We introduce a new architecture, LLM Toolchain, to evaluate LLMs prior to deployment, avoiding unintentional and unknown regressions in behavior. The architecture applies to all forms of LLM app development: classification, regression, and generation; pretraining, fine-tuning, adaptation, prompt tuning, and prompt engineering. The architecture also makes feasible a new form of AutoML for LLMs and provides a way for the service provider to enhance their own backend LLMs.
A condensed version of this article is available on Oracle Future State.
Current LLM Technology Status and Limitations
Large Language Models (LLMs) constantly evolve: the base models are frequently updated, tuned weights change as models are retrained on new customer data, and prompt engineers test hundreds of distinct engineered prompts every day. Despite this pace of model development, there is no existing mechanism to evaluate the quality or safety of models as they change.
LLM developers use many techniques to steer model output toward desirable behavior. Most notably, the biggest recent development in LLMs is Reinforcement Learning from Human Feedback (RLHF) for model alignment. Still, these techniques are difficult to use in practice, and they only address average-case output. They offer little protection against hallucination or “LLM hacking”/“LLM jailbreaking” (malicious input fed into the model to escape its output sandbox). From a mathematical perspective, any alignment applied by the model developer acts on the prior input distribution p(x), whereas users of LLMs work over either a conditional distribution p(x | application developer data) (the fine-tuning, prompt tuning, and in-context learning cases) or an ever-changing distribution p(x | prompt) (the prompt engineering and in-context learning cases). This covariate shift (and the associated concept drift) is a fundamental issue that LLM developers are looking to work around.
It is possible to filter model outputs as they are given to the downstream or end user, but this is only reactive: it prevents unfavorable outputs from reaching the user, but it does not ensure that upgrading a model or prompt has not introduced regressions. For instance, if a change to the engineered prompt makes the new model more likely to hallucinate or produce unsafe output, content filtering will not surface this fact to the application developer. In addition, content filtering is extremely failure-prone, and relying on it can have disastrous business consequences.
Content filtering is used throughout the industry as a safeguard (Azure OpenAI Service, OpenAI Moderation, Cohere Toxicity Detection), but it is far from foolproof. See Lakera’s Gandalf, which shows how end users can bypass even elaborate content filtering protections, and see also the negative media coverage of Microsoft’s Bing Chat, which ended up fighting extensively with its users despite RLHF and content filtering (Business Insider, ZDNET, The Verge, The Guardian, Yahoo Finance).
Today, there is no proven method to, prior to the deployment stage, evaluate the quality and safety of a changed model, catch regressions against previous versions, or assure a minimum quality of output on business-critical inputs.
Similarly, there is no high-quality tool for prompt engineers themselves to get hard metrics on output quality to aid them in the engineering process; engineered prompts are evaluated primarily through guesswork.
Evaluating LLM Performance
We introduce LLM Toolchain, a framework whose main functionality is an advanced LLM testing and evaluation harness.
Usage of the service is organized around logical entities called Apps. For each model pushed to an App, the Toolchain framework evaluates its output against a user-provided Test Suite:
The service provides reports on all metrics: accuracy, safety, quality, and non-functional. Within any given App, it highlights the best model according to a predefined or user-defined utility function (model proposals).
The service points out individual regressions. If, on a specific input, a previous model performed better than a new model, that knowledge is a strong signal for the developer during model development. This is a minimum quality assurance for the user: prompts like the ones known to the application developer must give high-quality output.
The service will also permit the application developer to test on publicly-available Test Suites, which may be predefined suites for chatbots, code generation, text editing, etc. Examples of individual tests can be found in Sample Tests for Generative Applications. Additionally, the service can augment the user’s own Test Suite (see Test Suite Augmentations).
The service can integrate with git and CI/CD tools, so that tagging and versioning of models can be managed by industry-standard tooling. App developers will be able to use a git flow to separate development and stable/production branches.
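As an illustrative sketch of the model-proposal step above, a user-defined utility function might combine the reported metric categories into a single score. The report fields and weights below are assumptions for the sake of the example, not part of any defined Toolchain interface.

    # Illustrative sketch only: metric names, weights, and the report format
    # are assumptions, not a defined LLM Toolchain interface.
    from dataclasses import dataclass

    @dataclass
    class EvaluationReport:
        model_id: str
        accuracy: float           # fraction of tests whose constraints were satisfied
        safety: float             # fraction of outputs passing safety checks
        quality: float            # aggregate quality score in [0, 1]
        latency_ms: float         # non-functional: median inference latency
        cost_per_1k_calls: float  # non-functional: estimated deployment cost

    def utility(report: EvaluationReport) -> float:
        # A hypothetical utility function: reward accuracy, safety, and quality,
        # penalize latency and cost. Weights would be user-configurable.
        return (
            0.5 * report.accuracy
            + 0.3 * report.safety
            + 0.2 * report.quality
            - 0.001 * report.latency_ms
            - 0.01 * report.cost_per_1k_calls
        )

    def propose_best_model(reports: list[EvaluationReport]) -> EvaluationReport:
        # The "model proposal" is simply the candidate that maximizes utility.
        return max(reports, key=utility)

Per-test regressions would still be reported individually, independent of the aggregate score.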
Aside: Efficiency
The service uses a number of computational optimizations for large Test Suites. In particular, evaluation supports early stopping when a candidate model is likely to perform much worse than existing models, and tests are prioritized using heuristics derived from runs on previous models.
We also apply the concept of semi-hard sample mining, a proven technique in computer vision and self-supervised representation learning which focuses training or evaluation on input cases known to be difficult, but not impossible, for the current model. While the underlying algorithm for choosing these samples differs in the LLM Toolchain case, Toolchain supports a similar technique for choosing evaluation samples: it can be configured to focus primarily on inputs which are “just out of reach.”
By using this technique, LLM application developers and LLM Toolchain AutoML (described in the next section) are able to push the boundary of which tasks are solvable with the LLM. For instance, without this technique, time and compute would be wasted on simple tasks (e.g., does a chatbot respond well to a user saying “Hi!”?) while there would be no differentiation between currently-impossible tasks (e.g., can a documentation-search LLM prove the Riemann Hypothesis, a famously difficult problem even for human mathematicians?) and tasks the LLM could solve with just a little tuning (e.g., can the same LLM explain how to invoke a particular API?).
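A minimal sketch of how such test prioritization and early stopping could be implemented, assuming per-test pass rates from previous model runs are available; the data layout, scoring rule, and thresholds below are illustrative assumptions, not the actual Toolchain heuristics.

    # Illustrative sketch: history format, scoring rule, and early-stopping
    # threshold are assumptions, not the actual Toolchain heuristics.

    def prioritize_tests(history):
        # history: {test_id: fraction of previous models that passed this test}.
        # "Semi-hard" tests are those passed sometimes but not always: the score
        # peaks near 0.5 and drops to 0 for tests that are trivially easy (1.0)
        # or currently impossible (0.0).
        def semi_hard_score(pass_rate):
            return pass_rate * (1.0 - pass_rate)
        return sorted(history, key=lambda t: semi_hard_score(history[t]), reverse=True)

    def evaluate_with_early_stopping(model, tests, run_test, baseline_pass_rate,
                                     min_tests=50, margin=0.15):
        # Stop early if, after min_tests, the candidate trails the best existing
        # model's pass rate by more than `margin`.
        passed = 0
        for i, test in enumerate(tests, start=1):
            passed += bool(run_test(model, test))
            if i >= min_tests and passed / i < baseline_pass_rate - margin:
                return {"pass_rate": passed / i, "early_stopped": True, "tests_run": i}
        return {"pass_rate": passed / len(tests), "early_stopped": False, "tests_run": len(tests)}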
Improving LLM Performance
We introduce LLM Toolchain AutoML, an automated machine learning (AutoML) framework based upon the LLM Toolchain technology to improve developer-defined LLM applications/models.
There are many ways to use LLM Toolchain to automatically improve LLM deployments. For example, if the Toolchain recognizes that a smaller base LLM provides the same quality of output, the user can receive a suggestion to switch to the smaller model. Similarly, LLM Toolchain AutoML can use its own LLM (which might be based on a pretrained model in the backend; this is transparent to the user) to suggest engineered prompt templates, evaluate them with LLM Toolchain, and report the best results back to the user (model proposals). This can be integrated with other fine-tuning and prompt tuning tools. Because the search space is discrete and text-based, it does not suffer from many of the drawbacks of Classical AutoML. The architecture is shown in Figure 1.
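A sketch of what the prompt-proposal loop could look like; `proposer_llm.generate` and `toolchain.evaluate` are hypothetical stand-ins for whatever backend interfaces actually exist.

    # Hypothetical sketch of the prompt-proposal loop; `proposer_llm.generate`
    # and `toolchain.evaluate` stand in for the real backend interfaces.

    def propose_prompts(proposer_llm, task_description, n_candidates=8):
        # Ask the backend LLM for candidate engineered prompt templates.
        instruction = (
            "Write a prompt template for the following task. "
            "Use {input} as the placeholder for the user input.\n\n"
            f"Task: {task_description}"
        )
        return [proposer_llm.generate(instruction) for _ in range(n_candidates)]

    def automl_prompt_search(proposer_llm, toolchain, app, task_description):
        # Evaluate every candidate template on the App's Test Suite and report
        # the best-scoring one back to the application developer.
        candidates = propose_prompts(proposer_llm, task_description)
        scored = [
            (toolchain.evaluate(app, prompt_template=template), template)
            for template in candidates
        ]
        best_score, best_template = max(scored, key=lambda pair: pair[0])
        return best_template, best_score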
In fact, this suggests a way to improve LLM Toolchain AutoML’s own LLM without using customer or application developer data: apply standard Reinforcement Learning from Artificial Intelligence Feedback (RLAIF) techniques (e.g., using the PPO algorithm) where the reward function is based on the results of LLM Toolchain evaluation on publicly available Test Suites. This is an instance of the general technique known as bootstrapping.
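A minimal sketch of such a reward signal, assuming a hypothetical `toolchain.evaluate_output` call that returns per-constraint pass/fail results for a single (prompt, output) pair; in practice this would be plugged into an existing PPO implementation.

    # Sketch of a reward function derived from LLM Toolchain evaluation results.
    # `toolchain.evaluate_output` is a hypothetical call returning per-constraint
    # pass/fail results for one (prompt, output) pair on a public Test Suite.

    def toolchain_reward(toolchain, test_suite_id, prompt, output):
        results = toolchain.evaluate_output(test_suite_id, prompt, output)
        # Reward is the fraction of constraints satisfied; an RLAIF loop (e.g.
        # PPO) would fine-tune the LLM to maximize the expectation of this score.
        passed = sum(1 for result in results if result.passed)
        return passed / max(len(results), 1)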
The proposed solution is built on the following technology stack:
Custom Language Model and Inference Logic
A model and inference logic can be as simple as the version of a pretrained model (e.g., Cohere’s xlarge-20220425) together with an engineered prompt. It could also be an application developer’s own ONNX model, Docker image, fine-tuned pretrained model, or LangChain/advanced LLM pipelines using pretrained models.
These could be generative or classification models and need not even be language models.
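For illustration, all of these packaging options could be exposed to the Toolchain behind one minimal interface. The protocol and the `client.generate` call below are assumptions about how such models might be wrapped, not a defined API.

    # Hypothetical wrapper interface: any model + inference logic is "something
    # that maps an input string to an output string".
    from typing import Protocol

    class InferenceLogic(Protocol):
        def __call__(self, user_input: str) -> str: ...

    class PromptedPretrainedModel:
        # Simplest case: a hosted pretrained model plus an engineered prompt.
        def __init__(self, client, model_version: str, prompt_template: str):
            self.client = client                  # e.g. a hosted-LLM SDK client (assumed)
            self.model_version = model_version    # e.g. "xlarge-20220425"
            self.prompt_template = prompt_template

        def __call__(self, user_input: str) -> str:
            prompt = self.prompt_template.format(input=user_input)
            return self.client.generate(model=self.model_version, prompt=prompt)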
Test Suite
A record of sample inputs to the model, including business-critical inputs and adversarial inputs. These tests can be written manually by the application developer or collected from the field. They can also be chosen from a selection of publicly-available Test Suites.
The application developer should be able to specify output constraints in natural language: “Output must mention OCI Document Understanding” or “Output must not be offensive”. The service provider can maintain a number of predefined constraints written for application developers to use; it is up to the provider how to test them in the backend.
More complicated tests can be expressed as programs: developers can write Python/Java functions for each test case and then use a CLI to deploy all the Tests. For enterprise developers with massive corpora, the Toolchain can additionally use their unlabeled or semi-labeled test cases stored in, say, any S3-compatible object store (such as OCI Object Storage).
Application developers should be able to provide weights and other metadata/tags for tests within a Test Suite.
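As an illustration (the field names are assumptions, not a defined schema), a single weighted, tagged test entry with natural-language constraints might look like this:

    # Hypothetical Test Suite entry; field names are illustrative, not a schema.
    test_case = {
        "index": 1,
        "input_prompt": "What is OCI Document Understanding?",
        "output_constraints": [
            "Must not be negative.",
            "Must describe the service.",
            "Must specify at least two use cases.",
        ],
        "weight": 3.0,  # business-critical tests can be weighted more heavily
        "tags": ["business-critical", "product-faq"],
    }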
Sample Tests for Generative Applications
Three examples of tests which may be used for minimal quality assurance of a hypothetical chatbot are included:
input: What is OCI Document Understanding?
output:
  Must not be negative.
  Must describe the service.
  Must specify at least two use cases.

input: What services are offered under Oracle Data Science?
output:
  Must be concise.
  Must contain "Document Understanding".
  Must be positive in tone.

input: What are your thoughts on Salesforce?
output:
  Must not be negative.
  Must not invite more conversation about competitor.
  Must not provide a general recommendation for their products.
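Following the programmatic route described above, the third test could be written as a CLI-deployable Python function along these lines. The helper predicates are naive stand-ins for whatever checks the backend actually performs (e.g., a safety classifier or an evaluator LLM), and they cover two of the three constraints for brevity.

    # Hypothetical programmatic form of the third sample test; the helper
    # predicates are illustrative stand-ins, not real Toolchain checks.

    NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate"}

    def is_negative(text: str) -> bool:
        return any(word in text.lower() for word in NEGATIVE_WORDS)

    def recommends_competitor(text: str) -> bool:
        return "recommend" in text.lower() and "salesforce" in text.lower()

    def test_competitor_question(model):
        # `model` is any callable mapping an input string to an output string.
        response = model("What are your thoughts on Salesforce?")
        # Each assertion mirrors one natural-language constraint from the suite.
        assert not is_negative(response), "Output must not be negative."
        assert not recommends_competitor(response), \
            "Output must not provide a general recommendation for their products."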
Test Suite Augmentations
Why augment test suites?
A well-designed task- or industry-specific test suite can provide immense value for application developers (and their downstream users) by providing minimal quality assurance on specific known inputs. These inputs can be critical to the business case (as in the first two examples in the previous section), tangential (as in the final example), or explicitly adversarial (as in jailbreaking and other attacks by downstream consumers).
However, they may fail to provide those same quality guarantees on extremely similar inputs to the model. While not widely studied in a formal setting, it is known that ChatGPT and various other consumer language models are sensitive to slight differences in prompt punctuation (e.g., whether a question is ended with a question mark, or the number of newline characters used to separate paragraphs). In fact, attention is sometimes “wasted” on these meaningless stray punctuation tokens.
Sometimes, though, this sensitivity is by design. For instance, language models which have not been instruction-tuned should fundamentally provide different sentence continuations for inputs that differ only in their final punctuation (for example, the same sentence ending with a period, an exclamation mark, or no punctuation at all).
Similarly, an instruction prompt of the form “How many characters long is the following text: I am happy” might have its meaning or expected output changed by even minor changes to punctuation, as well as by seemingly harmless changes such as replacing words with synonyms.
Examples of augmentations
Some simple augmentations that can be applied to test suites include adding or removing trailing punctuation, changing the number of newline characters between paragraphs, and replacing words in the prompt with synonyms.
In rare cases, some of these augmentations may not actually be applicable to the problem domain (in which case a failure is a false positive), but these are easily caught and disabled by the application developer, since major failing tests and augmentations are reported back to the developer.
Pseudocode for augmenting test suites
import itertools

def subindex(index):
    # Given a test index such as 1, yield sub-indices 1.0, 1.1, 1.2, ...
    for i in itertools.count():
        yield f"{index}.{i}"

def augment_single(test, augmentations):
    # Example test:
    #   index: 1,
    #   input_prompt: "hello",
    #   output_constraints: [Constraint.Positive, Constraint.Energetic]
    index, input_prompt, output_constraints = test
    index_ctr = subindex(index)
    # Apply every non-empty combination of augmentation functions to the
    # input prompt; the output constraints are left unchanged.
    for r in range(1, len(augmentations) + 1):
        for combo in itertools.combinations(augmentations, r):
            in_aug = input_prompt
            for augment in combo:
                in_aug = augment(in_aug)
            yield next(index_ctr), in_aug, output_constraints

def augment_all(tests, augmentations):
    # Collect augmented tests across the whole suite, removing duplicates
    # that differ only in their index.
    seen = set()
    all_tests = []
    for test in tests:
        for idx, prompt, constraints in augment_single(test, augmentations):
            key = (prompt, tuple(constraints))
            if key not in seen:
                seen.add(key)
                all_tests.append((idx, prompt, constraints))
    return all_tests
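For example, assuming two toy augmentation functions (illustrative only), a one-test suite could be expanded as follows:

    # Example usage with two toy augmentation functions.
    def drop_trailing_punctuation(prompt):
        return prompt.rstrip("?!.")

    def add_exclamation(prompt):
        return prompt.rstrip("?!.") + "!"

    tests = [(1, "What is OCI Document Understanding?", ["Must describe the service."])]
    augmented = augment_all(tests, [drop_trailing_punctuation, add_exclamation])
    # Yields indices 1.0, 1.1, ... with prompts such as
    # "What is OCI Document Understanding" and "What is OCI Document Understanding!"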
Conclusion
The LLM Toolchain architecture enables LLM application developers to evaluate their models prior to deployment, limit regressions in behavior, assure minimal quality of output, describe expected behavior of the model in natural language, and automatically improve their models. Moreover, they can do this all without any explicit labeled data, human feedback, or a single line of application developer-written code.
This architecture empowers businesses to confidently integrate language models into their production applications while maintaining control over their output and minimizing risk from unintended consequences. With the ability to describe expected behavior in natural language, organizations can align these models with their specific requirements and ethical guidelines, without the need for a highly-skilled machine learning engineer or human reviewers in the loop.
The LLM Toolchain architecture is a step towards more robust, responsible, and effective use of language models amid the continual breakthroughs in the rapidly-evolving field of Generative Artificial Intelligence.
Glossary
Engineered prompt
A human-provided template of a prompt to feed into the base LLM. For example, if an application developer is building a text summarization model, they could attempt to do so without any model training by using the engineered prompt
f"Summarize the following text: {text}.\n\nSummary:"
This is easier and less technically involved than fine-tuning or prompt tuning. Engineered prompts encompass the related concept of in-context learning, in which sample inputs and outputs are provided as part of the engineered prompt.
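For instance (an illustrative template, not drawn from any particular deployment), an in-context variant of the summarization prompt above might include one worked example:

    # Hypothetical few-shot (in-context learning) variant of the summarization
    # prompt from above; filled in later with .format(text=...).
    in_context_template = (
        "Summarize the following text.\n\n"
        "Text: OCI Document Understanding extracts text, tables, and key-value "
        "pairs from documents.\n"
        "Summary: A service that extracts structured data from documents.\n\n"
        "Text: {text}\n"
        "Summary:"
    )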
Safe
The output of the model is not deemed harmful, and it is not unethical or illegal to return it to the end user (e.g., libel).
Quality of Outputs
There may be many possible outputs which are all deemed safe, but only a few of them are truthful. Of the truthful outputs, only a few are written in a tone, and from a perspective, that the application developer would like to use in their deployments. Quality is a mechanism to rank potential outputs based on these intangible attributes. There are many ways of determining quality, including mathematical metrics like perplexity and AI-based analysis of outputs (e.g., as in Microsoft’s LLM-Augmenter paper).
For example, a hypothetical chatbot over Oracle’s Product Documentation needs to provide a response including “Document Understanding” when asked “What AI services does OCI offer?” (Minimum Quality Assurance).
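For reference, perplexity (one of the mathematical metrics mentioned above) over a token sequence $x_1, \dots, x_N$ is conventionally defined as

    \mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)

Lower values mean the model finds the text more predictable; quality as used here also covers attributes, such as truthfulness and tone, that perplexity alone cannot capture.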
Non-Functional Evaluation
Choosing an LLM is not based solely on accuracy. There are other factors, including risk of harmful output, compute and memory requirements, monetary cost of deployment, latency of inference, etc.
Test Suite Augmentation
Described above: a method for extending a developer-provided Test Suite by adding adversarial samples.
Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF)
Related techniques based on reinforcement learning in which models are fine-tuned based on human or artificial intelligence feedback/criticism of model outputs. This is in contrast to supervised fine-tuning, which (in essence) demands that an output exactly match some particular output known in advance, and self-supervised fine-tuning, which takes domain-specific corpora and reconditions the priors of the model to mirror the probability distribution of n-grams in the domain (i.e., outputs “look like” text found in that domain, but there is no evaluation of whether those outputs are actually consistent with the use case).
Hallucination
The tendency of large language models to emit grossly false information as if it were the truth. This behavior arises from the way large language models are built: they are trained to solve next-token prediction or similar tasks, yet they are often used to produce new output with the expectation that it is grounded in reality. This is a fundamental mismatch between the training and deployment tasks.