Contextualizing Large Language Models (LLMs) with Enterprise Data
Debmalya Biswas
AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Introduction
ChatGPT has been the talk of the town ever since its release in November 2022. The momentum is only accelerating with the release of the multi-modal GPT-4, and competing models such as Google's LaMDA and Meta AI's LLaMA. Enterprise adoption of generative models is also picking up via their integration with office productivity software, e.g., Microsoft 365 Copilot and Google Docs.
GPTs (Generative Pre-trained Transformers) belong to a class of foundation models (decoder-only Transformers) that need to be fine-tuned to accomplish downstream NLP tasks, such as summarization, question answering, and classification.
ChatGPT [1] is thus the chatbot application of the GPT-3 LLM. It is based on InstructGPT, released by OpenAI in January 2022.
Large Language Models (LLMs) underlying ChatGPT are trained on public datasets, e.g., Wikipedia. Given the controversial copyright issues around training on public datasets, GPT-4 does not even disclose the datasets it is trained on. We have also started seeing domain-specific LLMs, e.g., BioGPT by Microsoft Research, which is fine-tuned for biomedical text generation and mining.
To realize the full potential of Generative AI for Enterprises, the LLMs need to be contextualized with enterprise knowledge captured in terms of documents, wikis, business processes, etc.
There are primarily three approaches to achieve this enterprise contextualization: (1) Prompt Engineering, (2) Fine-tuning, and (3) Reinforcement Learning from Human Feedback (RLHF). We discuss the pros, cons, and feasibility of the three approaches in the sections below.
Prompt Engineering
Any chatbot [2], at a very high level, consists of the following steps: understanding the user query via a Natural Language Understanding (NLU) engine, and composing a reply via a Natural Language Generation (NLG) engine.
Prompt Engineering refers to adapting the user query (in natural language) and providing the right enterprise context and guidance to the NLU and NLG engines, to maximize the chances of getting the 'right' response.
This has led to the rise of Prompt Engineering as a professional discipline, where prompt engineers systematically perform trials, recording their findings, to arrive at the 'right' prompt that elicits the 'best' response.
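To make this concrete, here is a minimal sketch (in Python) of the kind of trial a prompt engineer might run: the same question is sent bare and then wrapped with enterprise context and instructions, and the responses are compared. The `ask_llm` function and the HR policy snippet are hypothetical placeholders, not part of any particular product.

```python
# Sketch of a prompt-engineering trial: compare the same user question
# with and without enterprise context. `ask_llm` is a placeholder for
# whichever LLM API is actually being used.

def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call (e.g., ChatGPT / Azure OpenAI).
    return f"<LLM response to: {prompt[:60]}...>"

user_query = "How many vacation days can I carry over to next year?"

# Hypothetical enterprise snippet, e.g., copied from an HR policy page.
enterprise_context = (
    "Company policy HR-2023-04: employees may carry over at most "
    "5 unused vacation days into the next calendar year."
)

prompt_variants = {
    "bare": user_query,
    "with_context": (
        "Answer the question using only the policy excerpt below. "
        "If the excerpt does not contain the answer, say so.\n\n"
        f"Policy excerpt: {enterprise_context}\n\n"
        f"Question: {user_query}"
    ),
}

for name, prompt in prompt_variants.items():
    print(f"--- {name} ---")
    print(ask_llm(prompt))
```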
Unfortunately, prompt engineering is not a scalable approach in my opinion. It is analogous to the age-old keyword based search, where the onus is on the user to provide the right keywords / context.
However, it might be the only feasible approach to add enterprise context / knowledge to closed systems, such as ChatGPT, where the only way to access the underlying LLM is via a Web interface or API.
The prompts can be relatively long, so it is possible to embed some enterprise context as part of the prompt. For instance, this is the currently recommended approach to provide enterprise context / knowledge to ChatGPT on Azure (link). Referring to the solution architecture below, the recommendation is basically to provide the Cognitive Search results as part of the prompt to ChatGPT.
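A rough sketch of that retrieve-then-prompt pattern, assuming the 0.x-era OpenAI Python SDK; the `search_enterprise_index` function is a hypothetical stand-in for Cognitive Search (or any retriever), and the top passages are simply pasted into the prompt:

```python
import openai  # pip install openai (0.x-era SDK assumed)

openai.api_key = "YOUR_API_KEY"  # or Azure OpenAI credentials

def search_enterprise_index(query: str, top_k: int = 3) -> list[str]:
    """Placeholder for Azure Cognitive Search (or any retriever) returning
    the top-k text passages relevant to the query."""
    return ["<passage 1>", "<passage 2>", "<passage 3>"][:top_k]

def answer_with_context(user_query: str) -> str:
    passages = search_enterprise_index(user_query)
    context = "\n\n".join(passages)

    # The retrieved passages are embedded directly into the prompt, so the
    # model answers from enterprise content rather than only its training data.
    messages = [
        {"role": "system",
         "content": "Answer using only the provided context. "
                    "If the answer is not in the context, say you don't know."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    return response["choices"][0]["message"]["content"]
```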
Fine-tuning
Fine-tuning primarily refers to Transfer Learning (TL), which allows building upon what the base model has learned before. We can take the features learned by a model and retrain them on new scenarios without having to retrain the model from scratch on the original dataset. This is important because each retraining iteration can take many hours of GPU processing for a complex neural network architecture and a large training dataset.
In an enterprise context, fine-tuning entails taking a pre-trained Large Language Model (LLM) and retraining it with (smaller) enterprise datasets. Technically, this implies updating the weights of the last layer(s) of the trained neural network to reflect the enterprise data and task.
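As an illustration, here is a minimal PyTorch / Hugging Face transformers sketch of that idea: everything is frozen except the last transformer block and the LM head, which are then updated on a couple of hypothetical enterprise snippets. The gpt2 model name is just a small, openly downloadable stand-in for whatever base LLM is actually used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open pre-trained LLM you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze everything, then unfreeze only the last transformer block and the
# LM head, so fine-tuning updates just the final layer(s) as described above.
for param in model.parameters():
    param.requires_grad = False
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5
)

# Hypothetical enterprise training snippets (in practice: documents, wikis, FAQs).
enterprise_texts = ["Our expense policy allows ...", "The onboarding process is ..."]

model.train()
for text in enterprise_texts:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**batch, labels=batch["input_ids"])  # causal-LM loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```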
Given this, access to the base model weights is needed to perform fine-tuning, which is not possible for closed models, e.g., ChatGPT.
This is where open-source pre-trained LLMs come to the rescue, thanks to Meta AI, who recently open-sourced their LLM LLaMA [4].
The Stanford Alpaca project showed that it is possible to fine-tune LLaMA for $600, to a model performance comparable with ChatGPT. So fine-tuning an LLM does not necessarily need to be very complex or expensive.
This of course assumes that the enterprise has the necessary annotated data to be used for fine-tuning / retraining. The Alpaca training recipe is available here (link). The team used a very interesting technique called self-instruct [5] to generate the dataset for fine-tuning. The figure below illustrates the training data generation process.
Starting with 175 human-written instruction-output pairs, the text-davinci-003 (OpenAI GPT-3.5) model was prompted to generate more instructions. The data generation process resulted in 52K unique instructions and corresponding outputs, which were then used in a supervised fashion to fine-tune the underlying LLaMA model. Generative models have previously been used to generate synthetic data [6].
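A very rough sketch of that bootstrapping loop is below, using the 0.x-era OpenAI SDK and text-davinci-003; the seed tasks and prompt wording are illustrative only, not the actual Alpaca / self-instruct prompts.

```python
import json
import openai  # 0.x-era SDK assumed

openai.api_key = "YOUR_API_KEY"

# A few human-written seed tasks; Alpaca started from 175 such pairs.
seed_tasks = [
    {"instruction": "Summarize the following paragraph.", "output": "..."},
    {"instruction": "Translate the sentence into French.", "output": "..."},
]

def generate_new_tasks(seeds, n_new=5):
    """Prompt the teacher model with seed examples and ask it to produce
    new instruction/output pairs (self-instruct style bootstrapping)."""
    prompt = (
        "Here are some example instructions and outputs:\n"
        + "\n".join(json.dumps(t) for t in seeds)
        + f"\n\nWrite {n_new} new, diverse instruction/output pairs, "
          "one JSON object per line."
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=1024, temperature=0.9
    )
    tasks = []
    for line in resp["choices"][0]["text"].splitlines():
        try:
            tasks.append(json.loads(line))
        except ValueError:
            continue  # skip lines the model did not format as JSON
    return tasks

# Generated pairs are filtered / de-duplicated and added back to the pool;
# repeating this eventually yields the ~52K examples used to fine-tune LLaMA.
new_tasks = generate_new_tasks(seed_tasks)
```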
The Stanford Alpaca training process is particularly interesting, as it leads to the promise of self-tuning, where the output of a generative model can be used to train another generative model.
Unfortunately, the process is not fully automated, and manual intervention is still needed. Machines are not taking over, at least not yet :-)
Reinforcement Learning from Human Feedback (RLHF)
LLMs, including ChatGPT, make extensive use of RLHF to improve their accuracy. Reinforcement Learning (RL) is a powerful technique that can achieve complex goals by maximizing a reward function in real time. The reward function works similarly to incentivizing a child with candy and spankings: the algorithm is penalized when it takes a wrong decision and rewarded when it takes a right one; this is reinforcement.
At the core of this approach [7] is a score (reward) model, which is trained to score chatbot query-response tuples based on (manual) user feedback. The scores predicted by this model are used as rewards for the RL agent. Proximal Policy Optimization (PPO) is then used as a final step to further tune ChatGPT.
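The snippet below is a minimal PyTorch sketch of such a score (reward) model, not OpenAI's actual implementation: a scalar head on a small pre-trained encoder, trained on a hypothetical human preference pair with a pairwise ranking loss. The PPO step that would then consume these scores is omitted.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Scores a (query, response) pair with a single scalar reward."""
    def __init__(self, base_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)
        self.score_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Use the first token's representation as a pooled summary of the pair.
        return self.score_head(hidden[:, 0, :]).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = RewardModel()

# One hypothetical human judgement: the "chosen" response was preferred
# over the "rejected" one for the same query.
query = "How do I reset my enterprise VPN password?"
chosen = "Open the IT portal, select 'VPN', and follow the reset wizard."
rejected = "I cannot help with that."

def score(q: str, response: str) -> torch.Tensor:
    batch = tokenizer(q, response, return_tensors="pt", truncation=True)
    return reward_model(batch["input_ids"], batch["attention_mask"])

# Pairwise ranking loss: push the chosen response's score above the rejected one's.
loss = -torch.nn.functional.logsigmoid(score(query, chosen) - score(query, rejected))
loss.mean().backward()
# In RLHF, the trained reward model's scores then serve as rewards for a PPO step
# that further tunes the chatbot's policy (the LLM itself).
```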
In short, retraining or adding new information to LLMs is not fully automated. RL-based training remains a complex task, and manual intervention is still needed to perform this in a targeted fashion and protect against bias / manipulation. Refer to [8] for a discussion on LLMOps architecture patterns to build enterprise LLMs.
References