Re-training Strategy for Fine-tuned LLMs
Debmalya Biswas
AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Change Data Capture (CDC) for Re-training LLMs / SLMs
Introduction: Fine-tuning LLMs
ChatGPT has been the talk of the town ever since its release in November 2022. The momentum has only accelerated with the release of the multi-modal GPT-4 and competing models such as Google's LaMDA and Meta AI's LLaMA. Enterprise adoption of generative models is also picking up via their integration with office productivity software, e.g., Microsoft 365 Copilot and Google Docs.
GPTs (Generative Pre-trained Transformers) belong to a class of foundation models (acting as decoders), which need to be fine-tuned to accomplish specific downstream NLP tasks.
ChatGPT [1] is thus the chatbot application of the GPT-3 LLM. It is based on InstructGPT, released by OpenAI in January 2022.
Large Language Models (LLMs) underlying ChatGPT are trained on public datasets, e.g., Wikipedia. Given the controversial copyright issues around training on public datasets, GPT-4 does not even declare the underlying datasets it is trained on. We have also started seeing domain-specific LLMs, e.g., BioGPT by Microsoft Research, which is fine-tuned for Biomedical Text Generation and Mining.
To realize the full potential of Generative AI for Enterprises, the LLMs need to be contextualized [2] with enterprise knowledge captured in terms of documents, wikis, business processes, etc.
This contextualization is achieved in most cases by fine-tuning a Large Language Model (LLM) with enterprise data, creating a domain-specific Small Language Model (SLM). Technically, this implies updating the weights of the last layer(s) of a trained neural network to reflect the enterprise data and task. This allows building upon what the base LLM has already learned.
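To make this concrete, below is a minimal sketch (in PyTorch / Hugging Face Transformers) of freezing a base model and fine-tuning only its final decoder block. The model name "gpt2", the choice of a single unfrozen block, and the learning rate are illustrative assumptions, not a prescription for any particular enterprise setup.

```python
# Minimal sketch: contextualize a pre-trained decoder-only LLM by freezing the
# base weights and fine-tuning only the last transformer block on enterprise data.
# (Model name "gpt2" and the single unfrozen block are illustrative assumptions.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any decoder-only base LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze every parameter ...
for param in model.parameters():
    param.requires_grad = False
# ... then unfreeze only the last decoder block, so training updates the
# weights of the final layer(s) while preserving what the base LLM has learned.
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
# The usual causal-LM training loop over tokenized enterprise documents follows.
```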
In this article, we focus on LLM fine-tuning and discuss how the fine-tuning process can be optimized by incrementally re-training the LLM/SLM, only when the interestingness of new information available exceeds a certain threshold.
LLMOps for LLM Fine-tuning
The figure below shows a reference LLMOps (MLOps for LLMs) [3] pipeline for fine-tuning LLMs.
In addition to the usual model monitoring blocks, LLMOps pipelines can be considered more complex than typical MLOps pipelines due to the continuous improvement feedback loop, i.e., Reinforcement Learning from Human Feedback (RLHF) [4]. LMFlow (link) is a good example of an emerging MLOps framework for LLMs.
LLM Re-training Strategy
In this section, we get to the crux of the problem:
How often should I re-train my fine-tuned LLM?
Should I re-train it for every new document that gets added to the knowledge base? Or should I wait until X new documents are available? What is the right X in this case? The default strategy is to re-train at regular intervals, e.g., monthly or quarterly. However, this is clearly not optimal, especially for use-cases that rely on up-to-date / (near) real-time information.
Re-training is a computationally expensive process, as each iteration of re-training might take many hours (even days) of processing on GPUs for a complex neural network architecture and a large training dataset. So triggering a re-training needs to be carefully evaluated, and should be a quantified decision.
We propose to trigger LLM re-training by analyzing the new information / documents available, quantifying the “interestingness” of the information contained in those documents.
Change Data Capture (CDC)
We take inspiration from the structured data (mostly SQL) world, especially Data Historization and Change Data Capture (CDC).
Data Historization is the process of keeping track of data changes over time. There are primarily two data historization methods: full (snapshot-based) historization, where the complete dataset is copied at every load, and delta historization, where only the changed records are stored.
CDC is a method of detecting and extracting new / updated records in the source and loading only this new data into the destination data store. Determining the “changed data” and ingesting just the delta optimises the ETL pipeline. There are primarily three approaches to perform CDC: log-based, query-based, and trigger-based.
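As an illustration, here is a minimal sketch of the query-based CDC approach using a persisted timestamp "watermark"; the table name "source_docs" and its columns are hypothetical, chosen only for the example.

```python
# Minimal sketch of query-based CDC: select only the rows created / updated
# since the last load, using a persisted "watermark" timestamp.
# (Table name "source_docs" and its column names are hypothetical.)
import sqlite3

def load_delta(conn: sqlite3.Connection, last_watermark: str) -> list[tuple]:
    """Return only the records that changed after the previous ETL run."""
    return conn.execute(
        "SELECT id, content, updated_at FROM source_docs WHERE updated_at > ? "
        "ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()

# Usage: after loading the delta, advance the watermark to the latest
# updated_at value seen, and persist it so the next run starts from there.
```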
In short, the key is to optimize incremental data loads by determining and focusing on the “new” records. In the case of LLMs, we need to extend the same idea to unstructured data: tables become documents and records become tokens.
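A minimal sketch of this unstructured-data analog is shown below: new or changed documents are detected via content hashes, so only the delta is considered as re-training candidates. The in-memory "seen_hashes" store is purely illustrative; in practice it would live in a database or object-store metadata.

```python
# Minimal sketch: CDC for unstructured documents via content hashing.
# Only documents whose hash is unseen (new or modified) flow into the
# re-training candidate pool. (The in-memory store is an assumption for
# illustration.)
import hashlib

seen_hashes: dict[str, str] = {}  # doc_id -> content hash

def changed_documents(docs: dict[str, str]) -> dict[str, str]:
    """Return only the documents that are new or whose content has changed."""
    delta = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            delta[doc_id] = text
            seen_hashes[doc_id] = digest
    return delta
```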
Quantifying information “interest level”
New information / insights come from newly published documents, articles, product releases, strategy documents, roadmaps, etc. To quantify novel insights that are related to a domain and sufficiently differentiated, we apply the methodology below.
Given a ‘domain of interest’, the first pre-processing step consists of creating the following two artifacts (a minimal representation sketch is given after the list):
· Ontology: captures the relevant domain keywords / concepts, and the relationships among them. We consider a basic ontology representation in the form of a hierarchy of keywords, classes, and concepts.
· Flowchart: represents the processes / workflows underlying known solutions in the area. We consider a flowchart representation as a sequence of ‘Process’ nodes, occasionally segregated into alternate execution paths by ‘Decision’ nodes.
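For concreteness, below is a minimal sketch of how an ontology and a flowchart could be represented in code; the field names and the flattened keyword set are simplifying assumptions made for the later scoring sketches, not a prescribed schema.

```python
# Minimal sketch of the two pre-processing artifacts per domain.
# (Field names and the flat term set are simplifying assumptions.)
from dataclasses import dataclass, field

@dataclass
class Ontology:
    # hierarchy of keywords -> classes -> concepts, flattened here into a
    # parent map plus the set of all terms, for easy matching against documents
    parent: dict[str, str] = field(default_factory=dict)   # term -> parent term
    terms: set[str] = field(default_factory=set)

@dataclass
class FlowNode:
    node_id: str
    kind: str          # "process" or "decision"
    text: str          # short description, matched against document n-grams

@dataclass
class Flowchart:
    nodes: list[FlowNode] = field(default_factory=list)
    edges: list[tuple[str, str]] = field(default_factory=list)  # node_id pairs
```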
Note that there can be more than one ontology / flowchart defined per domain. Each new published document is then analyzed to compute the following scores:
· Novelty (‘N’): The novelty of a document is computed in terms of the number of unique and novel combinations of adjacent words, referred to as n-grams, which appear in the document. Here, an n-gram is considered to be novel if its number of occurrences in the existing document corpora (on which the LLM has already been trained / fine-tuned) is below a user-specified threshold.
· Proximity (‘P’): captures the “closeness” of a document to the ontology. Here, “closeness” is measured in terms of the shared words (in n-grams) occurring in both the domain ontology and the document. The co-occurrences can be further weighted according to their frequency of occurrence in the document.
We have used n-grams for simplicity here. Similar logic applies when considering the distance (e.g., cosine similarity) between the document embedding and a domain embedding.
· Impact (‘I’): The impact is computed in terms of the number of nodes in a sub-flowchart of the domain flowchart where the new document provides an alternative (most likely, a more efficient alternative) to the process / method outlined in that sub-flowchart. A sample approach to compute the flowchart nodes affected by a new document consists of matching the document’s key n-grams / concepts against the text of the flowchart’s ‘Process’ and ‘Decision’ nodes.
The interest score ‘IN(d)’ of a document d, with respect to domain D (with corresponding ontology O and flowchart F), is computed as a function of the above scores, i.e.,
IN(d) = f(N(d), P(d), I(d))
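A minimal sketch of this per-document scoring is shown below, using word bigrams for simplicity; the corpus-frequency threshold, the ontology-term matching, and the equal-weight sum used for f(...) are all illustrative assumptions.

```python
# Minimal sketch: score a new document for Novelty (N), Proximity (P) and the
# combined interest score IN(d). Bigrams, the novelty threshold and the simple
# weighted sum standing in for f(...) are illustrative assumptions.
from collections import Counter

def ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(doc: str, corpus_counts: Counter, threshold: int = 2) -> int:
    """N(d): number of unique n-grams rarely seen in the already-trained corpora."""
    return len({g for g in ngrams(doc) if corpus_counts[g] < threshold})

def proximity(doc: str, ontology_terms: set[str]) -> float:
    """P(d): frequency-weighted overlap between document n-grams and ontology terms."""
    counts = Counter(" ".join(g) for g in ngrams(doc))
    return float(sum(c for term, c in counts.items() if term in ontology_terms))

def interest_score(doc: str, corpus_counts: Counter, ontology_terms: set[str],
                   impact: float, w=(1.0, 1.0, 1.0)) -> float:
    """IN(d) = f(N(d), P(d), I(d)); here f is a plain weighted sum."""
    n, p = novelty(doc, corpus_counts), proximity(doc, ontology_terms)
    return w[0] * n + w[1] * p + w[2] * impact
```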
The user interface then consists of aggregating a stream of such documents, sorting them, and considering them for re-training in descending order of their ‘IN’ scores.
Merge the ‘interesting parts’ of Documents
A single document is often not sufficiently interesting, i.e., it does not have a high enough ‘IN’ score. We thus consider approaches in this section to merge the interesting (and novel) insights embedded in multiple documents - to reach a sufficiently high 'IN' score to trigger re-training.
This further optimises the re-training process by considering only the consolidated (interesting) parts of the documents for re-training, rather than the full individual documents.
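A minimal sketch of the resulting trigger logic is shown below: scored documents are ranked by their ‘IN’ scores and re-training is fired only once the accumulated score of the pending batch exceeds a threshold. The threshold value and the retrain callback are hypothetical placeholders.

```python
# Minimal sketch: accumulate scored documents and trigger re-training only when
# the combined "interestingness" of the pending batch exceeds a threshold.
# (RETRAIN_THRESHOLD and the retrain() callback are hypothetical placeholders.)
RETRAIN_THRESHOLD = 10.0

def maybe_retrain(scored_docs: list[tuple[str, float]], retrain) -> list[tuple[str, float]]:
    """scored_docs: (document_text, IN score) pairs accumulated since the last run."""
    # consider documents in descending order of interest
    batch = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    if sum(score for _, score in batch) >= RETRAIN_THRESHOLD:
        retrain([doc for doc, _ in batch])   # fine-tune on the merged batch
        return []                            # reset the pending pool
    return batch                             # keep accumulating
```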
Let (O₁, F₁) and (O₂, F₂) refer to the (ontology, flowchart) pairs corresponding to domains D₁ and D₂ (which can be the same as D₁), respectively. The algorithm is illustrated in the figure below.
Input: Documents d₁ and d₂, from document streams corresponding to domains D₁ and D₂, respectively. This implies that both documents have already been processed to compute their ‘IN’ scores: IN(d₁) and IN(d₂).
· Composite Novelty (‘CN’): The ‘CN’ score (of documents d₁ and d₂) is computed in terms of the number of shared n-grams occurring in both d₁ and d₂. Let c(d₁) and c(d₂) denote the sets of unique (novel) n-grams in d₁ and d₂ respectively, such that N(d₁) = |c(d₁)| and N(d₂) = |c(d₂)|. Then,
CN(d₁, d₂) := |c(d₁) ∩ c(d₂)|
· Composite Proximity (‘CP’): While ‘CN’ captured the “closeness” between documents d₁ and d₂, the ‘CP’ score aims to capture the “closeness” of documents d₁ and d₂ w.r.t. ontologies O₁ and O₂:
CP(d₁, d₂) := P(d₁, O₁) + P(d₁, O₂) + P(d₂, O₁) + P(d₂, O₂)
Different scenarios are possible here, depending on how close each document is to each of the two ontologies.
· Composite Impact (‘CI’): measures the feasibility of combining the techniques described in documents d₁ and d₂ into an integrated solution / method. Assuming there exist sub-flowcharts F₁₁, F₁₂ of F₁ and F₂₁, F₂₂ of F₂ that can be optimized by the technology described in documents d₁ / d₂ (where Fᵢⱼ denotes the sub-flowchart of Fᵢ affected by document dⱼ), the ‘CI’ score is computed as follows:
where F₁* and F₂* denote the common (shared) sub-flowcharts of F₁ and F₂ respectively, affected by both documents d₁ and d₂. Note that in case F₁ is not affected by document d₂, or F₂ is not affected by document d₁, F₁₂ / F₂₁ will be an ‘empty’ flowchart (with 0 nodes). It is also possible that F₁₁ is a sub-flowchart of F₁₂ (or F₂₁ is a sub-flowchart of F₂₂), or vice-versa. The scenario where F₁₁ = F₁₂ (F₂₁ = F₂₂) implies that both F₁₁ and F₁₂ (F₂₁ and F₂₂) are alternatives to the same problem, affecting the same nodes in F₁ (F₂). This basically implies that documents d₁ and d₂ can be considered independently, leading to a low ‘CI’ score.
The composite interest score ‘CIN’ of documents d₁ and d₂, w.r.t. domains D₁ and D₂, is computed as a function of the above scores, i.e.,
CIN(d₁, d₂) = f(CN(d₁, d₂), CP(d₁, d₂), CI(d₁, d₂))
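A minimal sketch of the composite scoring is given below; CN and CP are computed from pre-extracted novel n-gram sets and pre-computed proximity values, CI is supplied externally (since it depends on flowchart matching), and the equal-weight sum standing in for f(...) is again an assumption.

```python
# Minimal sketch: composite scores for a pair of documents.
# CN counts shared novel n-grams; CP sums each document's proximity to both
# ontologies; CIN combines them (CI supplied externally, weights illustrative).

def composite_novelty(novel_ngrams_d1: set, novel_ngrams_d2: set) -> int:
    """CN(d1, d2) := |c(d1) ∩ c(d2)|"""
    return len(novel_ngrams_d1 & novel_ngrams_d2)

def composite_proximity(p_d1_o1: float, p_d1_o2: float,
                        p_d2_o1: float, p_d2_o2: float) -> float:
    """CP(d1, d2) := P(d1, O1) + P(d1, O2) + P(d2, O1) + P(d2, O2)"""
    return p_d1_o1 + p_d1_o2 + p_d2_o1 + p_d2_o2

def composite_interest(cn: int, cp: float, ci: float, w=(1.0, 1.0, 1.0)) -> float:
    """CIN(d1, d2) = f(CN, CP, CI); a plain weighted sum for illustration."""
    return w[0] * cn + w[1] * cp + w[2] * ci
```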
References