Re-training Strategy for Fine-tuned LLMs
Debmalya Biswas
AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Change Data Capture (CDC) for Re-training LLMs / SLMs
Introduction: Fine-tuning LLMs
ChatGPT has been the talk of the town ever since its release in November 2022. The momentum has only accelerated with the release of the multi-modal GPT-4 and competing models such as Google's LaMDA and Meta AI's LLaMA. Enterprise adoption of generative models is also picking up via their integration with office productivity software, e.g., Microsoft 365 Copilot and Google Docs.
GPTs (Generative Pre-trained Transformers) belong to a class of foundation models (acting as decoders), which need to be fine-tuned to accomplish specific downstream NLP tasks.
ChatGPT [1] is thus the chatbot application of the GPT-3 LLM. It is based on InstructGPT, released by OpenAI in January 2022.
Large Language Models (LLMs) underlying ChatGPT are trained on public datasets, e.g., Wikipedia. Given the controversial copyright issues around training on public datasets, GPT-4 does not even declare the underlying datasets it is trained on. We have also started seeing domain-specific LLMs, e.g., BioGPT by Microsoft Research, which is fine-tuned for Biomedical Text Generation and Mining.
To realize the full potential of Generative AI for Enterprises, the LLMs need to be contextualized [2] with enterprise knowledge captured in terms of documents, wikis, business processes, etc.
This contextualization is achieved in most cases by fine-tuning a Large Language Model (LLM) with enterprise data, creating a domain-specific Small Language Model (SLM). Technically, this implies updating the weights of the last layer(s) of a trained neural network to reflect the enterprise data and task. This allows building upon what the base LLM has already learned.
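To make this concrete, below is a minimal sketch (in PyTorch / Hugging Face Transformers) of freezing a base model and fine-tuning only its final decoder block. The model name "gpt2", the choice of a single unfrozen block, and the learning rate are illustrative assumptions, not a prescription for any particular enterprise setup.

```python
# Minimal sketch: contextualize a pre-trained decoder-only LLM by freezing the
# base weights and fine-tuning only the last transformer block on enterprise data.
# (Model name "gpt2" and the single unfrozen block are illustrative assumptions.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any decoder-only base LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze every parameter ...
for param in model.parameters():
    param.requires_grad = False
# ... then unfreeze only the last decoder block, so training updates the
# weights of the final layer(s) while preserving what the base LLM has learned.
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
# The usual causal-LM training loop over tokenized enterprise documents follows.
```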
In this article, we focus on LLM fine-tuning and discuss how the fine-tuning process can be optimized by incrementally re-training the LLM/SLM, only when the interestingness of new information available exceeds a certain threshold.
LLMOps for LLM Fine-tuning
The figure below shows a reference LLMOps (MLOps for LLMs) [3] pipeline for fine-tuning LLMs.
In addition to the usual model monitoring blocks, LLMOps pipelines can be considered more complex than typical MLOps pipelines due to the continuous improvement feedback loop, i.e., Reinforcement Learning from Human Feedback (RLHF) [4]. LMFlow (link) is a good example of an emerging MLOps framework for LLMs.
LLM Re-training Strategy
In this section, we get to the crux of the problem:
How often should I re-train my fine-tuned LLM?
Should I re-train it for every new document that gets added to the knowledge base? Or should I wait until X new documents are available? What is the right X in this case? The default strategy is to re-train at regular intervals, e.g., monthly or quarterly. However, this is clearly not optimal, especially for use-cases that rely on up-to-date / (near) real-time information.
Re-training is a computationally expensive process, as each iteration of re-training might take many hours (even days) of processing on GPUs for a complex neural network architecture and a large training dataset. So triggering a re-training needs to be carefully evaluated, and should be a quantified decision.
We propose to trigger LLM re-training by analyzing the new information / documents available, quantifying the “interestingness” of the information contained in those documents.
Change Data Capture (CDC)
We take inspiration from the structured data (mostly SQL) world, especially Data Historization and Change Data Capture (CDC).
Data Historization is the process of keeping track of data changes over time. There are primarily two data historization methods: full (snapshot-based) historization, where the complete dataset is copied at every load, and delta historization, where only the changed records are stored.
CDC is a method of detecting and extracting new / updated records in the source and loading only this new data into the destination data store. Determining the “changed data” and ingesting just the delta optimises the ETL pipeline. There are primarily three approaches to perform CDC: log-based, query-based, and trigger-based.
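As an illustration, here is a minimal sketch of the query-based CDC approach using a persisted timestamp "watermark"; the table name "source_docs" and its columns are hypothetical, chosen only for the example.

```python
# Minimal sketch of query-based CDC: select only the rows created / updated
# since the last load, using a persisted "watermark" timestamp.
# (Table name "source_docs" and its column names are hypothetical.)
import sqlite3

def load_delta(conn: sqlite3.Connection, last_watermark: str) -> list[tuple]:
    """Return only the records that changed after the previous ETL run."""
    return conn.execute(
        "SELECT id, content, updated_at FROM source_docs WHERE updated_at > ? "
        "ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()

# Usage: after loading the delta, advance the watermark to the latest
# updated_at value seen, and persist it so the next run starts from there.
```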
In short, the key is to optimize incremental data loads by determining and focusing on the “new” records. In the case of LLMs, we need to extend the same idea to unstructured data: tables become documents and records become tokens.
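A minimal sketch of this unstructured-data analog is shown below: new or changed documents are detected via content hashes, so only the delta is considered as re-training candidates. The in-memory "seen_hashes" store is purely illustrative; in practice it would live in a database or object-store metadata.

```python
# Minimal sketch: CDC for unstructured documents via content hashing.
# Only documents whose hash is unseen (new or modified) flow into the
# re-training candidate pool. (The in-memory store is an assumption for
# illustration.)
import hashlib

seen_hashes: dict[str, str] = {}  # doc_id -> content hash

def changed_documents(docs: dict[str, str]) -> dict[str, str]:
    """Return only the documents that are new or whose content has changed."""
    delta = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            delta[doc_id] = text
            seen_hashes[doc_id] = digest
    return delta
```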
Quantifying information “interest level”
New information / insights come from newly published documents, articles, product releases, strategy documents, roadmaps, etc. To quantify novel insights that are related to a domain and sufficiently differentiated, we apply the methodology below.
Given a ‘domain of interest’, the first pre-processing step consists of creating the following two artifacts (a minimal representation sketch is given after the list):
· Ontology: captures the relevant domain keywords / concepts, and the relationships among them. We consider a basic ontology representation in the form of a hierarchy of keywords, classes, and concepts.
· Flowchart: represents the processes / workflows underlying known solutions in the area. We consider a flowchart representation as a sequence of ‘Process’ nodes, occasionally segregated into alternate execution paths by ‘Decision’ nodes.
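For concreteness, below is a minimal sketch of how an ontology and a flowchart could be represented in code; the field names and the flattened keyword set are simplifying assumptions made for the later scoring sketches, not a prescribed schema.

```python
# Minimal sketch of the two pre-processing artifacts per domain.
# (Field names and the flat term set are simplifying assumptions.)
from dataclasses import dataclass, field

@dataclass
class Ontology:
    # hierarchy of keywords -> classes -> concepts, flattened here into a
    # parent map plus the set of all terms, for easy matching against documents
    parent: dict[str, str] = field(default_factory=dict)   # term -> parent term
    terms: set[str] = field(default_factory=set)

@dataclass
class FlowNode:
    node_id: str
    kind: str          # "process" or "decision"
    text: str          # short description, matched against document n-grams

@dataclass
class Flowchart:
    nodes: list[FlowNode] = field(default_factory=list)
    edges: list[tuple[str, str]] = field(default_factory=list)  # node_id pairs
```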
Note that there can be more than one ontology / flowchart defined per domain. Each new published document is then analyzed to compute the following scores:
· Novelty (‘N’): The novelty of a document is computed in terms of the number of unique and novel combinations of adjacent words, referred to as n-grams, which appear in the document. Here, an n-gram is considered to be novel if its number of occurrences in the existing document corpora (on which the LLM has already been trained / fine-tuned) is below a user-specified threshold.
· Proximity (‘P’): captures the “closeness” of a document to the ontology. Here, “closeness” is measured in terms of the shared words (in n-grams) occurring in both the domain ontology and the document. The co-occurrences can be further weighted according to their frequency of occurrence in the document.
We have used n-grams for simplicity here. Similar logic applies when considering the distance (e.g., cosine similarity) between the document embedding and a domain embedding.
· Impact (‘I’): The impact is computed in terms of the number of nodes in a sub-flowchart of the domain flowchart where the new document provides an alternative (most likely, a more efficient alternative) to the process / method outlined in that sub-flowchart. A sample approach to compute the flowchart nodes affected by a new document consists of matching the document’s key n-grams / concepts against the text of the flowchart’s ‘Process’ and ‘Decision’ nodes.
The interest score ‘IN(d)’ of a document d, with respect to domain D (with corresponding ontology O and flowchart F), is computed as a function of the above scores, i.e.,
IN(d) = f(N(d), P(d), I(d))
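A minimal sketch of this per-document scoring is shown below, using word bigrams for simplicity; the corpus-frequency threshold, the ontology-term matching, and the equal-weight sum used for f(...) are all illustrative assumptions.

```python
# Minimal sketch: score a new document for Novelty (N), Proximity (P) and the
# combined interest score IN(d). Bigrams, the novelty threshold and the simple
# weighted sum standing in for f(...) are illustrative assumptions.
from collections import Counter

def ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(doc: str, corpus_counts: Counter, threshold: int = 2) -> int:
    """N(d): number of unique n-grams rarely seen in the already-trained corpora."""
    return len({g for g in ngrams(doc) if corpus_counts[g] < threshold})

def proximity(doc: str, ontology_terms: set[str]) -> float:
    """P(d): frequency-weighted overlap between document n-grams and ontology terms."""
    counts = Counter(" ".join(g) for g in ngrams(doc))
    return float(sum(c for term, c in counts.items() if term in ontology_terms))

def interest_score(doc: str, corpus_counts: Counter, ontology_terms: set[str],
                   impact: float, w=(1.0, 1.0, 1.0)) -> float:
    """IN(d) = f(N(d), P(d), I(d)); here f is a plain weighted sum."""
    n, p = novelty(doc, corpus_counts), proximity(doc, ontology_terms)
    return w[0] * n + w[1] * p + w[2] * impact
```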
The user interface then consists of aggregating a stream of such documents, sorting them, and considering them for re-training in descending order of their ‘IN’ scores.
Merge the ‘interesting parts’ of Documents
A single document is often not sufficiently interesting, i.e., it does not have a high enough ‘IN’ score. We thus consider approaches in this section to merge the interesting (and novel) insights embedded in multiple documents - to reach a sufficiently high 'IN' score to trigger re-training.
This further optimises the re-training process by considering only the consolidated (interesting) parts of the documents for re-training, rather than the full individual documents.
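A minimal sketch of the resulting trigger logic is shown below: scored documents are ranked by their ‘IN’ scores and re-training is fired only once the accumulated score of the pending batch exceeds a threshold. The threshold value and the retrain callback are hypothetical placeholders.

```python
# Minimal sketch: accumulate scored documents and trigger re-training only when
# the combined "interestingness" of the pending batch exceeds a threshold.
# (RETRAIN_THRESHOLD and the retrain() callback are hypothetical placeholders.)
RETRAIN_THRESHOLD = 10.0

def maybe_retrain(scored_docs: list[tuple[str, float]], retrain) -> list[tuple[str, float]]:
    """scored_docs: (document_text, IN score) pairs accumulated since the last run."""
    # consider documents in descending order of interest
    batch = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    if sum(score for _, score in batch) >= RETRAIN_THRESHOLD:
        retrain([doc for doc, _ in batch])   # fine-tune on the merged batch
        return []                            # reset the pending pool
    return batch                             # keep accumulating
```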
Let (O₁, F₁) and (O₂, F₂) refer to the (ontology, flowchart) pairs corresponding to domains D₁ and D₂ (which can be the same as D₁), respectively. The algorithm is illustrated in the figure below.
Input: Documents d₁ and d₂, from document streams corresponding to domains D₁ and D₂, respectively. This implies that both documents have already been processed to compute their ‘IN’ scores: IN(d₁) and IN(d₂).
· Composite Novelty (‘CN’): The ‘CN’ score (of documents d₁ and d₂) is computed in terms of the number of shared n-grams occurring in both d₁ and d₂. Let c(d₁) and c(d₂) denote the sets of unique (novel) n-grams in d₁ and d₂ respectively, such that N(d₁) = |c(d₁)| and N(d₂) = |c(d₂)|. Then,
CN(d₁, d₂) := |c(d₁) ∩ c(d₂)|
· Composite Proximity (‘CP’): While ‘CN’ captured the “closeness” between documents d₁ and d₂, the ‘CP’ score aims to capture the “closeness” of documents d₁ and d₂ w.r.t. ontologies O₁ and O₂:
CP(d₁, d₂) := P(d₁, O₁) + P(d₁, O₂) + P(d₂, O₁) + P(d₂, O₂)
Different scenarios are possible here, depending on how close each document is to each of the two ontologies.
· Composite Impact (‘CI’): measures the feasibility of combining the techniques described in documents d₁ and d₂ into an integrated solution / method. Assuming there exist sub-flowcharts F₁₁, F₁₂ of F₁ and F₂₁, F₂₂ of F₂ that can be optimized by the technology described in documents d₁ / d₂ (where Fᵢⱼ denotes the sub-flowchart of Fᵢ affected by document dⱼ), the ‘CI’ score is computed as follows:
where F₁* and F₂* denote the common (shared) sub-flowcharts of F₁ and F₂ respectively, affected by both documents d₁ and d₂. Note that in case F₁ is not affected by document d₂, or F₂ is not affected by document d₁, F₁₂ / F₂₁ will be an ‘empty’ flowchart (with 0 nodes). It is also possible that F₁₁ is a sub-flowchart of F₁₂ (or F₂₁ is a sub-flowchart of F₂₂), or vice-versa. The scenario where F₁₁ = F₁₂ (F₂₁ = F₂₂) implies that both F₁₁ and F₁₂ (F₂₁ and F₂₂) are alternatives to the same problem, affecting the same nodes in F₁ (F₂). This basically implies that documents d₁ and d₂ can be considered independently, leading to a low ‘CI’ score.
The composite interest score ‘CIN’ of documents d₁ and d₂, w.r.t. domains D₁ and D₂, is computed as a function of the above scores, i.e.,
CIN(d₁, d₂) = f(CN(d₁, d₂), CP(d₁, d₂), CI(d₁, d₂))
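A minimal sketch of the composite scoring is given below; CN and CP are computed from pre-extracted novel n-gram sets and pre-computed proximity values, CI is supplied externally (since it depends on flowchart matching), and the equal-weight sum standing in for f(...) is again an assumption.

```python
# Minimal sketch: composite scores for a pair of documents.
# CN counts shared novel n-grams; CP sums each document's proximity to both
# ontologies; CIN combines them (CI supplied externally, weights illustrative).

def composite_novelty(novel_ngrams_d1: set, novel_ngrams_d2: set) -> int:
    """CN(d1, d2) := |c(d1) ∩ c(d2)|"""
    return len(novel_ngrams_d1 & novel_ngrams_d2)

def composite_proximity(p_d1_o1: float, p_d1_o2: float,
                        p_d2_o1: float, p_d2_o2: float) -> float:
    """CP(d1, d2) := P(d1, O1) + P(d1, O2) + P(d2, O1) + P(d2, O2)"""
    return p_d1_o1 + p_d1_o2 + p_d2_o1 + p_d2_o2

def composite_interest(cn: int, cp: float, ci: float, w=(1.0, 1.0, 1.0)) -> float:
    """CIN(d1, d2) = f(CN, CP, CI); a plain weighted sum for illustration."""
    return w[0] * cn + w[1] * cp + w[2] * ci
```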
References