Part 4: Advanced RAG – Finetune Embedding Model

Embeddings play a very important role in a RAG pipeline. During the retrieval phase, we compare the embedding of the user query with the embeddings of the chunks and fetch the most similar chunks. So, the better the embeddings represent the chunks, the better the retrieval.
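To make the retrieval step concrete, here is a minimal sketch of similarity-based lookup. It assumes cosine similarity as the measure (the article only says "similarity"), and the chunk ids and 3-dimensional vectors are made up for illustration; real embeddings come from the embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, chunk_embeddings, top_k=2):
    """Return the ids of the top_k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), chunk_id)
        for chunk_id, emb in chunk_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy embeddings (hypothetical values, for illustration only).
chunk_embeddings = {
    "chunk-network": [0.9, 0.1, 0.0],
    "chunk-storage": [0.1, 0.9, 0.0],
    "chunk-license": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

top_chunks = retrieve(query_embedding, chunk_embeddings, top_k=2)
print(top_chunks)
```

If the embedding model places a domain-specific query far from the chunk that answers it, that chunk never reaches the top_k list, which is the problem finetuning tries to reduce.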


Many state-of-the-art embedding models are already available. So why do we need to finetune an embedding model?

  • These models are trained on general-purpose text data, so they are a good starting point for RAG applications.
  • However, they may not be good enough when the RAG application is domain specific.

So, if we further train a pretrained embedding model on our domain-specific data, retrieval should improve.

High-level steps involved in finetuning an embedding model:

Generate a synthetic query dataset using an LLM. In this step, each context (chunk) is passed to an LLM, which is asked to write a query for that context. Each (query, context) pair is then stored in the dataset.

[Figure: Synthetic dataset generation]
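The generation step can be sketched in a few lines. This is only an illustration of the idea, not the library's implementation: ask_llm is a hypothetical callable (prompt in, generated question out), the prompt wording is made up, and the key names queries, corpus and relevant_docs are assumed from the three parts of the JSON file described in this article.

```python
import uuid

# Hypothetical prompt; a real prompt would be tuned for the domain.
PROMPT_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Write one question that can be answered using only the context above."
)


def generate_query_dataset(chunks, ask_llm):
    """Build queries, corpus and relevant_docs from a list of text chunks.

    ask_llm is a hypothetical callable: prompt string in, question string out.
    """
    queries, corpus, relevant_docs = {}, {}, {}
    for chunk in chunks:
        doc_id = str(uuid.uuid4())
        corpus[doc_id] = chunk

        question = ask_llm(PROMPT_TEMPLATE.format(context=chunk)).strip()
        query_id = str(uuid.uuid4())
        queries[query_id] = question

        # Each generated query points back to the chunk it was written for.
        relevant_docs[query_id] = [doc_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


def fake_llm(prompt):
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "What does the context describe?"


dataset = generate_query_dataset(["chunk one text", "chunk two text"], fake_llm)
print(len(dataset["queries"]), "queries generated")
```

In practice fake_llm would be replaced by a call to whichever LLM is available, and several questions could be requested per chunk.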

Then, use this synthetic dataset to directly finetune an embedding model.

In my code example, we passed our VMware Cloud Foundation (VCF) documents to generate_qa_embedding_pairs to create the synthetic dataset (this function makes the process easy). The function generates a JSON file with Queries, Corpus and Relevant docs.

Here is one example pair. The query ID is linked to an entry in Relevant docs, and the relevant doc ID is linked to an entry in the Corpus.

Query:

be4d353f-ebec-4d13-932f-1c83173a8db2

"How should the network settings, including IP address, subnet mask, default gateway, DNS servers, domain name, search paths, and NTP servers, be configured for the VMware Cloud Builder appliance deployment according to the VMware Cloud Foundation Deployment Guide?"

Relevant docs:

be4d353f-ebec-4d13-932f-1c83173a8db2

0

"391bd3cb-81c2-4ba6-8cc3-aadeaf779b5f"

Corpus:

391bd3cb-81c2-4ba6-8cc3-aadeaf779b5f

"3On the Select creation type dialog box, select Deploy a virtual machine from an OVF or OVA \nfile and click Next .\n4On the Select OVF and VMDK files page, enter a name for the virtual machine, select the \nVMware Cloud Builder .ova file, and click Next .\n5On the Select Storage page, select a datastore and click Next .\n6On the License agreements dialog box, click I agree and then click Next .\n7On the Select networks dialog box, enter the following values and click Next .\nSetting Value\nNetwork mappings your_portgroup\nDisk provisioning Thin\nPower on automatically Selected\n \n8On the Additional settings dialog box, expand Application , enter the following values, and \nclick Next .\nSetting Details\nAdmin Username Accept the default admin user name, admin .\nAdmin Password/Admin Password \nconfirmThe admin password must be a minimum of 8 characters and include at \nleast one uppercase, one lowercase, one digit, and one special character. \nSupported special characters:\n@ ! # $ % ? ^\nRoot password/Root password \nconfirmThe root password must be a minimum of 8 characters and include at \nleast one uppercase, one lowercase, one digit, and one special character. \nSupported special characters:\n@ ! # $ % ? ^\nHostname Enter the hostname for the VMware Cloud Builder appliance .\nNetwork 1 IP Address Enter the IP address for the VMware Cloud Builder appliance .\nNetwork 1 Subnet Mask Enter the subnet mask for the VMware Cloud Builder appliance .\nDefault Gateway Enter the default gateway for the VMware Cloud Builder appliance .\nDNS Servers Enter the IP address of the primary and secondary DNS servers (comma \nseparated). Do not specify more than two servers.\nDNS Domain Name Enter the DNS domain name. For example, vsphere.local .\nDNS Domain Search Paths Enter the DNS domain search path(s). Use a comma if entering multiple \nsearch paths. For example vsphere.local, sfo.vsphere.local .\nNTP Servers Enter the NTP server(s). 
Use a comma if entering multiple NTP servers. NTP \nservers can be entered using FQDNs or IP addresses.\n VMware Cloud Foundation Deployment Guide\nVMware, Inc. 8"
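The linkage in the example above (query ID to relevant doc ID to corpus text) can be followed with plain dictionary lookups. The sketch below uses the IDs from the example with shortened text; the key names queries, relevant_docs and corpus are assumed from the labels above, and training_pairs is a hypothetical helper, not part of any library.

```python
# Same IDs as the example above; query and corpus text shortened for readability.
dataset = {
    "queries": {
        "be4d353f-ebec-4d13-932f-1c83173a8db2":
            "How should the network settings be configured for the "
            "VMware Cloud Builder appliance deployment?",
    },
    "relevant_docs": {
        "be4d353f-ebec-4d13-932f-1c83173a8db2": [
            "391bd3cb-81c2-4ba6-8cc3-aadeaf779b5f",
        ],
    },
    "corpus": {
        "391bd3cb-81c2-4ba6-8cc3-aadeaf779b5f":
            "3On the Select creation type dialog box, select Deploy a "
            "virtual machine from an OVF or OVA file and click Next.",
    },
}


def training_pairs(dataset):
    """Yield (query_text, context_text) pairs by following the ID links."""
    for query_id, query_text in dataset["queries"].items():
        for doc_id in dataset["relevant_docs"][query_id]:
            yield query_text, dataset["corpus"][doc_id]


pairs = list(training_pairs(dataset))
for query_text, context_text in pairs:
    print("QUERY:  ", query_text)
    print("CONTEXT:", context_text[:60])
```

These (query, context) pairs are the positive examples the finetuning step learns from.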


This generated dataset is then used to finetune the bge-small-en embedding model (using the SentenceTransformersFinetuneEngine).
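To give some intuition for what finetuning on (query, context) pairs optimizes, here is a sketch of an in-batch negatives ranking loss in plain Python. This is an assumption about the kind of objective used, not a description of the engine's internals: a common choice for such pairs is to treat each query's own context as the positive and the other contexts in the batch as negatives. The scale value of 20.0 and the 2-dimensional vectors are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def in_batch_negatives_loss(query_embs, context_embs, scale=20.0):
    """Mean cross-entropy where query i should rank context i above all other contexts."""
    queries = [normalize(q) for q in query_embs]
    contexts = [normalize(c) for c in context_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * dot(q, c) for c in contexts]
        # Numerically stable log-sum-exp over all contexts in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Queries aligned with their own contexts give a low loss ...
aligned = in_batch_negatives_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
# ... while queries that sit closer to the wrong context give a high loss.
swapped = in_batch_negatives_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
print(aligned < swapped)
```

Training lowers this loss by moving each query embedding closer to its own context and away from the others, which is why domain-specific pairs help domain-specific retrieval.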


Hope this is useful.

