LLMs store data using Vector DB. Why and how?
馬Antony 裕杰
Cybersecurity SaaS developer. Web Isolation and Augmented Whitelisting. #Insurtech #decentralised_insurance
Traditionally, computing has been deterministic: given the same input, a program produces the same output every time, so results are consistent, repeatable, and provable. This is because the output strictly follows the programming logic, that is, the code written by software developers.
#LLMs leverage similarity search to process information. During the training phase, LLMs identify similarities among text tokens and create an extensive neural network to capture these patterns. These patterns are then represented in a high-dimensional vector space, allowing for a more nuanced understanding of textual data.
When processing user input, the input sentence is converted into a vector. The OpenAI software then searches for the nearest tokens in the multi-dimensional space, using the shortest distance as a measure of similarity. No developer writes separate logic for each case; every sentence follows the same vector search process.
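The "shortest distance" idea can be sketched in a few lines of plain Python. This is only an illustration, not how OpenAI implements it: the three-dimensional vectors and their labels are invented for the example, whereas real embeddings have hundreds of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; the values are invented for illustration.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

query = [0.88, 0.12, 0.02]  # pretend this is the embedded user input

# The nearest stored vector is the one with the smallest distance to the query.
nearest = min(vectors, key=lambda name: euclidean_distance(query, vectors[name]))
print(nearest)
```

Euclidean distance and cosine similarity are two common choices of measure. Either way, the same comparison runs for every input, with no input-specific branching.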
Below I show how a sentence is transformed into a vector. The model in this example (all-MiniLM-L6-v2) produces vectors with 384 dimensions. OpenAI has its own API endpoint that does similar processing (text-embedding-ada-002) with 1536 dimensions.
>>> from sentence_transformers import SentenceTransformer
>>> sentences = ["We are at the dawn of a new era...", "Each sentence is converted"]
>>> model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
>>> embeddings = model.encode(sentences)
>>> print(embeddings)
[[-9.31226462e-03 -5.73424622e-03  2.18255073e-02 -8.62214249e-03
  -1.94084086e-02  2.32371558e-02 -3.61166969e-02 -5.94260842e-02
   8.96405503e-02  8.60120635e-03 ... ] [ ... ]]
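Once sentences are vectors, a vector database is essentially a place to keep them alongside the original text and fetch the closest ones on demand. The sketch below is a toy in-memory version, assuming cosine similarity as the measure. `TinyVectorStore` and its methods are hypothetical names invented here, and the short hand-written vectors stand in for the output of `model.encode`. Real vector databases add indexing and persistence, which this omits.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        # Store the original text next to its embedding.
        self.items.append((text, list(vector)))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    def query(self, vector, k=1):
        """Return the k stored texts most similar to the query vector."""
        ranked = sorted(
            self.items,
            key=lambda item: self._cosine(vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
# Made-up 3-dimensional vectors standing in for real embeddings.
store.add("We are at the dawn of a new era...", [0.9, 0.1, 0.2])
store.add("Each sentence is converted", [0.1, 0.8, 0.3])

results = store.query([0.2, 0.9, 0.2], k=1)
print(results)
```

This version scans every stored vector on each query, which is fine for a handful of sentences but does not scale to large collections.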
Closing Thoughts:
For me, the moment of awakening comes as we transition into non-deterministic computing. When AI tools can no longer provide the same response every time, are we prepared to accept the implications, and what kind of risk controls will be applicable?
Don't miss out on future insights and discussions – subscribe to Oracle DBA's AI, LLM journey to stay up-to-date on the latest trends.
References:
OpenAI embedding doc (https://openai.com/blog/new-and-improved-embedding-model)
Pervasive Technology Institute at Indiana University "Introduction of document similarity" video (https://www.youtube.com/watch?v=MvG4dPplrRo)