LLMs store data using Vector DB. Why and how?
We are at the dawn of a new era...

Traditionally, computing has been deterministic: outcomes are consistent, repeatable, and provable because the output strictly follows the program's logic (code written by software developers).

LLMs, by contrast, process information through similarity search. During the training phase, an LLM learns similarities among text tokens and encodes these patterns in an extensive neural network. The patterns are represented in a high-dimensional vector space, allowing a more nuanced understanding of textual data.

When processing user input, the input sentence is converted into a vector. OpenAI's software then searches for the nearest tokens in that multi-dimensional space, using the shortest distance as the measure of similarity. No developer writes different logic for each case; every sentence goes through the same vector search process.
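This "shortest distance" search can be sketched with plain NumPy. The vocabulary and 3-dimensional vectors below are invented toy values for illustration only; real model embeddings have hundreds of dimensions:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up values, not real model output).
vocab = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.1, 0.9, 0.3]),
}

def nearest(query: np.ndarray) -> str:
    """Return the vocabulary entry with the shortest Euclidean distance."""
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - query))

query = np.array([0.9, 0.1, 0.02])  # pretend this vector encodes "kitten"
print(nearest(query))  # → cat
```

The same query logic applies regardless of the input: there is no per-sentence branching, only a distance computation over stored vectors.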

Below I show how a sentence is transformed into a vector. The model in this example (all-MiniLM-L6-v2) produces 384-dimensional embeddings. OpenAI has its own API endpoint that does similar processing (text-embedding-ada-002) with 1536 dimensions.

>>> from sentence_transformers import SentenceTransformer
>>> sentences = ["We are at the dawn of a new era...", "Each sentence is converted"]
>>>
>>> model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
>>> embeddings = model.encode(sentences)
>>> print(embeddings)
[[-9.31226462e-03 -5.73424622e-03  2.18255073e-02 -8.62214249e-03
  -1.94084086e-02  2.32371558e-02 -3.61166969e-02 -5.94260842e-02
   8.96405503e-02  8.60120635e-03 .... .... ] [... .... ]]
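Once sentences are embedded, a vector database stores each vector alongside its text and answers nearest-neighbor queries. The minimal in-memory store below is a sketch of that idea using cosine similarity; the class and method names are my own invention, not any particular product's API, and the 3-dimensional vectors are placeholder values:

```python
import numpy as np

class ToyVectorStore:
    """Minimal in-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text: str, vector) -> None:
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once at insert
        self.texts.append(text)

    def query(self, vector, k: int = 1):
        """Return the k stored texts most similar to the query vector."""
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.array(self.vectors) @ q           # cosine similarities
        top = np.argsort(sims)[::-1][:k]            # highest similarity first
        return [self.texts[i] for i in top]

store = ToyVectorStore()
store.add("We are at the dawn of a new era...", [0.9, 0.1, 0.0])
store.add("Each sentence is converted",         [0.1, 0.9, 0.2])
print(store.query([0.8, 0.2, 0.1]))  # nearest stored sentence
```

Production vector databases add indexing structures (for example, approximate nearest-neighbor indexes) so that the search scales beyond a brute-force scan, but the core operation is the same similarity lookup.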


Closing Thoughts:

For me, the moment of awakening comes as we transition into non-deterministic computing. When AI tools can no longer give the same response every time, are we prepared to accept the implications, and what kinds of risk controls will apply?

Don't miss out on future insights and discussions – subscribe to Oracle DBA's AI, LLM journey to stay up-to-date on the latest trends.

References:

OpenAI embedding documentation (https://openai.com/blog/new-and-improved-embedding-model)

Pervasive Technology Institute at Indiana University "Introduction of document similarity" video (https://www.youtube.com/watch?v=MvG4dPplrRo)
