What is a vector search?
馬Antony 裕杰
Cybersecurity SaaS developer. Web Isolation and Augmented Whitelisting. #Insurtech #decentralised_insurance
The purpose of this newsletter is to help enterprise tech people transition into the AI era. It is also my learning diary. I was involved in a few abstract discussions on using generative AI and its impact. Sometimes these discussions were so abstract that they were not rooted in facts or in how LLMs work. Comparing and contrasting LLMs with RDBMS is one way to adjust our yardsticks and get ready to integrate AI tools into enterprise computing.
Vector search focuses on measuring the similarity between data points based on their vector representations. Let us look at how these vector representations are created in an LLM.
In an RDBMS, we are interested in the exact values of our input. An egg from ABC farm selling at $4 on 1 Apr 2023 is different from a chicken selling at $40. Data input is strongly guarded, and data quality is validated before anything is written into the database.
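To make the RDBMS side concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and the CHECK rule are my own assumptions for illustration, and the farm and date on the chicken row are made up. It shows data being validated before it is written, and a lookup that matches exact values only.

```python
import sqlite3

# Hypothetical schema: names and the CHECK rule are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_list (
        item      TEXT NOT NULL,
        farm      TEXT NOT NULL,
        price_usd REAL NOT NULL CHECK (price_usd > 0),
        sold_on   TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO price_list VALUES ('egg', 'ABC farm', 4.0, '2023-04-01')")
# Farm and date for the chicken row are invented for the example.
conn.execute("INSERT INTO price_list VALUES ('chicken', 'ABC farm', 40.0, '2023-04-01')")

# Data quality is enforced before writing: a negative price is rejected.
try:
    conn.execute("INSERT INTO price_list VALUES ('egg', 'ABC farm', -1.0, '2023-04-01')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Lookups are exact: 'egg' matches only rows whose item is exactly 'egg'.
exact_rows = conn.execute(
    "SELECT item, price_usd FROM price_list WHERE item = ?", ("egg",)
).fetchall()

# A near miss such as 'eggs' matches nothing; there is no notion of "similar".
near_miss_rows = conn.execute(
    "SELECT item FROM price_list WHERE item = ?", ("eggs",)
).fetchall()
```

The database has no idea that "eggs" is close to "egg". That gap is what vector search is meant to fill.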
In a language model, the focus is on word relationships. For example, the model may discover that an egg is a sellable item and has a lower price than a chicken. When building the language model, you cannot validate each text or each value. In a large model like GPT-3, it is not possible to curate all the content. What matters is the relationship between words. The model does not know the meaning of egg or chicken. But it can discover that if you are making cakes, you are more likely referring to an egg.
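A rough way to see how a relationship can be discovered without knowing any meaning is to count which words appear together. The sketch below is not how an LLM is trained (it uses plain co-occurrence counts instead of a neural network), and the toy corpus is invented for illustration.

```python
from collections import Counter

# Toy corpus invented for this illustration.
corpus = [
    "beat one egg into the cake batter",
    "the cake recipe needs an egg and flour",
    "crack an egg for the cake",
    "roast the chicken for dinner",
    "the chicken dinner needs salt",
]


def cooccurrence(target, sentences):
    """Count the words that appear in the same sentence as `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts


# Which words keep company with "cake"?
cake_neighbours = cooccurrence("cake", corpus)
egg_count = cake_neighbours["egg"]
chicken_count = cake_neighbours["chicken"]
```

In this tiny corpus, "egg" shares a sentence with "cake" three times and "chicken" never does. Nothing in the code knows what an egg is; the association comes purely from the text.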
Neural networks are employed to uncover non-linear relationships among words in complex, unstructured text. This process is commonly referred to as training. fastText is a widely used open-source library, with Python bindings, that lets users build a model from text input. It's important to note that training is typically non-deterministic: the starting weights are usually chosen at random, so even with the same data set, each training session may produce a slightly different model.
This background is important for understanding vector search, because the question-and-answer process in ChatGPT is about converting our input into a vector and using this vector to ask the trained model to find similar words. A search vector is a representation of the context and the input question in the multi-dimensional vector space created by an LLM.
There are many ways to find similar or close vectors in a multi-dimensional space. The above diagram, from Different types of Distances used in Machine Learning Explained! (note1), shows the major types of distance calculation methods.
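As a small illustration, here are three commonly used distance measures in plain Python. I am assuming these are representative; they are not necessarily the exact set shown in the diagram, and the two sample points are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two made-up points in a 2-dimensional space.
p = [1.0, 2.0]
q = [4.0, 6.0]

d_euclidean = euclidean(p, q)
d_manhattan = manhattan(p, q)
d_cosine = cosine_distance(p, q)
```

The same pair of points is 5 units apart by Euclidean distance and 7 by Manhattan distance, yet has a cosine distance close to zero because the two vectors point in almost the same direction. Which points count as "close" depends on the measure you pick.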
OpenAI's models predominantly use cosine similarity for vector search. Cosine similarity is a popular choice because it measures the similarity between two vectors by the cosine of the angle between them, so it depends on their direction rather than their length. This makes it particularly effective in high-dimensional spaces.
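Putting the pieces together, a vector search can be sketched as: score every stored vector against the query vector by cosine similarity, then return the best matches. The 3-dimensional vectors below are hand-made stand-ins (real embeddings have hundreds or thousands of dimensions and come from a trained model), and the `search` helper is a hypothetical name of my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors, invented for this sketch; not real embeddings.
index = {
    "egg": [0.9, 0.1, 0.3],
    "flour": [0.8, 0.2, 0.4],
    "chicken": [0.1, 0.9, 0.2],
}


def search(query, vectors, top_k=2):
    """Return the top_k (word, score) pairs, most similar first."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Pretend this is the vector for a question about baking a cake.
query = [1.0, 0.0, 0.3]
results = search(query, index)

# Scaling the query changes its length but not its direction,
# so the ranking by cosine similarity stays the same.
scaled_results = search([10.0 * x for x in query], index)

top_words = [word for word, _ in results]
scaled_top_words = [word for word, _ in scaled_results]
```

This brute-force loop compares the query with every stored vector, which is fine for a handful of items. It is meant only to show the idea, not how a production vector database is built.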
I wasn't aware of the importance of math when I was studying matrix manipulations 30 years ago!