The AI ToolBox #2: Vector Search in Machine Learning and AI
Paul-Benjamin Ramírez
Co-Founder and CTO @ Automi | Sales and Project Manager | Engineering | Patent-Pending Inventor | Adjunct Fellow UNSW
In the AI Toolbox series, we aim to provide you with key insights into important tools for building AI systems. In this article, we discuss Vector Search, an essential tool for helping machines understand meaning and context in text and images. With broad applications in Computer Vision, Fraud Detection, and Recommender Systems, it's a natural next step in our ML/AI journey.
Introduction
Vector search, also known as similarity search, semantic search, or nearest neighbor search, is a technique that has become integral to many machine learning (ML) and artificial intelligence (AI) applications.
If you're looking to develop accurate, efficient, and scalable recommendation systems, information retrieval, agentic assistants, or integrated computer vision solutions, it is essential to understand vector search.
Vector Search is a particularly useful technique as we continue to build AI systems that better understand human interactions, behavior and requirements. It's also essential to know when vector search should or should not be applied, and how to configure it correctly for your particular use case.
In this article, we explore the concept of vector search and provide a practical example of its implementation, so that you can start incorporating it into your own ML and AI implementations.
Vector Search: A Primer
Why do we need Vector Search?
First, let's understand why Vector Search is needed.
Language is a fundamental form of human communication. It consists of words used in a structured and conventional way. Humans often convey these words in speech, writing, or gestures [1].
However, language can often be vague or require clarification [2]; even when written plainly, the complete meaning may not be obtained from a literal reading of the text (or image).
For example, in English, a word may have multiple meanings, or its definition may shift somewhat depending on context [3]. English, however, is quite a low-context language compared to Korean or Japanese [4]; these high-context languages create far greater complexity for AI and ML models.
Humans understand the structure of the language we use through experience, grammar classes, and the context in which the words or gestures have been applied.
For machines, however, a technique is needed to identify, measure, and rate these contexts in order to determine the correct meaning for use in interactions with human systems. That technique is Vector Search.
What is Vector Search?
Vector Search is a technique that operates on text or images converted into meaningful, analyzable data, so that ML systems can derive the intended context and semantics.
To enable ML systems to perform mathematical calculations and derive semantic, contextual meaning, we first need to transform text and images, known as unstructured data, into multidimensional numerical representations called Vectors. These vector representations are called Embeddings or Embedding Vectors, in which the data is structured in a meaningful way.
These embedding vectors capture the semantic meaning or features of the data. Each embedding will have a value of the features associated with the word (or combinations of words) that have been processed.
Embeddings are stored in a Vector Data Store, Vector Database, or catalog.
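To make this concrete, here is a minimal sketch of generating embeddings from text. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model; both are illustrative choices, and any embedding model would work similarly.

# A minimal sketch: converting text into embedding vectors.
# Assumes the sentence-transformers package and the 'all-MiniLM-L6-v2'
# model (illustrative choices, not the only options).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The bank raised interest rates.",
    "She sat on the river bank.",
]

# encode() returns one numerical vector per sentence; for this model,
# each embedding has 384 dimensions
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)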
Vector search is a technique for finding the most similar vectors to a given query vector in a large dataset of embedding vectors.
How does Vector Search know which vectors are most similar?
Vector Search uses the concept of measuring the distance between vectors. This distance determines the "similarity" of one vector to another and is usually measured using distance metrics such as Euclidean distance, cosine similarity, or dot product [5].
Shorter distances between vectors mean they are more similar, while larger distances indicate dissimilarity.
Other distance metrics, such as Manhattan Distance, are used in specific cases, such as when looking specifically for outliers or when working with high dimensionality, i.e., a large number of features / attributes / variables.
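To make these metrics concrete, here is a minimal NumPy sketch of each calculation; the two example vectors are arbitrary.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Euclidean (L2) distance: straight-line distance (smaller = more similar)
euclidean = np.linalg.norm(a - b)

# Manhattan (L1) distance: sum of absolute per-dimension differences
manhattan = np.sum(np.abs(a - b))

# Dot product: larger values indicate greater similarity
dot = np.dot(a, b)

# Cosine similarity: compares direction, ignoring magnitude (1.0 = same direction)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean: {euclidean:.4f}, Manhattan: {manhattan:.1f}, Dot: {dot:.1f}, Cosine: {cosine:.4f}")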
By searching through the catalog of vectors, Vector Search identifies the closest (or most similar) vectors in order of increasing distance. The vector at the shortest distance is the query's nearest neighbor.
When to Use Vector Search
Because vector search uses the similarity of vector representations, it can be useful for the following applications:
Recommendation Systems: suggesting products, songs, or articles whose embeddings are similar to items a user has already engaged with.
Information Retrieval: retrieving documents by semantic meaning rather than by exact keyword matches.
Natural Language Processing (NLP): powering semantic search, question answering, and retrieval-augmented generation over text embeddings.
Computer Vision: finding visually similar images, such as reverse image search or face matching.
Fraud/Anomaly Detection: flagging transactions or behaviors whose embeddings sit far from the clusters of expected activity.
Clustering: grouping similar vectors to discover structure in unlabeled data.
Duplicate Detection: identifying near-duplicate documents, images, or records by their very small vector distances.
This is not a comprehensive list; however, the key in each case is the ability to identify and cluster expected vs. unexpected behavior and similar vs. dissimilar profiles. The sketch below shows one of these applications in practice.
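Here is a minimal sketch of distance-based anomaly detection; the data, seed, and threshold are illustrative values, not a production recipe.

import numpy as np

rng = np.random.default_rng(42)

# 500 vectors of "expected" behavior, plus one clear outlier
normal = rng.normal(0, 1, size=(500, 8)).astype('float32')
outlier = np.full((1, 8), 6.0, dtype='float32')
data = np.vstack([normal, outlier])

# Measure each vector's distance from the centroid of the dataset
centroid = data.mean(axis=0)
distances = np.linalg.norm(data - centroid, axis=1)

# Flag vectors unusually far from the centroid (threshold is illustrative)
threshold = distances.mean() + 3 * distances.std()
anomalies = np.where(distances > threshold)[0]
print(f"Flagged indices: {anomalies}")  # expected to include index 500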
When Not to Use Vector Search
While vector search is powerful, it's not always the best solution for your use case. In scenarios like the following, consider some alternatives:
Exact-match lookups: when records are retrieved by a known key or ID, a hash map or database index is simpler, faster, and deterministic.
Keyword or structured queries: traditional full-text search or SQL filters are often more appropriate when users query by exact terms, ranges, or categories.
Small datasets: a simple brute-force comparison may be fast enough without the overhead of building and maintaining a vector index.
Strict explainability requirements: distance scores over learned embeddings can be difficult to justify to auditors or regulators.
A minimal sketch of the first scenario follows.
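# A minimal sketch of a case where vector search is unnecessary:
# exact-match lookup by a known key. A plain dictionary answers the
# query directly, with no embeddings or distance calculations.
# (The catalog and SKUs below are made-up illustrative data.)
product_catalog = {
    "SKU-1001": {"name": "USB-C Cable", "price": 9.99},
    "SKU-1002": {"name": "Wireless Mouse", "price": 24.99},
}

# O(1) retrieval, fully deterministic and explainable
print(product_catalog["SKU-1002"])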
Some Considerations for Vector Search Configuration
Although the distance measure itself seems simple enough, it's important to remember that it is the result of a potentially sizeable multidimensional calculation. So, when deciding which vector search implementation to adopt, it helps to understand the critical characteristics of each index type, in particular whether it performs an exact or an approximate search, and how its parameters trade accuracy against speed and memory.
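As a sketch of what such configuration looks like in practice, here is a comparison of an exact Faiss index with an approximate IVF index; the nlist and nprobe values are illustrative and should be tuned for your own data. (Faiss itself is introduced in the next section.)

import faiss
import numpy as np

d = 128
data = np.random.rand(10000, d).astype('float32')
query = np.random.rand(1, d).astype('float32')

# Exact (flat) index: compares the query against every vector
flat = faiss.IndexFlatL2(d)
flat.add(data)

# IVF index: partitions vectors into nlist clusters, then searches only
# nprobe clusters per query -- faster, but approximate
nlist = 100                  # illustrative value
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(data)              # IVF indexes must be trained before adding vectors
ivf.add(data)
ivf.nprobe = 10              # illustrative; higher = better recall, slower search

print(flat.search(query, 5)[1])  # exact nearest-neighbor indices
print(ivf.search(query, 5)[1])   # approximate nearest-neighbor indices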
Example: Implementing Vector Search with Faiss
Now, let's get practical and explore an example of using vector search in Python.
For this example, we'll use the Faiss library, developed by Facebook AI Research, which is highly efficient for similarity search and clustering dense vectors.
Dependent Libraries
First, we'll install the necessary libraries:
In your Python environment's terminal (PowerShell or Bash), run:
pip install faiss-cpu numpy scikit-learn
Methods
Now, let's create a Python script file called testfaiss.py to demonstrate vector search on a dataset of random vectors.
You can choose to place this into a Python class file, but for simplicity, we will keep all methods in a single file.
We'll break this down into small components for ease of understanding.
import numpy as np
import faiss
from sklearn.datasets import make_blobs
import time
Create a method to generate a sample dataset
def generate_dataset(num_vectors, num_dimensions):
    # make_blobs generates clustered sample data; a fixed random_state
    # keeps the dataset reproducible across runs
    X, _ = make_blobs(n_samples=num_vectors, n_features=num_dimensions, centers=5, random_state=42)
    return X.astype('float32')
Create a method to perform a brute force search
def brute_force_search(database, query, k):
    # Euclidean (L2) distance from the query to every vector in the database
    distances = np.linalg.norm(database - query, axis=1)
    # Indices of the k smallest distances
    indices = np.argsort(distances)[:k]
    return indices, distances[indices]
Create a method to perform a FAISS Search
def faiss_search(index, query, k):
    # Faiss expects a 2D array of queries, so reshape the single query
    distances, indices = index.search(query.reshape(1, -1), k)
    return indices[0], distances[0]
Main Method
Create the main function to run this example
def main():
    num_vectors = 100000
    num_dimensions = 128
    k = 5

    # Show message to the user
    print(f"Generating dataset with {num_vectors} vectors of {num_dimensions} dimensions...")
    dataset = generate_dataset(num_vectors, num_dimensions)

    # Create a random query vector
    query = np.random.rand(num_dimensions).astype('float32')

    # Brute-force search
    start_time = time.time()
    bf_indices, bf_distances = brute_force_search(dataset, query, k)
    bf_time = time.time() - start_time
    print(f"\nBrute-force search time: {bf_time:.4f} seconds")

    # Faiss search (IndexFlatL2 performs an exact, exhaustive L2 search)
    index = faiss.IndexFlatL2(num_dimensions)
    index.add(dataset)
    start_time = time.time()
    faiss_indices, faiss_distances = faiss_search(index, query, k)
    faiss_time = time.time() - start_time
    print(f"Faiss search time: {faiss_time:.4f} seconds")

    # Print results
    print(f"\nTop {k} nearest neighbors:")
    print("Brute-force results:")
    for i, (idx, dist) in enumerate(zip(bf_indices, bf_distances)):
        print(f" {i+1}. Index: {idx}, Distance: {dist:.4f}")
    print("\nFaiss results:")
    for i, (idx, dist) in enumerate(zip(faiss_indices, faiss_distances)):
        print(f" {i+1}. Index: {idx}, Distance: {dist:.4f}")
    print(f"\nSpeed-up factor: {bf_time / faiss_time:.2f}x")
And finally, calling the main function
if __name__ == "__main__":
    main()
Executing the FAISS Script
Now execute it from your IDE or on the command line as follows:
> python testfaiss.py
You should get results like the following (the exact numbers will differ from run to run, since the query vector is random).
Generating dataset with 100000 vectors of 128 dimensions...
Brute-force search time: 0.0451 seconds
Faiss search time: 0.0080 seconds
Top 5 nearest neighbors:
Brute-force results:
1. Index: 66500, Distance: 62.6519
2. Index: 37685, Distance: 62.7629
3. Index: 47971, Distance: 62.9782
4. Index: 16022, Distance: 63.0158
5. Index: 55022, Distance: 63.0963
Faiss results:
1. Index: 66500, Distance: 3925.2627
2. Index: 37685, Distance: 3939.1816
3. Index: 47971, Distance: 3966.2549
4. Index: 16022, Distance: 3970.9897
5. Index: 55022, Distance: 3981.1406
Speed-up factor: 5.63x
There you go, a first working version of Vector Search! Note that although both result sets rank the same neighbors identically, the distance values differ: our brute-force function returns Euclidean distances, while Faiss's IndexFlatL2 returns squared Euclidean distances (62.6519² ≈ 3925.26).
Looking Ahead
Your implementation is likely to be more complex than this, and you will most likely feed the output of your vector search into downstream decision-making.
If you are looking for an alternative to FAISS, popular options in 2024 include Annoy, ScaNN, hnswlib, Milvus, Weaviate, Qdrant, and Pinecone.
Happy Vector Searching.
About the Co-Authors
Paul-Benjamin Ramírez is the CTO of Automi and writes about creativity, data and security, regulations, and AI. David Willett is a technical leader in AI/ML implementations and a keen researcher in models and approaches who makes AI/ML accessible by demystifying the terminology.
Previous Articles in this Series
The AI ToolBox #1: Combatting Hallucinations with Retrieval-Augmented Generation (RAG), Ramirez (2024)