The AI ToolBox #2: Vector Search in Machine Learning and AI

In the AI Toolbox series, we aim to provide you with key insights into important tools for building AI systems. In this article, we discuss Vector Search, an essential tool for helping machines understand meaning and context in text and images. With wide-ranging applications in Computer Vision, Fraud Detection, and Recommender Systems, it is a natural next step in our ML/AI journey.


Key Terms and Concepts

  • Vectors
  • Embeddings
  • Similarity
  • Distance
  • Clustering


Introduction

Vector search, also known as similarity search, semantic search, or nearest neighbor search, is a technique that has become integral to many machine learning (ML) and artificial intelligence (AI) applications.

If you're looking to develop accurate, efficient, and scalable recommendation systems, information retrieval, agentic assistants, or integrated computer vision solutions, it is essential to understand vector search.

Vector Search is a particularly useful technique as we continue to build AI systems that better understand human interactions, behavior and requirements. It's also essential to know when vector search should or should not be applied, and how to configure it correctly for your particular use case.

In this article, we explore the concept of vector search:

  • Why has it come about?
  • When should you use it?
  • What are its limitations?
  • What are the alternatives to consider?

Finally, we will provide a practical example of its implementation so that you can start incorporating it into your ML and AI implementations.



VECTOR SEARCH - A PRIMER

Why do we need Vector Search?

First, let's understand why Vector Search is needed.

Language is a fundamental form of human communication. It consists of words used in a structured and conventional way. Humans often convey these words in speech, writing, or gestures [1].

However, language can often be vague or ambiguous [2]; even when text is written plainly, its complete meaning may not be recoverable from a literal reading of the text (or image).

For example, in English, a word may have multiple meanings, or a definition may shift somewhat depending on context [3]. English, however, is quite a low-context language compared to Korean or Japanese [4]. These high-context languages create far greater complexity for AI and ML models.

Humans understand the structure of the language we use through experience, grammar classes, and the context in which the words or gestures have been applied.

Machines, however, need a technique to identify, measure, and rate these contexts so they can accurately determine the intended meaning when interacting with human systems. That technique is Vector Search.


What is Vector Search?

Vector Search is a technique built on converting text or images into meaningful and analyzable data so that ML systems can derive the intended context and semantics.

To enable ML systems to perform mathematical calculations and derive semantic, contextual meaning, we first need to transform text and images, known as unstructured data, into multidimensional numerical representations called Vectors. These vector representations are called Embeddings or Embedding Vectors; they structure the data in a meaningful, machine-readable way.

These embedding vectors capture the semantic meaning or features of the data. Each embedding will have a value of the features associated with the word (or combinations of words) that have been processed.

Embeddings are stored in a Vector Data Store, Vector Database, or catalogue.

Vector search is a technique for finding the most similar vectors to a given query vector in a large dataset of embedding vectors.
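
To make this concrete, here is a minimal sketch of generating embedding vectors from text. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, and any embedding model would work similarly.

# Minimal embedding sketch -- assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model (384-dimensional vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock prices fell sharply today.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)

The first two sentences, which share a meaning, produce vectors that lie close together in the embedding space, while the third lands farther away. This proximity is exactly what vector search exploits.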


How does Vector Search know what is most similar?

Vector Search uses the concept of measuring the distance between vectors. This distance determines the "similarity" of one vector to another and is usually measured using distance metrics such as Euclidean distance, cosine similarity, or dot product [5].


Euclidean Distance

d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )


Cosine Distance

cos(a, b) = (a · b) / (‖a‖ ‖b‖), with cosine distance = 1 − cos(a, b)


Dot Product

a · b = Σᵢ aᵢ bᵢ

Shorter absolute distances between vectors mean they are more similar, while larger absolute distances indicate dissimilarity.

Other distance metrics, such as Manhattan Distance, are used in specific cases, such as when looking specifically for outliers or where there is high dimensionality, i.e. a large number of features / attributes / variables. A quick sketch of these metrics follows below.
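
As a quick illustration, here is a minimal NumPy sketch of the metrics above; the two vectors are made up purely for demonstration.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

# Euclidean (L2) distance: straight-line distance between the points
euclidean = np.linalg.norm(a - b)

# Dot product: large when vectors are long and point the same way
dot = np.dot(a, b)

# Cosine similarity: compares direction only, ignoring vector length
cosine_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

print(euclidean, dot, cosine_sim, manhattan)

Note that the dot product and cosine similarity grow as vectors become more similar, whereas the Euclidean and Manhattan distances shrink.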

By searching through the catalog of vectors, Vector Search identifies the closest (or most similar) vectors in order of shortest distance. The vector at the shortest distance is also called the query vector's nearest neighbor.



Source: Paul Ramirez | Automi



When to Use Vector Search


Because vector search uses the similarity of vector representations, it can be useful for the following applications:


Recommendation Systems:

  • Finding similar items or users based on current and previous behavior or profile
  • Content-based filtering using feature vectors for similar or potentially new and interesting items


Information Retrieval:

  • Semantic search in document databases for relevant information or references
  • Image search using visual feature vectors


Natural Language Processing (NLP):

  • Finding similar words, sentences, or documents based on their feature vectors
  • More nuanced Question-answering systems that utilize the context of the conversation


Computer Vision:

  • Finding similar images
  • Face recognition and verification


Fraud/Anomaly Detection:

  • Identifying unusual user activity or changed behaviors by examining the activities' vector distance from other vectors
  • Recognizing system errors or changing behavior of a system that may indicate external actors
  • Detecting the originality of images


Clustering:

  • As a building block for clustering algorithms like k-means or DBSCAN (see the sketch below)


Duplicate Detection:

  • Finding near-duplicate items in large datasets
  • Reducing complexity by removing duplicate or similar data


This is not a comprehensive list; the key thread across these applications is the ability to identify and group expected vs. unexpected behavior and similar vs. dissimilar profiles.
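
Regarding the clustering building block above, here is a minimal sketch using scikit-learn's KMeans on random stand-in vectors; in practice you would feed it real embedding vectors.

import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for real embedding vectors
vectors = np.random.rand(500, 64).astype('float32')

# k-means groups vectors by Euclidean proximity -- the same
# distance notion that vector search relies on
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(vectors)

print(labels[:10])                     # cluster id per vector
print(kmeans.cluster_centers_.shape)   # (5, 64) centroids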



When Not to Use Vector Search

While vector search is powerful, it's not always the best solution for your use case.

In these scenarios, consider some alternatives to analyze:

  1. Small or Limited Datasets: For very small datasets, a simple brute-force search might be more efficient and easier to implement, since the overhead of building and maintaining a vector index may outweigh its benefits. Traditional indexing methods are also likely to be faster and simpler in these scenarios.
  2. Exact matching required: Vector search focuses on similarity rather than exactness. Traditional database indexing might be more appropriate if you need exact matches rather than similar items.
  3. Low-dimensional, structured data: Traditional database queries might suit low-dimensional data with a clear structure. In cases where data is highly structured and relationships between different data entities (e.g., SQL joins) are crucial, vector search is not designed to handle these relationships as efficiently as traditional databases.
  4. Categorical data: Vector search is primarily designed for continuous numerical data. Other techniques might be more appropriate for categorical data.
  5. Interpretability is crucial: Interpreting the results of a vector search can be difficult at times, especially in high-dimensional spaces.
  6. Limited computational resources: Some vector search algorithms require significant memory or computational power, and resource-constrained environments may be too underpowered to run them effectively.



Some Considerations for Vector Search Configuration

Although the absolute distance measure seems simple enough, remember that it results from a potentially sizeable multidimensional calculation. So, when deciding which vector search to implement, it helps to understand some critical characteristics of the candidate approaches.


  1. Distance Metrics: Which metric to use depends on the type of data and expected output: cosine similarity works well for text and NLP tasks; Euclidean distance is often applied to image search; and dot products are commonly used with dense embeddings and in recommendation systems.
  2. Efficiency (Memory and Storage): Vector search algorithms are designed to find similar vectors in large datasets quickly; each may take a different approach to memory use (e.g. on-disk indexes, or in-memory for speed).
  3. Scalability: Many vector search methods can handle millions or billions of vectors. As the data size scales up, techniques like quantization, clustering, or approximate nearest neighbors (ANN) may be deployed to remain fast and accurate.
  4. Approximate results: Some vector search algorithms trade perfect accuracy for speed, returning approximate nearest neighbors (ANN) instead of exact nearest neighbors. ANN methods such as HNSW and IVF (implemented in libraries like FAISS) offer fast search by sacrificing some precision, which is ideal for large-scale datasets.
  5. Dimensionality support: To ensure fast and accurate search, high-dimensional data often requires more sophisticated optimization techniques like quantization, clustering, or ANN. A configuration sketch follows this list.
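
To illustrate these configuration choices, here is a minimal FAISS sketch contrasting an exact flat index with an approximate IVF index. The dataset is random, and the parameter values (nlist, nprobe) are illustrative assumptions, not tuned recommendations.

import numpy as np
import faiss

d = 128                                            # vector dimensionality
xb = np.random.rand(10000, d).astype('float32')    # database vectors
xq = np.random.rand(1, d).astype('float32')        # one query vector

# Exact search: IndexFlatL2 compares the query against every vector
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# Approximate search: IVF partitions vectors into nlist clusters,
# then searches only the nprobe closest clusters (speed vs. recall)
nlist = 100                          # illustrative value
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                        # learn the cluster centroids
ivf.add(xb)
ivf.nprobe = 10                      # more probes = better recall, slower search
D_approx, I_approx = ivf.search(xq, 5)

print("exact:", I_exact[0], "approx:", I_approx[0])

Raising nprobe moves the IVF index toward exact results at the cost of speed; this single knob captures the accuracy/latency trade-off described above.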



Example: Implementing Vector Search with Faiss

Now, let's get practical and explore an example of using vector search in Python.

For this example, we'll use the Faiss library, developed by Facebook AI Research, which is highly efficient for similarity search and clustering dense vectors.


Dependent Libraries

First, we'll install the necessary libraries:

In your Python Environment / Terminal or Powershell / Bash

pip install faiss-cpu numpy scikit-learn        


Methods

Now, let's create a Python script file called testfaiss.py to demonstrate vector search on a dataset of random vectors.

You can choose to place this into a Python class file, but for simplicity, we will keep all methods in a single file.

We'll break this down into small components for ease of understanding.

import numpy as np 
import faiss 
from sklearn.datasets import make_blobs 
import time        

Create a method to generate a sample dataset

def generate_dataset(num_vectors, num_dimensions):
    # Generate clustered random vectors; FAISS expects float32
    X, _ = make_blobs(n_samples=num_vectors, n_features=num_dimensions,
                      centers=5, random_state=42)
    return X.astype('float32')

Create a method to perform a brute force search

def brute_force_search(database, query, k):
    # Compute the Euclidean (L2) distance from the query to every vector
    distances = np.linalg.norm(database - query, axis=1)
    # Keep the indices of the k smallest distances
    indices = np.argsort(distances)[:k]
    return indices, distances[indices]

Create a method to perform a FAISS Search

def faiss_search(index, query, k):
    # index.search expects a 2D array of queries, hence the reshape
    distances, indices = index.search(query.reshape(1, -1), k)
    return indices[0], distances[0]

Main Method

Create the main function to run this example

def main(): 
    num_vectors = 100000 
    num_dimensions = 128 
    k = 5        
    # Show message to the user    
    print(f"Generating dataset with {num_vectors} vectors of {num_dimensions} dimensions...")
    dataset = generate_dataset(num_vectors, num_dimensions)

    # Create a random query vector
    query = np.random.rand(num_dimensions).astype('float32')

    # Brute-force search
    start_time = time.time()
    bf_indices, bf_distances = brute_force_search(dataset, query, k)
    bf_time = time.time() - start_time
    print(f"\nBrute-force search time: {bf_time:.4f} seconds")

    # Faiss search (IndexFlatL2 performs exact search and returns squared L2 distances)
    index = faiss.IndexFlatL2(num_dimensions)
    index.add(dataset)

    start_time = time.time()
    faiss_indices, faiss_distances = faiss_search(index, query, k)
    faiss_time = time.time() - start_time
    print(f"Faiss search time: {faiss_time:.4f} seconds")

    # Print results
    print(f"\nTop {k} nearest neighbors:")
    print("Brute-force results:")
    for i, (idx, dist) in enumerate(zip(bf_indices, bf_distances)):
        print(f"  {i+1}. Index: {idx}, Distance: {dist:.4f}")

    print("\nFaiss results:")
    for i, (idx, dist) in enumerate(zip(faiss_indices, faiss_distances)):
        print(f"  {i+1}. Index: {idx}, Distance: {dist:.4f}")

    print(f"\nSpeed-up factor: {bf_time / faiss_time:.2f}x")        

And finally, calling the main function

if __name__ == "__main__":
    main()        


Executing the FAISS Script

Now execute it using your IDE or in the command line as follows

> python testfaiss.py        

You should get results like the following (the exact numbers will differ for each run, since the query vector is randomly generated).

Generating dataset with 100000 vectors of 128 dimensions...

Brute-force search time: 0.0451 seconds
Faiss search time: 0.0080 seconds

Top 5 nearest neighbors:
Brute-force results:
  1. Index: 66500, Distance: 62.6519
  2. Index: 37685, Distance: 62.7629
  3. Index: 47971, Distance: 62.9782
  4. Index: 16022, Distance: 63.0158
  5. Index: 55022, Distance: 63.0963

Faiss results:
  1. Index: 66500, Distance: 3925.2627
  2. Index: 37685, Distance: 3939.1816
  3. Index: 47971, Distance: 3966.2549
  4. Index: 16022, Distance: 3970.9897
  5. Index: 55022, Distance: 3981.1406

Speed-up factor: 5.63x        

There you go, a first working version of Vector Search! One detail worth noting: both searches agree on the nearest neighbors, but the distance values differ because FAISS's IndexFlatL2 reports squared L2 distances, while our brute-force function reports the L2 distance itself (e.g. 62.6519² ≈ 3925.26).


Looking Ahead

Your implementation is likely to be more complex than this, and you will most likely use the output of the Vector Search in downstream decision-making.
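
As one next step, here is a hedged sketch of cosine-similarity search with FAISS, a common choice for text embeddings. FAISS has no dedicated cosine index, but normalizing the vectors and using an inner-product index (IndexFlatIP) is mathematically equivalent. The random vectors here are stand-ins for real embeddings.

import numpy as np
import faiss

d = 128
vectors = np.random.rand(1000, d).astype('float32')   # stand-in for real embeddings
query = np.random.rand(1, d).astype('float32')

# Normalize to unit length in place; inner product then equals cosine similarity
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)   # IP = inner (dot) product
index.add(vectors)

scores, ids = index.search(query, 5)   # higher score = more similar
print(ids[0], scores[0])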


If you are looking for an alternative to FAISS, there are several strong options available in 2024; evaluate them against your scale, latency, and accuracy requirements.


Happy Vector Searching.


About the Co-Authors

Paul-Benjamin Ramírez is the CTO of Automi and writes about creativity, data and security, regulations, and AI. David Willett is a technical leader in AI/ML implementations and a keen researcher in models and approaches who makes AI/ML accessible by demystifying the terminology.


Previous Articles in this Series

The AI ToolBox #1: Combatting Hallucinations with Retrieval-Augmented Generation (RAG), Ramirez (2024)


References

[1] Language, United Nations

[2] "Vague language and context dependence", Lim Wooyoung , Wu Qinggong, Frontiers in Behavioral Economics, Vol 2, (2023)

[3] "A Study on English Writing Pattern Under the Impact of High-context and Low-context Cultures", Zou, Yumei. (2019)

[4] The influence of high/low-context culture and power distance on choice of communication media: Students' media choice to communicate with Professors in Japan and America" Richardson, Smith, International Journal of Intercultural Relations, Vol31, Issue4 (2007)

[5] Vector Similarity, Pincone

[6] FAISS. Meta Engineering
