The AI ToolBox #2: Vector Search in Machine Learning and AI

In the AI Toolbox series, we aim to provide you with key insights into important tools for building AI systems. In this article, we discuss Vector Search, an essential tool for helping machines understand meaning and context in text and images. With wide-ranging applications in Computer Vision, Fraud Detection, and Recommender Systems, it is a natural next step in our ML/AI journey.


Key Terms and Concepts

  • Vectors
  • Embeddings
  • Similarity
  • Distance
  • Clustering


Introduction

Vector search, also known as similarity search, semantic search, or nearest neighbor search, is a technique that has become integral to many machine learning (ML) and artificial intelligence (AI) applications.

If you're looking to develop accurate, efficient, and scalable recommendation systems, information retrieval, agentic assistants, or integrated computer vision solutions, it is essential to understand vector search.

Vector Search is a particularly useful technique as we continue to build AI systems that better understand human interactions, behavior and requirements. It's also essential to know when vector search should or should not be applied, and how to configure it correctly for your particular use case.

In this article, we explore the concept of vector search:

  • Why has it come about?
  • When should you use it?
  • What are its limitations?
  • What are the alternatives to consider?

Finally, we will provide a practical example of its implementation so that you can start incorporating it into your ML and AI implementations.



VECTOR SEARCH - A PRIMER

Why do we need Vector Search?

First, let's understand why Vector Search is needed.

Language is a fundamental form of human communication. It consists of words used in a structured and conventional way. Humans often convey these words in speech, writing, or gestures [1].

However, language can often be vague or ambiguous [2]; even when text is written plainly, its complete meaning may not be recoverable from a literal reading of the text (or image).

For example, in English, a word may have multiple meanings, or a definition may shift somewhat depending on context [3]. English, however, is quite a low-context language compared to Korean or Japanese [4]. These high-context languages create far greater complexity for AI and ML models.

Humans understand the structure of the language we use through experience, grammar classes, and the context in which the words or gestures have been applied.

Machines, however, need a technique to identify, measure, and rate these contexts so they can accurately determine the intended meaning when interacting with human systems. That technique is Vector Search.


What is Vector Search?

Vector Search is a technique built on converting text or images into meaningful and analyzable data so that ML systems can derive the intended context and semantics.

To enable ML systems to perform mathematical calculations and derive semantic, contextual meaning, we first need to transform text and images, known as unstructured data, into multidimensional numerical representations called Vectors. These vector representations are called Embeddings or Embedding Vectors; they structure the data in a meaningful, machine-readable way.

These embedding vectors capture the semantic meaning or features of the data. Each embedding will have a value of the features associated with the word (or combinations of words) that have been processed.

Embeddings are stored in a Vector Data Store, Vector Database, or catalogue.

Vector search is a technique for finding the most similar vectors to a given query vector in a large dataset of embedding vectors.
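
To make this concrete, here is a minimal sketch of generating embedding vectors from text. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, and any embedding model would work similarly.

# Minimal embedding sketch -- assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model (384-dimensional vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock prices fell sharply today.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)

The first two sentences, which share a meaning, produce vectors that lie close together in the embedding space, while the third lands farther away. This proximity is exactly what vector search exploits.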


How does Vector Search know what is most similar?

Vector Search uses the concept of measuring the distance between vectors. This distance determines the "similarity" of one vector to another and is usually measured using distance metrics such as Euclidean distance, cosine similarity, or dot product [5].


Euclidean Distance

d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )


Cosine Distance

cos(a, b) = (a · b) / (‖a‖ ‖b‖), with cosine distance = 1 − cos(a, b)


Dot Product

a · b = Σᵢ aᵢ bᵢ

Shorter absolute distances between vectors mean they are more similar, while larger absolute distances indicate dissimilarity.

Other distance metrics, such as Manhattan Distance, are used in specific cases, such as when looking specifically for outliers or where there is high dimensionality, i.e. a large number of features / attributes / variables. A quick sketch of these metrics follows below.
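
As a quick illustration, here is a minimal NumPy sketch of the metrics above; the two vectors are made up purely for demonstration.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

# Euclidean (L2) distance: straight-line distance between the points
euclidean = np.linalg.norm(a - b)

# Dot product: large when vectors are long and point the same way
dot = np.dot(a, b)

# Cosine similarity: compares direction only, ignoring vector length
cosine_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

print(euclidean, dot, cosine_sim, manhattan)

Note that the dot product and cosine similarity grow as vectors become more similar, whereas the Euclidean and Manhattan distances shrink.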

By searching through the catalog of vectors, Vector Search identifies the closest (or most similar) vectors in order of shortest distance. The vector at the shortest distance is also called the query vector's nearest neighbor.



Source: Paul Ramirez | Automi



When to Use Vector Search


Because vector search uses the similarity of vector representations, it can be useful for the following applications:


Recommendation Systems:

  • Finding similar items or users based on current and previous behavior or profile
  • Content-based filtering using feature vectors for similar or potentially new and interesting items


Information Retrieval:

  • Semantic search in document databases for relevant information or references
  • Image search using visual feature vectors


Natural Language Processing (NLP):

  • Finding similar words, sentences, or documents based on their feature vectors
  • More nuanced Question-answering systems that utilize the context of the conversation


Computer Vision:

  • Finding similar images
  • Face recognition and verification


Fraud/Anomaly Detection:

  • Identifying unusual user activity or changed behaviors by examining the activities' vector distance from other vectors
  • Recognizing system errors or changing behavior of a system that may indicate external actors
  • Detecting the originality of images


Clustering:

  • As a building block for clustering algorithms like k-means or DBSCAN (see the sketch below)


Duplicate Detection:

  • Finding near-duplicate items in large datasets
  • Reducing complexity by removing duplicate or similar data


This is not a comprehensive list; the key thread across these applications is the ability to identify and group expected vs. unexpected behavior and similar vs. dissimilar profiles.
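
Regarding the clustering building block above, here is a minimal sketch using scikit-learn's KMeans on random stand-in vectors; in practice you would feed it real embedding vectors.

import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for real embedding vectors
vectors = np.random.rand(500, 64).astype('float32')

# k-means groups vectors by Euclidean proximity -- the same
# distance notion that vector search relies on
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(vectors)

print(labels[:10])                     # cluster id per vector
print(kmeans.cluster_centers_.shape)   # (5, 64) centroids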



When Not to Use Vector Search

While vector search is powerful, it's not always the best solution for your use case.

In these scenarios, consider some alternatives to analyze:

  1. Small or Limited Datasets: For very small datasets, a simple brute-force search might be more efficient and easier to implement, since the overhead of building and maintaining a vector index may outweigh its benefits. Traditional indexing methods are also likely to be faster and simpler in these scenarios.
  2. Exact matching required: Vector search focuses on similarity rather than exactness. Traditional database indexing might be more appropriate if you need exact matches rather than similar items.
  3. Low-dimensional, structured data: Traditional database queries might suit low-dimensional data with a clear structure. In cases where data is highly structured and relationships between different data entities (e.g., SQL joins) are crucial, vector search is not designed to handle these relationships as efficiently as traditional databases.
  4. Categorical data: Vector search is primarily designed for continuous numerical data. Other techniques might be more appropriate for categorical data.
  5. Interpretability is crucial: Interpreting the results of a vector search can be difficult at times, especially in high-dimensional spaces.
  6. Limited computational resources: Some vector search algorithms require significant memory or computational power, and resource-constrained environments may be too underpowered to run them effectively.



Some Considerations for Vector Search Configuration

Although the absolute distance measure seems simple enough, remember that it results from a potentially sizeable multidimensional calculation. So, when deciding which vector search to implement, it helps to understand some critical characteristics of the candidate approaches.


  1. Distance Metrics: Which metric to use depends on the type of data and expected output: cosine similarity works well for text and NLP tasks; Euclidean distance is often applied to image search; and dot products are commonly used with dense embeddings and in recommendation systems.
  2. Efficiency (Memory and Storage): Vector search algorithms are designed to find similar vectors in large datasets quickly; each may take a different approach to memory use (e.g. on-disk indexes, or in-memory for speed).
  3. Scalability: Many vector search methods can handle millions or billions of vectors. As the data size scales up, techniques like quantization, clustering, or approximate nearest neighbors (ANN) may be deployed to remain fast and accurate.
  4. Approximate results: Some vector search algorithms trade perfect accuracy for speed, returning approximate nearest neighbors (ANN) instead of exact nearest neighbors. ANN methods such as HNSW and IVF (implemented in libraries like FAISS) offer fast search by sacrificing some precision, which is ideal for large-scale datasets.
  5. Dimensionality support: To ensure fast and accurate search, high-dimensional data often requires more sophisticated optimization techniques like quantization, clustering, or ANN. A configuration sketch follows this list.
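
To illustrate these configuration choices, here is a minimal FAISS sketch contrasting an exact flat index with an approximate IVF index. The dataset is random, and the parameter values (nlist, nprobe) are illustrative assumptions, not tuned recommendations.

import numpy as np
import faiss

d = 128                                            # vector dimensionality
xb = np.random.rand(10000, d).astype('float32')    # database vectors
xq = np.random.rand(1, d).astype('float32')        # one query vector

# Exact search: IndexFlatL2 compares the query against every vector
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# Approximate search: IVF partitions vectors into nlist clusters,
# then searches only the nprobe closest clusters (speed vs. recall)
nlist = 100                          # illustrative value
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                        # learn the cluster centroids
ivf.add(xb)
ivf.nprobe = 10                      # more probes = better recall, slower search
D_approx, I_approx = ivf.search(xq, 5)

print("exact:", I_exact[0], "approx:", I_approx[0])

Raising nprobe moves the IVF index toward exact results at the cost of speed; this single knob captures the accuracy/latency trade-off described above.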



Example: Implementing Vector Search with Faiss

Now, let's get practical and explore an example of using vector search in Python.

For this example, we'll use the Faiss library, developed by Facebook AI Research, which is highly efficient for similarity search and clustering dense vectors.


Dependent Libraries

First, we'll install the necessary libraries:

In your Python Environment / Terminal or Powershell / Bash

pip install faiss-cpu numpy scikit-learn        


Methods

Now, let's create a Python script file called testfaiss.py to demonstrate vector search on a dataset of random vectors.

You can choose to place this into a Python class file, but for simplicity, we will keep all methods in a single file.

We'll break this down into small components for ease of understanding.

import numpy as np 
import faiss 
from sklearn.datasets import make_blobs 
import time        

Create a method to generate a sample dataset

def generate_dataset(num_vectors, num_dimensions):
    # Generate clustered random vectors; FAISS expects float32
    X, _ = make_blobs(n_samples=num_vectors, n_features=num_dimensions,
                      centers=5, random_state=42)
    return X.astype('float32')

Create a method to perform a brute force search

def brute_force_search(database, query, k):
    # Compute the Euclidean (L2) distance from the query to every vector
    distances = np.linalg.norm(database - query, axis=1)
    # Keep the indices of the k smallest distances
    indices = np.argsort(distances)[:k]
    return indices, distances[indices]

Create a method to perform a FAISS Search

def faiss_search(index, query, k):
    # index.search expects a 2D array of queries, hence the reshape
    distances, indices = index.search(query.reshape(1, -1), k)
    return indices[0], distances[0]

Main Method

Create the main function to run this example

def main(): 
    num_vectors = 100000 
    num_dimensions = 128 
    k = 5        
    # Show message to the user    
    print(f"Generating dataset with {num_vectors} vectors of {num_dimensions} dimensions...")
    dataset = generate_dataset(num_vectors, num_dimensions)

    # Create a random query vector
    query = np.random.rand(num_dimensions).astype('float32')

    # Brute-force search
    start_time = time.time()
    bf_indices, bf_distances = brute_force_search(dataset, query, k)
    bf_time = time.time() - start_time
    print(f"\nBrute-force search time: {bf_time:.4f} seconds")

    # Faiss search (IndexFlatL2 performs exact search and returns squared L2 distances)
    index = faiss.IndexFlatL2(num_dimensions)
    index.add(dataset)

    start_time = time.time()
    faiss_indices, faiss_distances = faiss_search(index, query, k)
    faiss_time = time.time() - start_time
    print(f"Faiss search time: {faiss_time:.4f} seconds")

    # Print results
    print(f"\nTop {k} nearest neighbors:")
    print("Brute-force results:")
    for i, (idx, dist) in enumerate(zip(bf_indices, bf_distances)):
        print(f"  {i+1}. Index: {idx}, Distance: {dist:.4f}")

    print("\nFaiss results:")
    for i, (idx, dist) in enumerate(zip(faiss_indices, faiss_distances)):
        print(f"  {i+1}. Index: {idx}, Distance: {dist:.4f}")

    print(f"\nSpeed-up factor: {bf_time / faiss_time:.2f}x")        

And finally, calling the main function

if __name__ == "__main__":
    main()        


Executing the FAISS Script

Now execute it using your IDE or in the command line as follows

> python testfaiss.py        

You should get results like the following (the exact numbers will differ for each run, since the query vector is randomly generated).

Generating dataset with 100000 vectors of 128 dimensions...

Brute-force search time: 0.0451 seconds
Faiss search time: 0.0080 seconds

Top 5 nearest neighbors:
Brute-force results:
  1. Index: 66500, Distance: 62.6519
  2. Index: 37685, Distance: 62.7629
  3. Index: 47971, Distance: 62.9782
  4. Index: 16022, Distance: 63.0158
  5. Index: 55022, Distance: 63.0963

Faiss results:
  1. Index: 66500, Distance: 3925.2627
  2. Index: 37685, Distance: 3939.1816
  3. Index: 47971, Distance: 3966.2549
  4. Index: 16022, Distance: 3970.9897
  5. Index: 55022, Distance: 3981.1406

Speed-up factor: 5.63x        

There you go, a first working version of Vector Search! One detail worth noting: both searches agree on the nearest neighbors, but the distance values differ because FAISS's IndexFlatL2 reports squared L2 distances, while our brute-force function reports the L2 distance itself (e.g. 62.6519² ≈ 3925.26).


Looking Ahead

Your implementation is likely to be more complex than this, and you will most likely use the output of the Vector Search in downstream decision-making.
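
As one next step, here is a hedged sketch of cosine-similarity search with FAISS, a common choice for text embeddings. FAISS has no dedicated cosine index, but normalizing the vectors and using an inner-product index (IndexFlatIP) is mathematically equivalent. The random vectors here are stand-ins for real embeddings.

import numpy as np
import faiss

d = 128
vectors = np.random.rand(1000, d).astype('float32')   # stand-in for real embeddings
query = np.random.rand(1, d).astype('float32')

# Normalize to unit length in place; inner product then equals cosine similarity
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)   # IP = inner (dot) product
index.add(vectors)

scores, ids = index.search(query, 5)   # higher score = more similar
print(ids[0], scores[0])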


If you are looking for an alternative to FAISS, there are several strong options available in 2024; evaluate them against your scale, latency, and accuracy requirements.


Happy Vector Searching.


About the Co-Authors

Paul-Benjamin Ramírez is the CTO of Automi and writes about creativity, data and security, regulations, and AI. David Willett is a technical leader in AI/ML implementations and a keen researcher in models and approaches who makes AI/ML accessible by demystifying the terminology.


Previous Articles in this Series

The AI ToolBox #1: Combatting Hallucinations with Retrieval-Augmented Generation (RAG), Ramirez (2024)


References

[1] Language, United Nations

[2] "Vague language and context dependence", Lim Wooyoung , Wu Qinggong, Frontiers in Behavioral Economics, Vol 2, (2023)

[3] "A Study on English Writing Pattern Under the Impact of High-context and Low-context Cultures", Zou, Yumei. (2019)

[4] The influence of high/low-context culture and power distance on choice of communication media: Students' media choice to communicate with Professors in Japan and America" Richardson, Smith, International Journal of Intercultural Relations, Vol31, Issue4 (2007)

[5] Vector Similarity, Pincone

[6] FAISS. Meta Engineering
