Build a Semantic Search Engine Using Sentence Transformers
Introduction
In today's data-driven world, the ability to quickly and accurately search through vast amounts of text is crucial for many applications. Traditional keyword-based search methods do not capture the intent behind users' queries, which is where semantic search comes into play.
Semantic search returns results that are contextually relevant even when the exact keywords are not present.
For example, a semantic search for "shoes for jogging" would recognize "running sneakers" or "athletic footwear" as relevant results, even though those exact phrases do not appear in the query.
In this article, I will show how to implement a semantic search system, testing different pre-trained models from the sentence-transformers library and comparing symmetric and asymmetric search approaches on the same dataset.
The Dataset
To show how to implement a semantic search system using sentence transformers, I will use a Kaggle dataset consisting of 30,000 women's fashion products. However, the implementation outlined in this article can be easily adapted to any other dataset and domain. In fact, given my expertise in the iGaming industry, I've deployed semantic search to optimize the search engine within an online casino games lobby.
Implementing Semantic Search
Step 1: Setup and Data Preparation
First, we need to install the sentence-transformers library and import the necessary modules.
pip install sentence-transformers
Next, we load our sample dataset, perform a basic data analysis, and preprocess it to combine relevant columns into a single text representation for each document. The selected columns, concatenated together, form the corpus that we will encode.
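A minimal sketch of the loading and cleaning step, assuming pandas. The column names (`name`, `brand`, `description`) are placeholders, not the Kaggle dataset's actual schema; the inline frame stands in for `pd.read_csv(...)` on the downloaded file.

```python
import pandas as pd

# Stand-in for the Kaggle CSV; in practice: df = pd.read_csv("products.csv")
# Column names here are hypothetical -- adapt them to the actual dataset.
df = pd.DataFrame({
    "name": ["Women Running Sneakers", "Casual Leather Sandals", None],
    "brand": ["Acme", "Strider", "Nova"],
    "description": [
        "Lightweight mesh sneakers for jogging and gym wear.",
        "Flat open-toe sandals for everyday casual outfits.",
        "A product with a missing name.",
    ],
})

# Basic data analysis: shape and missing values per column
print(df.shape)
print(df.isna().sum())

# Drop rows that lack the essential text fields
df = df.dropna(subset=["name", "description"]).reset_index(drop=True)
```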
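One way to combine the columns into a single document per product, under the same hypothetical column names as above:

```python
import pandas as pd

# Tiny stand-in frame; column names are hypothetical placeholders.
df = pd.DataFrame({
    "name": ["Women Running Sneakers"],
    "brand": ["Acme"],
    "description": ["Lightweight mesh sneakers for jogging."],
})

# Each corpus entry is the concatenation of the selected columns.
corpus = (df["name"] + ". " + df["brand"] + ". " + df["description"]).tolist()
print(corpus[0])
```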
Step 2: Load Pre-Trained Model and Encoding the Corpus
For the implementation of our semantic search system, we will use and compare two models: paraphrase-MiniLM-L6-v2 and msmarco-MiniLM-L-6-v3.
The main difference between paraphrase-MiniLM-L6-v2 and msmarco-MiniLM-L-6-v3 lies in their training data and the specific tasks they are optimized for.
paraphrase-MiniLM-L6-v2 is trained specifically for paraphrase generation and understanding, while msmarco-MiniLM-L-6-v3 is optimized for information retrieval tasks using the MS MARCO (Microsoft Machine Reading Comprehension) dataset, which was created from real user search queries issued to the Bing search engine. Although the msmarco model focuses on information retrieval tasks like passage ranking and question answering, it can still be used in semantic search applications.
Both models are based on the MiniLM architecture, which is a smaller variant of the Transformer architecture, and both have six layers.
We encode the corpus using the model (paraphrase or msmarco) to generate sentence embeddings. These embeddings will be used to compute similarity scores during the search. Note that if you are not using GPU or TPU, this operation could be slow.
Step 3: Implementing the Semantic Search Function
For small corpora (up to about 1 million entries), we can compute the cosine similarity between the query and every entry in the corpus. To perform semantic search, we create a function that encodes the query, computes cosine similarity scores between the query and the corpus embeddings, and returns the top 10 matching results.
Step 4: Performing a Search
We can now finally define a query and use our semantic search function to find the most relevant products in our dataset. We then print the top results.
Comparison and Analysis of Semantic Search Results
paraphrase-MiniLM-L6-v2
msmarco-MiniLM-L-6-v3
For the paraphrase-MiniLM-L6-v2 model, scores are consistently higher, ranging from 0.7132 to 0.7296, suggesting that this model might be better at identifying closely related items. It predominantly returns synthetic women's casual sandals and sneakers, indicating strong semantic grouping but less variety.
With msmarco-MiniLM-L-6-v3, scores are slightly lower, ranging from 0.6303 to 0.6439, indicating a more flexible matching criterion. It returns a more diverse mix of item types, suggesting a broader interpretation of the query and a wider understanding of casual women's footwear.
Overall the results show that both models have their strengths.
If the goal is to find the most semantically similar items with high precision, paraphrase-MiniLM-L6-v2 would be preferable due to its higher similarity scores.
If the aim is to explore a broader range of products and provide users with diverse options, msmarco-MiniLM-L-6-v3 would be more suitable.
Fine-Tuning Transformer Models for Improved Semantic Search
One of the significant advantages of transformer models is that they can be fine-tuned for specific tasks with additional training data that closely matches the target application, thereby improving their performance and relevance to specific queries.
Conclusion
With the flexibility of the sentence-transformers library and the power of transformer models, we have built an efficient semantic search system that can understand a query and retrieve contextually relevant results from a dataset. This approach can be extended to various applications, such as product search, document retrieval, and more, providing a powerful tool for enhancing search capabilities in different domains.
Whether you're dealing with a small dataset or large-scale text corpora, these techniques can help you deliver better search experiences and unlock deeper insights from your data.