Build a Semantic Search Engine Using Sentence Transformers
Introduction
In today's data-driven world, the ability to quickly and accurately search through vast amounts of text is crucial for many applications. Traditional keyword-based search methods do not capture the intent behind users' queries, which is where semantic search comes into play.
Semantic search returns results that are contextually relevant even when the exact keywords are not present.
For example, a semantic search for "shoes for jogging" would recognize "running sneakers" or "athletic footwear" as relevant results, even though those exact phrases do not appear in the query.
In this article, I will show how to implement a semantic search system, testing different pre-trained models from the sentence-transformers library and comparing symmetric and asymmetric search approaches on the same dataset.
The Dataset
To show how to implement a semantic search system using sentence transformers, I will use a Kaggle dataset consisting of 30,000 women's fashion products. However, the implementation outlined in this article can be easily adapted to any other dataset and domain. In fact, given my expertise in the iGaming industry, I've deployed semantic search to optimize the search engine within an online casino games lobby.
Implementing Semantic Search
Step 1: Setup and Data Preparation
First, we need to install the sentence-transformers library and import the necessary modules.
pip install sentence-transformers
Next, we load our sample dataset, perform a basic data analysis, and preprocess it to combine relevant columns into a single text representation for each document. The selected columns, concatenated together, form the corpus that we will encode.
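A minimal sketch of the loading and cleaning step, assuming pandas. The column names (`name`, `brand`, `description`) are placeholders, not the Kaggle dataset's actual schema; the inline frame stands in for `pd.read_csv(...)` on the downloaded file.

```python
import pandas as pd

# Stand-in for the Kaggle CSV; in practice: df = pd.read_csv("products.csv")
# Column names here are hypothetical -- adapt them to the actual dataset.
df = pd.DataFrame({
    "name": ["Women Running Sneakers", "Casual Leather Sandals", None],
    "brand": ["Acme", "Strider", "Nova"],
    "description": [
        "Lightweight mesh sneakers for jogging and gym wear.",
        "Flat open-toe sandals for everyday casual outfits.",
        "A product with a missing name.",
    ],
})

# Basic data analysis: shape and missing values per column
print(df.shape)
print(df.isna().sum())

# Drop rows that lack the essential text fields
df = df.dropna(subset=["name", "description"]).reset_index(drop=True)
```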
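One way to combine the columns into a single document per product, under the same hypothetical column names as above:

```python
import pandas as pd

# Tiny stand-in frame; column names are hypothetical placeholders.
df = pd.DataFrame({
    "name": ["Women Running Sneakers"],
    "brand": ["Acme"],
    "description": ["Lightweight mesh sneakers for jogging."],
})

# Each corpus entry is the concatenation of the selected columns.
corpus = (df["name"] + ". " + df["brand"] + ". " + df["description"]).tolist()
print(corpus[0])
```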
Step 2: Load Pre-Trained Model and Encoding the Corpus
For the implementation of our semantic search system, we will use and compare two models: paraphrase-MiniLM-L6-v2 and msmarco-MiniLM-L-6-v3.
The main difference between paraphrase-MiniLM-L6-v2 and msmarco-MiniLM-L-6-v3 lies in their training data and the specific tasks they are optimized for.
paraphrase-MiniLM-L6-v2 is trained specifically for paraphrase generation and understanding, while msmarco-MiniLM-L-6-v3 is optimized for information retrieval tasks using the MS MARCO (Microsoft Machine Reading Comprehension) dataset, which was created from real user search queries issued to the Bing search engine. Although the msmarco model focuses on information retrieval tasks like passage ranking and question answering, it can still be used in semantic search applications.
Both models are based on the MiniLM architecture, which is a smaller variant of the Transformer architecture, and both have six layers.
We encode the corpus using the model (paraphrase or msmarco) to generate sentence embeddings. These embeddings will be used to compute similarity scores during the search. Note that if you are not using GPU or TPU, this operation could be slow.
Step 3: Implementing the Semantic Search Function
For small corpora (up to about 1 million entries), we can compute the cosine similarity between the query and every entry in the corpus. To perform semantic search, we create a function that encodes the query, computes cosine similarity scores between the query and the corpus embeddings, and returns the top 10 matching results.
Step 4: Performing a Search
We can now finally define a query and use our semantic search function to find the most relevant products in our dataset. We then print the top results.
Comparison and Analysis of Semantic Search Results
paraphrase-MiniLM-L6-v2
msmarco-MiniLM-L-6-v3
For the paraphrase-MiniLM-L6-v2 model, scores are consistently higher, ranging from 0.7132 to 0.7296, suggesting that this model might be better at identifying closely related items. It predominantly returns synthetic women's casual sandals and sneakers, indicating strong semantic grouping but less variety.
With msmarco-MiniLM-L-6-v3, scores are slightly lower, ranging from 0.6303 to 0.6439, indicating a more flexible matching criterion. It returns a more diverse mix of item types, suggesting a broader interpretation of the query and a wider understanding of casual women's footwear.
Overall the results show that both models have their strengths.
If the goal is to find the most semantically similar items with high precision, paraphrase-MiniLM-L6-v2 would be preferable due to its higher similarity scores.
If the aim is to explore a broader range of products and provide users with diverse options, msmarco-MiniLM-L-6-v3 would be more suitable.
Fine-Tuning Transformer Models for Improved Semantic Search
One of the significant advantages of transformer models is that they can be fine-tuned for specific tasks with additional training data that closely matches the target application, thereby improving their performance and relevance to specific queries.
Conclusion
With the flexibility of the sentence-transformers library and the power of transformer models, we have built an efficient semantic search system that can understand a query and retrieve contextually relevant results from a dataset. This approach can be extended to various applications, such as product search, document retrieval, and more, providing a powerful tool for enhancing search capabilities in different domains.
Whether you're dealing with a small dataset or large-scale text corpora, these techniques can help you deliver better search experiences and unlock deeper insights from your data.