Testing LLM Query Outputs with Cosine Similarity
Sagar Shroff
Sr Software Development Engineer In Test - Selenium | Cucumber | Karate | Cypress | Javascript | Java | AWS
Introduction
A few weeks ago, I was pondering how to effectively test LLM-based application features, since their output is non-deterministic. As more applications tightly integrate AI into their product suites, traditional testing techniques such as "input A, expect output B" no longer work. Thankfully, through one of my newsletters, I came across a very well articulated article by Amruta Pande which introduced me to the metamorphic testing technique.
In this article, I am going to build on that technique, specifically by sharing my experiments with using cosine similarity as a method to quantitatively compare outputs and validate whether a metamorphic relation holds true.
I will start by explaining the problem statement, then describe how a technique such as metamorphic testing helps validate non-deterministic output, and finally walk through my experiment of using cosine similarity as a tool, along with some limitations I observed.
Why is LLM output non-deterministic?
Most LLMs are built on the GPT (Generative Pre-trained Transformer) architecture, which uses a decoder-based design optimized to generate new text. Because the decoder samples each next token from a probability distribution (controlled by parameters such as temperature and top-p), the same prompt can produce different outputs on different runs. This is different from encoder-based models like BERT, which are great at understanding and processing text rather than generating new text.
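To illustrate the sampling point, here is a toy sketch of my own (not from any real model): the next token is drawn from a softmax distribution over scores, so repeated runs of the same prompt can diverge.

```python
import numpy as np

rng = np.random.default_rng()
vocab = ["good", "great", "fine", "okay"]  # tiny hypothetical vocabulary
logits = np.array([2.0, 1.5, 1.0, 0.5])   # hypothetical model scores for the next token

def sample_next_token(temperature: float = 1.0) -> str:
    # Softmax with temperature: higher values flatten the distribution,
    # making less likely tokens more probable.
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()
    return str(rng.choice(vocab, p=probs))

# Five "runs" over the same scores can pick different tokens:
print([sample_next_token() for _ in range(5)])
```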
To learn more about the history and the different architectures, I would highly recommend this amazing video series on YouTube by Dr Raj Abhijit Dandekar. The second lesson in Hands-On Large Language Models nicely explains the history and the different open- and closed-source architectures.
What is Metamorphic Testing?
The article I referenced in the intro does a great job of explaining the different techniques around this, but I will try to mention a few key ideas.
Metamorphic testing uses metamorphic relations (MRs) to define how the output should change, or remain similar, when the input is changed. Put differently, it tests whether an MR holds true between pairs of inputs and outputs.
For example, if I ask the LLM "What are the drawbacks of eating outside?", it would respond with a list of drawbacks. Now if I ask a similar question, "What negative aspects come with outdoor dining?", the answer should still be semantically the same.
But what if I ask "What are the benefits of eating outside?" The LLM should then list benefits such as stress reduction, improved mood, etc. As you can see, this second question flips the sentiment.
Metamorphic relations are useful here because we can use them as a tool for testing non-deterministic outputs like the ones above, gauging whether the expected relation holds between the answers given by the LLM.
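Expressed as test data, the pairs above might look like this (a hypothetical structure of my own; the checking logic comes later in the article):

```python
# Each case pairs two prompts with the relation we expect between the
# LLM's answers: "similar", "different", or "opposite".
mr_cases = [
    ("What are the drawbacks of eating outside?",
     "What negative aspects come with outdoor dining?",
     "similar"),
    ("What are the drawbacks of eating outside?",
     "What are the benefits of eating outside?",
     "opposite"),
]
```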
My experiment: using cosine similarity to test MRs programmatically
As a next step, I wanted to find tools we could use to programmatically test whether an MR holds true. There were multiple options, one of them being using an LLM itself to judge the output, but I wanted to explore other approaches.
Earlier, I briefly mentioned that encoder-based architectures are good at understanding and processing text. So I thought of using an encoder-based model such as a SentenceTransformer, together with cosine similarity, as a tool to quantitatively measure whether two texts are the same, different, or opposite.
My idea was that we can use cosine similarity to measure the angle between two vectors in vector space, thereby quantifying the similarity between any two textual outputs from the LLM.
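For reference, cosine similarity is just the dot product of the two vectors divided by the product of their magnitudes. A minimal NumPy version of the formula:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0, same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0, orthogonal
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0, opposite
```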
Cosine similarity, which computes the cosine of the angle between two vectors, returns a value within the range [-1, 1]. Depending on the value returned, it can mean one of the following:
- A value close to 1: the two texts are semantically similar.
- A value around 0: the two texts are unrelated.
- A value close to -1: the two texts are semantically opposite.
This way, we can test the LLM with different pairs of prompts and check whether the metamorphic relation between their outputs is similar, different, or opposite.
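Here is a minimal sketch of that flow. The model name (all-MiniLM-L6-v2) and the thresholds are my illustrative choices, not prescriptions, and the two "answers" stand in for real LLM outputs:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(text_a: str, text_b: str) -> float:
    """Embed both texts and return the cosine similarity of the embeddings."""
    emb = model.encode([text_a, text_b])
    return util.cos_sim(emb[0], emb[1]).item()

def classify_relation(score: float) -> str:
    # Thresholds are illustrative; tune them for your embedding model.
    if score >= 0.7:
        return "similar"
    if score <= -0.3:
        return "opposite"
    return "different"

# MR check: paraphrased prompts should yield semantically similar answers.
answer_1 = "Eating outside can be costly and less hygienic."  # hypothetical LLM output
answer_2 = "Outdoor dining may be expensive and unhygienic."  # hypothetical LLM output
assert classify_relation(similarity(answer_1, answer_2)) == "similar"
```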
Some more context about Vectors
Machine learning models do not directly understand textual data; for them, texts or words are represented as numeric vectors in a high-dimensional vector space. A vector space can have 1, 2, or many dimensions, based on the number of features captured from the sentence. Each dimension in the vector space represents a different feature of the text.
For example, take the sentence "Sagar is a Senior SDET proficient in Java and Python." One dimension might capture the seniority level indicated by "Senior SDET," another could reflect technical skills like proficiency in "Java" and "Python," and another might represent the overall professional context. Further dimensions might encode information about emotion, context, and so on.
All sentences are represented as numeric vectors in this vector space, and my solution simply measures how close two vectors are to each other in that space to determine their similarity.
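To see this concretely, encoding a sentence with the model from the earlier sketch yields one fixed-length numeric vector (384 dimensions for all-MiniLM-L6-v2):

```python
vec = model.encode("Sagar is a Senior SDET proficient in Java and Python.")
print(vec.shape)  # (384,): one number per learned feature dimension
print(vec[:5])    # the first few components of the embedding
```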
Implementation Example
I have linked my Jupyter notebook here with the outputs.
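Since the notebook may not render here, this is roughly what each comparison cell does, reconstructed from the printed outputs shown under Limitations (the exact notebook code may differ):

```python
def compare_outputs(output_1: str, output_2: str) -> None:
    # similarity() is the SentenceTransformer helper from the earlier sketch.
    score = similarity(output_1, output_2)
    print(f"Output for query1: {output_1}")
    print(f"Output for query2: {output_2}")
    print(f"Cosine similarity between the outputs: {score}")
    if score >= 0.7:
        print("The texts are highly similar.")
    elif score >= 0.3:
        print("The texts are not closely related.")
    else:
        print("The texts are unrelated or opposite.")

compare_outputs(
    "She is really excited about her promotion",
    "She is not really excited about her promotion",
)
```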
Limitations
While I was experimenting, I observed the following limitations.
1. Negation is poorly captured. A sentence and its negation can still score as highly similar:

Output for query1: She is really excited about her promotion
Output for query2: She is not really excited about her promotion
Cosine similarity between the outputs: 0.9077736139297485
The texts are highly similar.
2. Sarcasm and tone are not well captured. The first sentence below is sarcastic (negative) while the second is genuinely positive, yet the score lands in an ambiguous middle range rather than signalling opposite sentiment:

Output for query1: Oh great, another rainy day!!!
Output for query2: Oh great, another day at the beach, I am so lucky
Cosine similarity between the outputs: 0.5919630527496338
The texts are not closely related.
Comments

QA Instructor @ 1 Million Students Udemy | 25+ Best Selling Test Automation Courses | EventSpeaker-QASummit.org | Mentor | Founder- RahulShettyAcademy -EdTech QA Platform | Proud Tester | ex-Microsoft (3 weeks ago)
Good read. I think you should also look at LLM eval frameworks like RAGAS, which can do this with straightforward wrapper methods.

SDET (3 weeks ago)
We have an evals framework to do this, where you send your prompt, the response from the LLM, and the eval instructions. You can test your response on a few parameters such as instruction following, completeness, and coherence: https://github.com/openai/evals. I am assuming the eval might be using coherence similarly internally.

Test Automation Consultant | Avid Tech YouTuber | Udemy Trainer (3 weeks ago)
Well, most vector databases support cosine similarity, e.g. vector_store.similarity_search_by_vector(query_embedding, k=2). So if you are retrieving data from a vector store that was returned by the large language model, this method will do what you are pointing out.

Software Engineering Manager Dev Infra (Platform) and Tools at Workday Analytics (3 weeks ago)
Insightful