Python Cosine Similarity
Vivek Sharma
Senior Consultant QA - Experience in API and UI Automation Testing, Python behave, Java, rest Assured & AI/ML Testing
Cosine similarity is used as a metric for measuring distance when the magnitude of the vectors** does not matter. Example: text data represented by word counts.
**The magnitude of a vector is its length or size; it says nothing about the vector's direction.
A commonly used approach to checking the similarity of documents is to count the words they have in common, but it has a flaw: as documents grow longer, the number of shared words tends to increase even when the documents discuss different topics.
Cosine similarity helps overcome this flaw.
Mathematically, cos θ = (x · y) / (‖x‖ ‖y‖)
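The formula above can be sketched directly with NumPy. The word-count vectors below are illustrative; note that scaling a vector (doubling every count) leaves the similarity unchanged, which is exactly why magnitude "does not matter":

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between x and y: (x . y) / (||x|| ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Word-count vectors for three short documents over a 3-word vocabulary
doc1 = np.array([2, 1, 0])
doc2 = np.array([4, 2, 0])  # same direction as doc1, twice the magnitude
doc3 = np.array([0, 0, 3])  # no words in common with doc1

print(cosine_sim(doc1, doc2))  # ~1.0: identical direction despite different length
print(cosine_sim(doc1, doc3))  # 0.0: orthogonal, nothing shared
```

A similarity of 1 means the documents have identical word proportions; 0 means they share no vocabulary at all.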
TF-IDF Algorithm
What is TF-IDF? Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).
Computers can understand data only as numerical values, so we vectorize the text to make it easier for the machine to work with.
Math of TF-IDF
TF-IDF = Term Frequency * Inverse Document Frequency
t - term (word)
d - document (a set of words)
n - number of documents in the corpus
corpus - the total collection of documents
TERM FREQUENCY
DOCUMENT FREQUENCY
INVERSE DOCUMENT FREQUENCY
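For completeness, the standard definitions behind these three headings are given below. This is one common variant; libraries such as scikit-learn use slightly smoothed versions (e.g., adding 1 inside the logarithm to avoid division by zero):

```latex
\mathrm{tf}(t,d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}
\qquad
\mathrm{df}(t) = \text{number of documents containing } t
\qquad
\mathrm{idf}(t) = \log\!\left(\frac{n}{\mathrm{df}(t)}\right)
\qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\times\mathrm{idf}(t)
```

A term that appears in every document gets idf(t) = log(n/n) = 0, so ubiquitous words contribute nothing, while rare terms are weighted up.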
Python Library:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
stopwords:
NLTK stop words are widely used words (such as "the," "a," "an," or "in") that carry little meaning on their own, so a search engine or text-processing pipeline is configured to disregard them while indexing and retrieving entries.
Tokenizer:
A tokenizer converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. NLTK's word_tokenize splits raw text into word-level tokens.
Lemmatizer :
Lemmatization is the process of converting a word to its base form.
The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming simply strips the last few characters, often producing non-words or incorrect meanings.
For example, lemmatization correctly reduces 'caring' to 'care', whereas a naive stemmer would just cut off the 'ing' and leave 'car'. 'Caring' -> Lemmatization -> 'Care'; 'Caring' -> Stemming -> 'Car'