Python Cosine Similarity
Vivek Sharma
Senior Consultant QA - Experience in API and UI Automation Testing, Python behave, Java, rest Assured & AI/ML Testing
Cosine similarity is used as a metric for measuring distance when the magnitude of the vectors** does not matter. Example: text data represented by word counts.
**The magnitude of a vector is its length or size; it says nothing about the vector's direction.
A commonly used approach to checking the similarity of documents is to count the words they have in common, but it has a flaw: as documents grow longer, the number of shared words tends to increase even when the documents discuss different topics.
Cosine similarity helps overcome this flaw.
Mathematically, cos θ = (x · y) / (‖x‖ ‖y‖)
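The formula above can be sketched directly with NumPy. The word-count vectors below are illustrative; note that scaling a vector (doubling every count) leaves the similarity unchanged, which is exactly why magnitude "does not matter":

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between x and y: (x . y) / (||x|| ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Word-count vectors for three short documents over a 3-word vocabulary
doc1 = np.array([2, 1, 0])
doc2 = np.array([4, 2, 0])  # same direction as doc1, twice the magnitude
doc3 = np.array([0, 0, 3])  # no words in common with doc1

print(cosine_sim(doc1, doc2))  # ~1.0: identical direction despite different length
print(cosine_sim(doc1, doc3))  # 0.0: orthogonal, nothing shared
```

A similarity of 1 means the documents have identical word proportions; 0 means they share no vocabulary at all.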
TF-IDF Algorithm
What is TF-IDF? Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).
Computers can understand data only as numerical values, so we vectorize the text to make it easier for the machine to work with.
Math of TF-IDF
TF-IDF = Term Frequency * Inverse Document Frequency
t - term (word)
d - document (a set of words)
n - number of documents in the corpus
corpus - the total collection of documents
TERM FREQUENCY
DOCUMENT FREQUENCY
INVERSE DOCUMENT FREQUENCY
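For completeness, the standard definitions behind these three headings are given below. This is one common variant; libraries such as scikit-learn use slightly smoothed versions (e.g., adding 1 inside the logarithm to avoid division by zero):

```latex
\mathrm{tf}(t,d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}
\qquad
\mathrm{df}(t) = \text{number of documents containing } t
\qquad
\mathrm{idf}(t) = \log\!\left(\frac{n}{\mathrm{df}(t)}\right)
\qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\times\mathrm{idf}(t)
```

A term that appears in every document gets idf(t) = log(n/n) = 0, so ubiquitous words contribute nothing, while rare terms are weighted up.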
Python Library:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
stopwords:
NLTK stop words are widely used words (such as "the," "a," "an," or "in") that carry little meaning on their own, so a search engine or text-processing pipeline is configured to disregard them while indexing and retrieving entries.
Tokenizer:
A tokenizer converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. NLTK's word_tokenize splits raw text into word-level tokens.
Lemmatizer :
Lemmatization is the process of converting a word to its base form.
The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming simply strips the last few characters, often producing non-words or incorrect meanings.
For example, lemmatization correctly reduces 'caring' to 'care', whereas a naive stemmer would just cut off the 'ing' and leave 'car'. 'Caring' -> Lemmatization -> 'Care'; 'Caring' -> Stemming -> 'Car'