Python Cosine Similarity

Cosine Similarity is used as a metric for measuring distance when the magnitude of the vectors** does not matter. Example: text data represented by counts of words.

**The magnitude of a vector refers to the length or size of the vector; a vector's direction is a separate property.

One commonly used approach to checking the similarity of documents is to count the number of words they have in common, but it has a flaw: as document size increases, the number of common words tends to increase even when the two documents are talking about different topics.

The Cosine Similarity helps to overcome this flaw.

Mathematically, cos θ = (x · y) / (‖x‖ ‖y‖)
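As a quick illustration (not from the original article), the formula above can be computed directly in plain Python. Note that two count vectors pointing in the same direction but with different magnitudes still score 1.0, which is exactly why cosine similarity suits documents of different lengths:

```python
import math

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (||x|| * ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# [2, 4, 6] is just [1, 2, 3] scaled by 2 -- same direction, so
# similarity is 1.0 (up to floating-point rounding):
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
```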

TF-IDF Algorithm

What is TF-IDF? Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).

A computer can understand data only as numerical values; for this reason we vectorize all the text so that the computer can work with it.

Math of TF-IDF

TF-IDF = Term Frequency * Inverse Document Frequency

t - term (a word)

d - document (a set of words)

N - total number of documents in the corpus

corpus - the whole collection of documents

TERM FREQUENCY

  • This measures the frequency of a word in a document.
  • When we vectorize the documents, we check each word's count. In the worst case, if the term does not exist in the document, that particular TF value will be 0; at the other extreme, if all the words in the document are the same, it will be 1. The normalized TF value therefore falls in the range [0, 1], with 0 and 1 inclusive.
  • Term frequency (TF) is how often a word appears in a document, divided by how many words there are.
  • TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
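The TF formula above can be sketched as a small Python function (a minimal illustration assuming whitespace tokenization; the names are hypothetical, not from the article):

```python
def term_frequency(term, document):
    """TF(t) = (count of t in the document) / (total number of terms)."""
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

doc = "the cat sat on the mat"
print(term_frequency("the", doc))  # 2 occurrences out of 6 words
print(term_frequency("dog", doc))  # absent term -> 0.0
```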

DOCUMENT FREQUENCY

  • This measures how common a term is across the whole corpus, and it is very similar to TF. The only difference is that TF counts occurrences of a term t within a single document d, whereas DF counts the number of documents in the set N that contain the term t.
  • DF is the number of documents in which the word is present.

INVERSE DOCUMENT FREQUENCY

  • Term frequency measures how common a word is within a document; inverse document frequency (IDF) measures how unique or rare a word is across the corpus.
  • IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
  • TF-IDF weight is the product of term frequency and inverse document frequency.
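Combining the TF and IDF formulas above, here is a hedged, minimal sketch over a toy three-document corpus (whitespace tokenization assumed; it also assumes the queried term occurs in at least one document, otherwise DF would be 0):

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def idf(term, docs):
    """IDF(t) = log_e(N / DF(t)), where DF(t) = number of docs containing t."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """TF-IDF weight = TF(t) * IDF(t)."""
    words = doc.split()
    tf = words.count(term) / len(words)
    return tf * idf(term, docs)

# "the" appears in 2 of 3 documents; "mat" appears in only 1.
# The rarer word "mat" therefore gets the higher TF-IDF weight:
print(tf_idf("the", corpus[0], corpus))
print(tf_idf("mat", corpus[0], corpus))
```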

Python Library:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity        

stopwords:

NLTK stop words are widely used words (such as "the," "a," "an," or "in") that a search engine or text-processing pipeline has been configured to disregard while indexing and retrieving entries.

Tokenizer:

A tokenizer converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. NLTK's word_tokenize performs this splitting on plain text.

Lemmatizer :

Lemmatization is the process of converting a word to its base form.

The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, lemmatization correctly reduces 'caring' to its base form 'care', whereas stemming would cut off the 'ing' part and produce 'car'.

'Caring' -> Lemmatization -> 'Care'
'Caring' -> Stemming -> 'Car'
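Putting the pieces together, here is a minimal document-similarity sketch using only the scikit-learn imports listed above. TfidfVectorizer applies its own tokenization and can drop English stop words via the stop_words parameter; the NLTK lemmatization step is omitted here for brevity, and the example documents are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning models learn patterns from data",
    "Deep learning is a branch of machine learning",
    "I enjoy hiking in the mountains",
]

# Vectorize the corpus into TF-IDF weights, ignoring English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all documents (a 3 x 3 matrix;
# each document has similarity 1.0 with itself on the diagonal).
similarity = cosine_similarity(tfidf_matrix)
print(similarity.round(2))

# The two machine-learning documents score higher with each other
# than either does with the hiking document.
```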
