TF-IDF (Term Frequency - Inverse Document Frequency) in NLP.

TF-IDF is a statistical measure used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (or more formally a corpus). In a nutshell, the idea is that if a word appears frequently in one document but not in many other documents, it is probably very relevant to that specific document.

This article involves a fair amount of mathematical formulas and code, so I recommend going through it slowly.

Before understanding TF-IDF, let's look at Text Feature Extraction and some terminologies related to it.

Text Feature Extraction

Text feature extraction in Natural Language Processing (NLP) refers to the process of transforming raw text data into numerical representations or features that machine learning models can understand and process. This is a crucial step in NLP pipelines as it allows the algorithms to learn from and make predictions based on text data.

Count Vectorization

Count Vectorization (also known as Count Vectors) is a method of converting text into numerical features by counting the occurrences of each word in the text. This method captures the frequency of each word within a document but does not account for word order or semantics.

Tokenization

The text is split into individual words or tokens. For example, "The cat sat on the mat" would be tokenized into ["The", "cat", "sat", "on", "the", "mat"]. This is the simplest form of tokenization.
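
As a quick illustration, the simplest possible tokenizer just splits on whitespace (a toy sketch; the from-scratch implementation later in this article uses spaCy's tokenizer, which also handles punctuation):

# a toy whitespace tokenizer: split the sentence on spaces
text = "The cat sat on the mat"
tokens = text.split()

print(tokens)
# OUTPUT: ['The', 'cat', 'sat', 'on', 'the', 'mat']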

Vocabulary Creation

A vocabulary is built from all unique tokens in the corpus. For instance, if the corpus consists of multiple documents, the vocabulary might include all unique words across all documents.

Vectorization

Each document is represented as a vector where each dimension corresponds to a word in the vocabulary. The value in each dimension is the count of occurrences of that word in the document. For example, if the vocabulary is ["The", "cat", "sat", "on", "the", "mat"], the document "The cat sat on the mat" is represented as [1, 1, 1, 1, 1, 1]: "The" and "the" are kept as separate tokens here, so each gets a count of 1 (if the text were lowercased first, the single token "the" would get a count of 2).

For a small corpus with two documents:

Document 1: "The cat sat on the mat"
Document 2: "The cat is on the mat"

The vocabulary might be ["The", "cat", "sat", "on", "the", "mat", "is"]. Notice that "The" and "the" are considered different tokens due to case sensitivity.

The Count Vectors then look like:

Document 1: [1, 1, 1, 1, 1, 1, 0]
Document 2: [1, 1, 0, 1, 1, 1, 1]
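
The snippet below is a minimal sketch of these two steps for the documents above: it builds the vocabulary in order of first appearance and then produces a count vector for each document (case-sensitive, matching the example):

docs = ["The cat sat on the mat", "The cat is on the mat"]

# build the vocabulary: every unique token gets the next free index
vocab = {}
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab[token] = len(vocab)

print(list(vocab))
# OUTPUT: ['The', 'cat', 'sat', 'on', 'the', 'mat', 'is']

# turn each document into a count vector over the vocabulary
vectors = []
for doc in docs:
    vec = [0] * len(vocab)
    for token in doc.split():
        vec[vocab[token]] += 1
    vectors.append(vec)

print(vectors)
# OUTPUT: [[1, 1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 1, 1, 1]]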

Document-Term Matrix

The Document-Term Matrix (DTM) is a tabular representation of the text data where rows represent documents and columns represent terms (words). Each cell in the matrix indicates the count of a specific term in a specific document.

The DTM is essentially a sparse matrix because most of its values are zero (especially in large vocabularies). Using the same documents as above:

              The  cat  sat  on  the  mat  is
Document 1     1    1    1    1   1    1    0
Document 2     1    1    0    1   1    1    1

Document-Term Matrix (each column represents a word from the vocabulary) (Table 1.1)
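
The same matrix can be built with scikit-learn's CountVectorizer in a couple of lines. A minimal sketch (lowercase=False keeps "The" and "the" distinct to match the example above; note that scikit-learn orders the vocabulary alphabetically rather than by first appearance):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The cat is on the mat"]

# fit the vectorizer and build the Document-Term Matrix (a sparse matrix of shape (2, 7))
vectorizer = CountVectorizer(lowercase=False)
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the vocabulary (alphabetical order)
print(dtm.toarray())                        # the dense Document-Term Matrix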

Now let's switch back to our main topic, which is TF-IDF.

Term Frequency (TF)

Term Frequency measures how frequently a term occurs in a document. The simplest form of TF is:

TF(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}

Here "t" is the term or the word and "d" is the document. TF provides a basic measure of how important a term is within a specific document, but it doesn’t account for the term’s importance across the entire corpus.

Inverse Document Frequency (IDF)

Inverse Document Frequency measures the importance of a term across the whole corpus. It helps to adjust for the fact that some words are very common and may not be very informative. The IDF of a term is calculated as:

IDF(t) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)

Here "|D|" is the total number of documents in the corpus and "t" is the count of documents that contain the term.

TF-IDF Calculation

The TF-IDF score for a term in a document is the product of its TF and IDF values:

\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)

This score reflects both the term's importance in the specific document and its rarity across the corpus.

Now let's go through an example to better understand the TF-IDF formula. Imagine we have three documents:

  • Document 1: "The cat sat on the mat"
  • Document 2: "The cat is on the mat"
  • Document 3: "The dog sat on the log"

In order to calculate the term frequency for the word "cat" in document 1, we apply the TF formula: "cat" occurs once and the document contains 6 words in total, so TF("cat", d1) = 1/6 ≈ 0.167.

For the term "cat" in document 2

The word "cat" occurs one time in the second document and 6 is the total number of words

In order to calculate the inverse document frequency of the word "cat" we use the IDF formula: the corpus has 3 documents in total and "cat" appears in 2 of them, so IDF("cat") = log(3/2) ≈ 0.176 (using log base 10).

Now let's calculate the TF-IDF for "cat" in document 1: multiplying the results from the previous calculations, TF-IDF("cat", d1) = 0.167 × 0.176 ≈ 0.029.

For "cat" in the second document

Using the results from previous calculations, we get 0.029 as the result

In this example, the TF-IDF score for "cat" in both documents is the same. If "cat" appeared in fewer documents, its IDF would be higher, and thus its TF-IDF score would be higher as well, reflecting its greater importance in those specific documents.
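
As a quick sanity check, here is a small sketch that reproduces these numbers in Python. It uses base-10 logarithms, which is what the 0.029 above assumes; the from-scratch implementation later in the article uses the natural log instead, which changes the scale but not the ranking of words:

import math

docs = [
    "the cat sat on the mat",   # Document 1
    "the cat is on the mat",    # Document 2
    "the dog sat on the log",   # Document 3
]
tokenized_docs = [doc.split() for doc in docs]

def tf(term, tokens):
    # term frequency: occurrences of the term divided by the document length
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    # inverse document frequency: log of (total documents / documents containing the term)
    doc_count = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log10(len(tokenized_docs) / doc_count)

def tf_idf(term, tokens, tokenized_docs):
    return tf(term, tokens) * idf(term, tokenized_docs)

print(round(tf_idf("cat", tokenized_docs[0], tokenized_docs), 3))   # OUTPUT: 0.029
print(round(tf_idf("cat", tokenized_docs[1], tokenized_docs), 3))   # OUTPUT: 0.029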

Fortunately, all of these calculations can be performed with scikit-learn, but we will also implement it from scratch. The companion notebook is available on GitHub (the link is shared at the bottom of the article).

TF-IDF implementation from scratch

# let's import the necessary libraries
import numpy as np
import pandas as pd
import spacy

# we only want the spaCy tokenizer, so disable everything else
nlp = spacy.load(
    "en_core_web_md",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
)

print(nlp.pipe_names)
# OUTPUT: ['tok2vec']

# let's define a custom function that will tokenize our text and remove punctuation
def tokenize_and_remove_punkt(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_punct]
    return tokens

The dataset used here is the Medium Articles dataset from Kaggle.

# let's load in the dataset and take a sneak peek (I have already downloaded the data)
df = pd.read_csv("./medium_articles.csv")
df.head(5)        

Recall that before doing anything, we need to build the vocabulary dictionary, which contains all the unique words from the corpus. We also need to convert the documents to numerical form.

# create vocab and convert the docs to numerical form
vocab = {}
idx = 0
tokenized_docs = []

for doc in df["text"]:
    tokens = tokenize_and_remove_punkt(doc.lower())
    doc_tokens = []
    
    for token in tokens:
        if token not in vocab:
            vocab[token] = idx
            idx += 1
        
        doc_tokens.append(vocab[token])
    
    tokenized_docs.append(doc_tokens)

# reverse mapping (index to word)
words = list(vocab.keys())

Recall that the term frequency matrix needs to store the frequency of every word in every document, hence we need to create a matrix where the number of rows equals the number of documents and the number of columns equals the size of the vocabulary. An example illustration is shown below for better understanding.

Illustration of a term frequency matrix (Table 1.2)
# N = no of documents, V = size of vocabulary
N = len(df["text"])
V = len(vocab)

# term frequency matrix (dense)
term_freq = np.zeros((N, V))

# fill the term frequency matrix with the occurrence of words
for doc_idx, tokenized_doc in enumerate(tokenized_docs):
    for token_idx in tokenized_doc:
        term_freq[doc_idx, token_idx] += 1        

Now that we have calculated the term frequency matrix, we need to calculate the document frequency vector, followed by IDF.

# calculate IDF (inverse document frequency)
doc_freq = np.sum(term_freq > 0, axis=0)

# numpy will automatically broadcast i.e. divide N by each doc_freq value
idf = np.log(N / doc_freq)

# each document row will be multiplied with the idf vector
tf_idf = term_freq * idf        

The line doc_freq = np.sum(term_freq > 0, axis=0) counts, for each word, the number of documents it appears in. For example, in the table above, "this" appears in all the documents, hence its document frequency is 3.

Now that we have calculated everything, let's test this out by randomly choosing an article from our dataset and getting its top ten words.

# let's test this out
random_idx = np.random.choice(N)
row = df.iloc[random_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

scores = tf_idf[random_idx]
top_ten = (-scores).argsort()[:10]

print("\n")
print("Top ten words: ", [words[idx] for idx in top_ten])        
The top ten words are not bad; we are getting some good results.

The calculations above were done on a dense matrix, which may not be memory efficient. The companion notebook also includes an implementation using a sparse matrix from SciPy.

from collections import defaultdict
from scipy.sparse import csr_matrix        
# for scipy the method is a little different
data = []
rows = []
cols = []

for doc_idx, tokenized_doc in enumerate(tokenized_docs):
    term_counts = defaultdict(int)
    
    for token_idx in tokenized_doc:
        term_counts[token_idx] += 1
    
    for token_idx, count in term_counts.items():
        data.append(count)
        rows.append(doc_idx)
        cols.append(token_idx)        
sparse_term_freq = csr_matrix((data, (rows, cols)), shape=(N, V))

binary_term_freq = (sparse_term_freq > 0).astype(int)

# sum along the cols and convert to 1D array by flattening it
document_freq = np.array(binary_term_freq.sum(axis=0)).flatten()

idf = np.log(N / document_freq)

# we use the .multiply method instead of using "*" operator directly
tf_idf = sparse_term_freq.multiply(idf)        
rand_idx = np.random.choice(N)
row = df.iloc[rand_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

scores = tf_idf.getrow(rand_idx).toarray().flatten()
top_five = (-scores).argsort()[:5]

print("\n")
print("Top five words: ", [words[idx] for idx in top_five])        
We get the same results (by the way, Kurzweil is a company that produces music systems).

Now that we have seen how the formulas work, let's look at the TfidfVectorizer class from scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.9,
    min_df=0.1,
    max_features=2000
)

# assuming you have already loaded the dataset given in the notebook
tfidf_mat = tfidf_vectorizer.fit_transform(df["text"])

# reverse mapping (idx to word)
feature_names = tfidf_vectorizer.get_feature_names_out()

rand_idx = np.random.choice(tfidf_mat.shape[0])
row = df.iloc[rand_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

tfidf_vector = tfidf_mat.getrow(rand_idx).toarray().flatten()

top_indices = (-tfidf_vector).argsort()[:10]

top_words = [feature_names[i] for i in top_indices]

print("\n")
print("Top ten words: ", top_words)        
Top ten words for a randomly selected article

And this marks the end of the article. Don't forget to check out the companion notebook, as it contains the whole code in a single place.
