TF-IDF (Term Frequency - Inverse Document Frequency) in NLP.
TF-IDF is a statistical measure used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (or, more formally, a corpus). In a nutshell, the idea is that if a word appears frequently in one document but rarely in the other documents, it is probably very relevant to that specific document.
This article involves both mathematical formulas and code, so I recommend going through it slowly.
Before understanding TF-IDF, let's look at Text Feature Extraction and some terminologies related to it.
Text Feature Extraction
Text feature extraction in Natural Language Processing (NLP) refers to the process of transforming raw text data into numerical representations or features that machine learning models can understand and process. This is a crucial step in NLP pipelines as it allows the algorithms to learn from and make predictions based on text data.
Count Vectorization
Count Vectorization (also known as Count Vectors) is a method of converting text into numerical features by counting the occurrences of each word in the text. This method captures the frequency of each word within a document but does not account for word order or semantics.
Tokenization
The text is split into individual words or tokens. For example, "The cat sat on the mat" would be tokenized into ["The", "cat", "sat", "on", "the", "mat"]. This is the simplest form of tokenization.
Vocabulary Creation
A vocabulary is built from all unique tokens in the corpus. For instance, if the corpus consists of multiple documents, the vocabulary might include all unique words across all documents.
Vectorization
Each document is represented as a vector where each dimension corresponds to a word in the vocabulary. The value in each dimension is the count of occurrences of that word in the document. For example, if the vocabulary is ["The", "cat", "sat", "on", "the", "mat"], the document "The cat sat on the mat" would be represented as [1, 1, 1, 1, 1, 1], since each vocabulary word occurs exactly once in the document ("The" and "the" are separate entries). If we lowercased the text first, "the" would instead get a count of 2 because it would cover both occurrences.
For a small corpus with two documents:
Document 1: "The cat sat on the mat"
Document 2: "The cat is on the mat"
The vocabulary might be ["The", "cat", "sat", "on", "the", "mat", "is"]. Notice that "The" and "the" are considered different due to case sensitivity.
The Count Vectors might look like:
Document 1: [1, 1, 1, 1, 1, 1, 0]
Document 2: [1, 1, 0, 1, 1, 1, 1]
Document-Term Matrix
The Document-Term Matrix (DTM) is a tabular representation of the text data where rows represent documents and columns represent terms (words). Each cell in the matrix indicates the count of a specific term in a specific document.
The DTM is essentially a sparse matrix because most of its values are zero (especially with large vocabularies). Example, using the same documents as above:

             The  cat  sat  on  the  mat  is
Document 1     1    1    1   1    1    1   0
Document 2     1    1    0   1    1    1   1
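As a quick sanity check, the same document-term matrix can be produced with scikit-learn's CountVectorizer. This is only a minimal sketch; note that CountVectorizer lowercases text by default, so lowercase=False is passed here to keep "The" and "the" distinct, and the columns follow its (alphabetically sorted) vocabulary rather than the order above.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
docs = ["The cat sat on the mat", "The cat is on the mat"]
# keep "The" and "the" as separate tokens, as in the example above
vectorizer = CountVectorizer(lowercase=False)
counts = vectorizer.fit_transform(docs)
# view the counts as a document-term matrix (rows = documents, columns = terms)
dtm = pd.DataFrame(
    counts.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=["Document 1", "Document 2"],
)
print(dtm)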
Now let's switch back to our main topic, which is TF-IDF.
Term Frequency (TF)
Term Frequency measures how frequently a term occurs in a document. The simplest form of TF is:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)

Here "t" is the term (word) and "d" is the document. For example, if "cat" appears 3 times in a 100-word document, its TF is 3/100 = 0.03. TF provides a basic measure of how important a term is within a specific document, but it doesn't account for the term's importance across the entire corpus.
Inverse Document Frequency (IDF)
Inverse Document Frequency measures the importance of a term across the whole corpus. It helps to adjust for the fact that some words are very common and may not be very informative. The IDF of a term is calculated as:

IDF(t) = log( |D| / (number of documents in D that contain t) )

Here "|D|" is the total number of documents in the corpus and the denominator is the number of documents that contain the term "t". For example, with 3 documents of which 2 contain the term, IDF = log(3/2) ≈ 0.405, while a term that appears in every document gets an IDF of log(1) = 0.
TF-IDF Calculation
The TF-IDF score for a term in a document is the product of its TF and IDF values:

TF-IDF(t, d) = TF(t, d) × IDF(t)
This score reflects both the term's importance in the specific document and its rarity across the corpus.
Now let's go through an example to better understand the TF-IDF formula. Imagine we have three documents. To calculate the term frequency of the word "cat" in document 1, we apply the TF formula: the number of times "cat" appears in document 1 divided by the total number of terms in document 1.
For the term "cat" in document 2
In order to calculate the Inverse document frequency of the word "cat" we will use the IDF formula.
Now let's calculate the TF-IDF for "cat" in document 1:
For "cat" in the second document
In this example, the TF-IDF score for "cat" in both documents is the same. If "cat" appeared in fewer documents, its IDF would be higher, and thus its TF-IDF score would be higher as well, reflecting its greater importance in those specific documents.
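Here is a minimal sketch of this calculation in Python, using an illustrative three-document corpus (the documents are made up purely for demonstration):
import math
# illustrative corpus (made up for demonstration)
docs = [
    "the cat sat on the mat".split(),
    "the cat is on the mat".split(),
    "the dog sat on the log".split(),
]
def tf(term, doc):
    # term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)
def idf(term, docs):
    # inverse document frequency: log of total docs over docs containing the term
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)
for i, doc in enumerate(docs, start=1):
    score = tf("cat", doc) * idf("cat", docs)
    print(f"TF-IDF of 'cat' in document {i}: {score:.4f}")
With these documents, "cat" gets the same score in documents 1 and 2 (both contain it once and have the same length) and a score of 0 in document 3, which does not contain it.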
Fortunately, all of these calculations can be performed with scikit-learn, but we will also implement them from scratch. The companion notebook is available on GitHub (the link is shared at the bottom of the article).
TF-IDF implementation from scratch
# let's import the necessary libraries
import numpy as np
import pandas as pd
import spacy
# we only want the spacy tokenizer, so disable everything else
nlp = spacy.load(
    "en_core_web_md",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
)
print(nlp.pipe_names)
# OUTPUT: ['tok2vec']
# let's define a custom function that will tokenize our text and remove punctuation
def tokenize_and_remove_punkt(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_punct]
    return tokens
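A quick check that the tokenizer behaves as expected (the sentence is just an illustration):
# punctuation is dropped, everything else is kept as-is
print(tokenize_and_remove_punkt("The cat sat on the mat!"))
# OUTPUT (expected): ['The', 'cat', 'sat', 'on', 'the', 'mat']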
The dataset is the Medium Articles dataset from Kaggle.
# let's load in the dataset and take a sneak peek (I have already downloaded the data)
df = pd.read_csv("./medium_articles.csv")
df.head(5)
Recall that, before doing anything else, we need to build the vocabulary dictionary, which contains all the unique words from the corpus. We also need to convert the documents to numerical form.
# create vocab and convert the docs to numerical form
vocab = {}
idx = 0
tokenized_docs = []
for doc in df["text"]:
    tokens = tokenize_and_remove_punkt(doc.lower())
    doc_tokens = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = idx
            idx += 1
        doc_tokens.append(vocab[token])
    tokenized_docs.append(doc_tokens)
# reverse mapping (index to word)
words = list(vocab.keys())
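As an optional sanity check (assuming Python 3.7+, where dicts preserve insertion order), we can confirm that the reverse mapping lines up with the indices stored in vocab:
# sanity check: the index stored in vocab should point back to the same word
print("Vocabulary size:", len(vocab))
print("First five words:", words[:5])
assert all(vocab[word] == idx for idx, word in enumerate(words))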
Recall that the term frequency matrix needs to store how often each word occurs in every document, so we create a matrix where the number of rows equals the number of documents and the number of columns equals the size of the vocabulary (for simplicity we use the raw counts as the term frequency here).
# N = no of documents, V = size of vocabulary
N = len(df["text"])
V = len(vocab)
# term frequency matrix (dense)
term_freq = np.zeros((N, V))
# fill the term frequency matrix with the occurrence of words
for doc_idx, tokenized_doc in enumerate(tokenized_docs):
    for token_idx in tokenized_doc:
        term_freq[doc_idx, token_idx] += 1
Now that we have calculated the term frequency matrix, we need to calculate the document frequency vector, followed by IDF.
# calculate IDF (inverse document frequency)
doc_freq = np.sum(term_freq > 0, axis=0)
# numpy will broadcast automatically, i.e. divide N by each doc_freq value
idf = np.log(N / doc_freq)
# each document row will be multiplied with the idf vector
tf_idf = term_freq * idf
The line doc_freq = np.sum(term_freq > 0, axis=0) counts, for each word in the vocabulary, the number of documents in which it appears at least once. For example, a word that occurs in every one of three documents has a document frequency of 3, no matter how many times it occurs within each of them.
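To see what that comparison-plus-sum does, here is a tiny toy example (the numbers are made up, not from the dataset):
# toy term-frequency matrix: 3 documents (rows) x 4 vocabulary words (columns)
toy_tf = np.array([
    [2, 1, 0, 0],
    [1, 0, 3, 0],
    [1, 0, 0, 5],
])
# a word is counted once per document it appears in, no matter how many times
print(np.sum(toy_tf > 0, axis=0))  # -> [3 1 1 1]
The first word appears in all three documents, so its document frequency is 3, even though its counts differ from document to document.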
Now that we have calculated everything, let's test this out by randomly choosing an article from our dataset and getting its top ten words.
# let's test this out
random_idx = np.random.choice(N)
row = df.iloc[random_idx]
print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])
scores = tf_idf[random_idx]
top_ten = (-scores).argsort()[:10]
print("\n")
print("Top ten words: ", [words[idx] for idx in top_ten])
The calculations above were done on a dense matrix, which may not be memory efficient. The same implementation using a sparse matrix from SciPy is shown below (it is also included in the companion notebook).
from collections import defaultdict
from scipy.sparse import csr_matrix
# for scipy the method is a little different
data = []
rows = []
cols = []
for doc_idx, tokenized_doc in enumerate(tokenized_docs):
    term_counts = defaultdict(int)
    for token_idx in tokenized_doc:
        term_counts[token_idx] += 1
    for token_idx, count in term_counts.items():
        data.append(count)
        rows.append(doc_idx)
        cols.append(token_idx)
sparse_term_freq = csr_matrix((data, (rows, cols)), shape=(N, V))
binary_term_freq = (sparse_term_freq > 0).astype(int)
# sum along the cols and convert to 1D array by flattening it
document_freq = np.array(binary_term_freq.sum(axis=0)).flatten()
idf = np.log(N / document_freq)
# we use the .multiply method instead of using "*" operator directly
tf_idf = sparse_term_freq.multiply(idf)
rand_idx = np.random.choice(N)
row = df.iloc[rand_idx]
print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])
scores = tf_idf.getrow(rand_idx).toarray().flatten()
top_five = (-scores).argsort()[:5]
print("\n")
print("Top five words: ", [words[idx] for idx in top_five])
Now that we have seen how the formulas work, let's look at the TfidfVectorizer class from scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.9,
    min_df=0.1,
    max_features=2000
)
# assuming you have already loaded the dataset given in the notebook
tfidf_mat = tfidf_vectorizer.fit_transform(df["text"])
# reverse mapping (idx to word)
feature_names = tfidf_vectorizer.get_feature_names_out()
rand_idx = np.random.choice(tfidf_mat.shape[0])
row = df.iloc[rand_idx]
print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])
tfidf_vector = tfidf_mat.getrow(rand_idx).toarray().flatten()
top_indices = (-tfidf_vector).argsort()[:5]
top_words = [feature_names[i] for i in top_indices]
print("\n")
print("Top ten words: ", top_words)
And this marks the end of the article. Don't forget to check out the companion notebook, as it contains the whole code in a single place.