Document Similarity with Examples in Python
Rany ElHousieny, PhD
Generative AI Engineering Manager | ex-Microsoft | AI Solutions Architect | Expert in LLM, NLP, and AI-Driven Innovation | AI Product Leader
Document similarity is a crucial concept in natural language processing (NLP) that measures how closely two or more documents are related in terms of their content. It is widely used in applications such as search engines, recommendation systems, and plagiarism detection. This article will explore different methods to calculate document similarity and demonstrate their implementation in Python using examples.
1. Cosine Similarity
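Cosine similarity is a popular method for measuring the similarity between two documents. It calculates the cosine of the angle between two vectors that represent the documents in a multi-dimensional space. In general the cosine ranges from -1 to 1, where 1 means the vectors point in the same direction and -1 means they point in opposite directions. TF-IDF and word-count vectors have no negative components, so for them the score falls between 0 and 1: 1 indicates documents with the same term proportions, and 0 indicates documents that share no terms.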
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
doc1 = "The sky is blue."
doc2 = "The sun is bright."
# Vectorize the documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1, doc2])
# Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(f"Cosine Similarity: {cosine_sim[0][1]}")
Cosine Similarity: 0.3360969272762575
Here's a breakdown of each step:
Import necessary modules:
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer: A class from scikit-learn's feature_extraction.text module that converts a collection of raw documents into a matrix of TF-IDF features.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity: A function from scikit-learn's metrics.pairwise module that calculates the cosine similarity between samples in a given dataset.
Define sample documents:
doc1 = "The sky is blue."
doc2 = "The sun is bright."
These are two simple text documents that we want to compare for similarity.
Vectorize the documents:
vectorizer = TfidfVectorizer()
TfidfVectorizer() is instantiated to create a vectorizer object.
tfidf_matrix = vectorizer.fit_transform([doc1, doc2])
The fit_transform() method learns the vocabulary and IDF weights from the documents and transforms them into a TF-IDF matrix. This matrix represents the documents in a numerical form suitable for similarity calculations. Each row corresponds to a document, and each column represents a unique word in the document corpus. The values in the matrix are the TF-IDF scores, which reflect the importance of each word in each document.
Displayed as a pandas DataFrame, the TF-IDF matrix for the two sample documents has the following structure:
- Rows: Each row corresponds to a document. Row 0 represents the first document ("The sky is blue."), and row 1 represents the second document ("The sun is bright.").
- Columns: Each column corresponds to a unique word (feature) extracted from the documents. The words are "blue," "bright," "is," "sky," "sun," and "the."
- Values: The values in the DataFrame are the TF-IDF scores for each word in each document. These scores represent the importance of each word in the context of the corresponding document.
For example:
- The value 0.576152 in row 0, column "blue" indicates the TF-IDF score of the word "blue" in the first document. This score is relatively high, reflecting that "blue" is an important and distinctive word in the context of the first document.
- The value 0.000000 in row 0, column "bright" indicates that the word "bright" does not appear in the first document, hence its TF-IDF score is 0.
- The value 0.409937 in row 0, column "is" indicates the TF-IDF score of the word "is" in the first document. This score is lower than that of "blue" because "is" appears in both documents, so it receives a lower IDF weight and carries less distinctive information.
The TF-IDF matrix provides a numerical representation of the documents that can be used for various NLP tasks, such as calculating document similarity, as demonstrated earlier with cosine similarity.
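The scores quoted above can be reproduced by hand. The sketch below assumes scikit-learn's default TfidfVectorizer settings (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length); the variable names are illustrative.

```python
import numpy as np

n_docs = 2
# Vocabulary in the order used by the vectorizer
vocab = ["blue", "bright", "is", "sky", "sun", "the"]
# Number of documents that contain each word
doc_freq = np.array([1, 1, 2, 1, 1, 2])
# Raw term counts for "The sky is blue."
counts_doc1 = np.array([1, 0, 1, 1, 0, 1])

# Smoothed inverse document frequency (scikit-learn's default formula)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF before normalization, then scaled to unit Euclidean length
raw = counts_doc1 * idf
tfidf_doc1 = raw / np.linalg.norm(raw)

for word, score in zip(vocab, tfidf_doc1):
    print(f"{word}: {score:.6f}")
```

Words that occur in both documents ("is", "the") receive the minimum IDF of 1, which is why they end up with lower scores than "blue" and "sky".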
Calculate cosine similarity:
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
The cosine_similarity() function calculates the cosine similarity between the first document (tfidf_matrix[0:1], a slice containing only the first row of the matrix) and every document in the matrix, including itself. The result, cosine_sim, is a 1×2 matrix in which each element is the similarity score between the first document and one of the documents. The first score, cosine_sim[0][0], compares the first document with itself and is therefore 1; the second, cosine_sim[0][1], compares the first document with the second and is the value printed above.
TF-IDF converts the text into numerical vectors, which is what makes the cosine similarity calculation possible. Cosine similarity itself is a measure of similarity between two non-zero vectors in an inner product space and is widely used in NLP to compare documents.
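For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with NumPy for the same two documents; the helper name cosine is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The sky is blue.", "The sun is bright."]
vectors = TfidfVectorizer().fit_transform(docs).toarray()

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

manual_sim = cosine(vectors[0], vectors[1])
print(f"Cosine similarity (manual): {manual_sim:.4f}")  # 0.3361
```

Because TfidfVectorizer scales each row to unit length by default, the denominator here is 1 and the score reduces to the dot product. Only the shared words "is" and "the" contribute to it.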
2. Jaccard Similarity
Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. In the context of documents, it compares the sets of words present in each document.
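def jaccard_similarity(doc1, doc2):
    # Split on whitespace; case and punctuation are kept as-is
    words_doc1 = set(doc1.split())
    words_doc2 = set(doc2.split())
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    # Two empty documents have an empty union; return 0.0 instead of dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)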
# Sample documents
doc1 = "The sky is blue."
doc2 = "The sun is bright."
# Calculate Jaccard similarity
jaccard_sim = jaccard_similarity(doc1, doc2)
print(f"Jaccard Similarity: {jaccard_sim}")
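The function above splits on whitespace only, so capitalization and attached punctuation ("blue." versus "blue?") count as differences. For the two sample sentences the score is 2/6 ≈ 0.33 either way, but other inputs can change considerably. The sketch below shows one possible normalization (lowercasing and keeping only word characters); the helper names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, or underscores."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard_normalized(doc_a, doc_b):
    words_a, words_b = tokenize(doc_a), tokenize(doc_b)
    union = words_a | words_b
    if not union:  # both documents are empty
        return 0.0
    return len(words_a & words_b) / len(union)

def jaccard_naive(doc_a, doc_b):
    # Whitespace split only, as in the function above
    words_a, words_b = set(doc_a.split()), set(doc_b.split())
    return len(words_a & words_b) / len(words_a | words_b)

a = "The sky is blue."
b = "Is the sky blue?"
print(jaccard_naive(a, b))       # only "sky" matches: 1/7
print(jaccard_normalized(a, b))  # all four words match: 1.0
```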
3. Euclidean Distance
Euclidean distance is a measure of the straight-line distance between two points in a multi-dimensional space. In document similarity, it is used to measure the distance between the vector representations of the documents. A smaller distance indicates higher similarity.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean
# Sample documents
doc1 = "The sky is blue."
doc2 = "The sun is bright."
# Vectorize the documents
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform([doc1, doc2])
# Convert to dense matrix
dense_matrix = count_matrix.toarray()
# Calculate Euclidean distance
euclidean_dist = euclidean(dense_matrix[0], dense_matrix[1])
print(f"Euclidean Distance: {euclidean_dist}")
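Raw count vectors grow with document length, so Euclidean distance on them partly measures length rather than content. One common remedy, shown here as a sketch rather than as the method used above, is to compare unit-length vectors. For those, squared Euclidean distance and cosine similarity are tied by d² = 2(1 - cos), so the two measures rank documents identically.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["The sky is blue.", "The sun is bright."]

# Raw counts: the two vectors differ by 1 in four positions
counts = CountVectorizer().fit_transform(docs)
count_dist = euclidean_distances(counts)[0, 1]
print(f"Euclidean distance (counts): {count_dist:.4f}")  # sqrt(4) = 2.0000

# TF-IDF rows are scaled to unit length by default
tfidf = TfidfVectorizer().fit_transform(docs)
unit_dist = euclidean_distances(tfidf)[0, 1]
cos = cosine_similarity(tfidf)[0, 1]

# For unit vectors, distance squared equals 2 * (1 - cosine)
print(np.isclose(unit_dist ** 2, 2 * (1 - cos)))  # True
```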
Modern and Advanced Methods for Document Similarity
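Deep learning has produced more advanced methods for document similarity. The examples that follow cover, in order: word embeddings (Word2Vec), sentence embeddings from a pre-trained transformer, a Siamese network, embeddings from the OpenAI API, and clustering of sentence embeddings.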
Word embeddings (Word2Vec):
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Sample documents
documents = ["The sky is blue", "The sun is bright"]
# Tokenize documents
tokenized_docs = [doc.split() for doc in documents]
# Train Word2Vec model
model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)
# Calculate average vector for each document
def average_vector(doc):
    return np.mean([model.wv[word] for word in doc if word in model.wv], axis=0)
doc_vectors = [average_vector(doc) for doc in tokenized_docs]
# Calculate cosine similarity
similarity = cosine_similarity([doc_vectors[0]], [doc_vectors[1]])
print("Cosine similarity (Word2Vec):", similarity[0][0])
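A Word2Vec model trained on two short sentences is unlikely to learn meaningful vectors, so in practice the vectors usually come from a model trained on a much larger corpus. The averaging step itself does not depend on where the vectors come from. The sketch below uses a tiny hand-made embedding table (hypothetical values, chosen only for illustration) and returns a zero vector when no word is in the vocabulary, a case in which np.mean over an empty list would otherwise produce NaN.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "sky": np.array([0.9, 0.1, 0.0]),
    "sun": np.array([0.8, 0.3, 0.1]),
    "blue": np.array([0.2, 0.9, 0.1]),
    "bright": np.array([0.3, 0.8, 0.2]),
}

def average_vector(tokens, vectors, dim=3):
    """Mean of the vectors for in-vocabulary tokens; zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

def cosine(a, b):
    """Cosine similarity, defined here as 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

doc_a = average_vector("the sky is blue".split(), toy_vectors)
doc_b = average_vector("the sun is bright".split(), toy_vectors)
unknown = average_vector("completely unseen words".split(), toy_vectors)

print(cosine(doc_a, doc_b))
print(cosine(doc_a, unknown))  # 0.0, because no word was found
```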
Sentence embeddings (pre-trained transformer):
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load a pre-trained sentence-embedding model (a compact BERT-style transformer)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample sentences
sentences = ["The sky is blue", "The sun is bright"]
# Generate embeddings
embeddings = model.encode(sentences)
# Calculate cosine similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print("Cosine similarity (BERT):", similarity[0][0])
Siamese network:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras import backend as K
def euclidean_distance(vects):
    # Euclidean distance between the two encoded inputs, one value per pair
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Clamp to epsilon so the square root stays numerically stable
    return K.sqrt(K.maximum(sum_square, K.epsilon()))
# Define inputs
input_a = Input(shape=(100,))
input_b = Input(shape=(100,))
# Shared dense layer
shared_layer = Dense(64, activation='relu')
encoded_a = shared_layer(input_a)
encoded_b = shared_layer(input_b)
# Euclidean distance
distance = Lambda(euclidean_distance)([encoded_a, encoded_b])
# Siamese network
siamese_net = Model(inputs=[input_a, input_b], outputs=distance)
# Compile model
siamese_net.compile(optimizer='adam', loss='mean_squared_error')
Note: This is just a basic structure. A real implementation would require a dataset for training and a way to generate positive and negative pairs.
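One simple way to build such pairs from labeled documents is sketched below. It is an assumption about how the training data might be prepared, not part of the network above: the helper make_pairs, the sample documents, and their labels are all hypothetical. Each document would still need to be converted to a 100-dimensional vector before being fed to the two inputs.

```python
from itertools import combinations

def make_pairs(items, labels):
    """Return (item_a, item_b, target) triples for every unordered pair.

    target is 1 when both items share a label (positive pair), else 0.
    """
    pairs = []
    for i, j in combinations(range(len(items)), 2):
        target = 1 if labels[i] == labels[j] else 0
        pairs.append((items[i], items[j], target))
    return pairs

# Hypothetical labeled documents, for illustration only
docs = [
    "The sky is blue",
    "The ocean is blue",
    "Stocks fell today",
    "Markets dropped sharply",
]
labels = ["nature", "nature", "finance", "finance"]

pairs = make_pairs(docs, labels)
positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
print(len(positives), len(negatives))  # 2 4
```

Whether 1 should stand for "similar" or for "far apart" depends on the loss used for training. Since the network outputs a distance, similar pairs would normally be given the smaller target.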
Embeddings from the OpenAI API:
import openai
from sklearn.metrics.pairwise import cosine_similarity
# Set your OpenAI API key
openai.api_key = 'your-api-key'
# Sample documents
documents = ["The sky is blue", "The sun is bright"]
# Encode documents using GPT-3
# (legacy interface of the openai package, versions before 1.0; the model name may no longer be available)
embeddings = []
for doc in documents:
    response = openai.Embedding.create(input=doc, engine="text-similarity-babbage-001")
    # Collect the embedding vector returned for this document
    embeddings.append(response['data'][0]['embedding'])
# Calculate cosine similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print("Cosine similarity (GPT-3):", similarity[0][0])
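Clustering sentence embeddings (HDBSCAN):
from sentence_transformers import SentenceTransformer
import hdbscan
# Load a pre-trained sentence-embedding model (a compact BERT-style transformer)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample sentences
sentences = ["The sky is blue", "The sun is bright", "The ocean is deep"]
# Generate embeddings
embeddings = model.encode(sentences)
# The embeddings can then be passed to an HDBSCAN clusterer (hdbscan.HDBSCAN) to group similar sentences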
These modern methods often provide better results in terms of capturing semantic similarity, handling polysemy (words with multiple meanings), and dealing with different lengths of documents. However, they also require more computational resources and may involve more complex preprocessing and fine-tuning steps.
Document similarity is a fundamental concept in NLP with various applications. In this article, we explored three classic methods for measuring document similarity (cosine similarity, Jaccard similarity, and Euclidean distance) and surveyed several modern, embedding-based approaches. Each method has its own advantages and use cases, and the choice depends on the specific requirements of the application. By implementing these methods in Python, we can effectively compare and analyze the similarity between documents.