Document Similarity with Examples in Python


Document similarity is a crucial concept in natural language processing (NLP) that measures how closely two or more documents are related in terms of their content. It is widely used in applications such as search engines, recommendation systems, and plagiarism detection. This article will explore different methods to calculate document similarity and demonstrate their implementation in Python using examples.

1. Cosine Similarity

Cosine similarity is a popular method for measuring the similarity between two documents. It calculates the cosine of the angle between two vectors that represent the documents in a multi-dimensional space. In general the cosine value ranges from -1 to 1, but with non-negative text representations such as TF-IDF it falls between 0 and 1: a value of 1 indicates documents with identical term profiles, and a value near 0 indicates documents that share essentially no terms.

https://youtu.be/e9U0QAFbfLI?si=tEG1GpqDkFX05Y_I

Example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
doc1 = "The sky is blue."
doc2 = "The sun is bright."

# Vectorize the documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1, doc2])

# Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(f"Cosine Similarity: {cosine_sim[0][1]}")

Output:
Cosine Similarity: 0.3360969272762575


Here's a breakdown of each step:

Import necessary modules:

from sklearn.feature_extraction.text import TfidfVectorizer         

TfidfVectorizer: A class from scikit-learn's feature_extraction.text module that converts a collection of raw documents into a matrix of TF-IDF features.

from sklearn.metrics.pairwise import cosine_similarity        

cosine_similarity: A function from scikit-learn's metrics.pairwise module that calculates the cosine similarity between samples in a given dataset.


Define sample documents:


doc1 = "The sky is blue." 
doc2 = "The sun is bright."        

These are two simple text documents that we want to compare for similarity.


Vectorize the documents:


vectorizer = TfidfVectorizer()         

TfidfVectorizer() is instantiated to create a vectorizer object.

tfidf_matrix = vectorizer.fit_transform([doc1, doc2])        

fit_transform() method is used to transform the documents into a TF-IDF matrix. This matrix represents the documents in a numerical form suitable for similarity calculations. Each row corresponds to a document, and each column represents a unique word in the document corpus. The values in the matrix are the TF-IDF scores, which reflect the importance of each word in each document.

Viewed as a table (for example, a pandas DataFrame) with one row per document and one column per vocabulary term, the TF-IDF matrix for the two sample documents can be read as follows:

- Rows: Each row corresponds to a document. Row 0 represents the first document ("The sky is blue."), and row 1 represents the second document ("The sun is bright.").

- Columns: Each column corresponds to a unique word (feature) extracted from the documents. The words are "blue," "bright," "is," "sky," "sun," and "the."

- Values: The values in the DataFrame are the TF-IDF scores for each word in each document. These scores represent the importance of each word in the context of the corresponding document.

For example:

- The value 0.576152 in row 0, column "blue" indicates the TF-IDF score of the word "blue" in the first document. This score is relatively high, reflecting that "blue" is an important and distinctive word in the context of the first document.

- The value 0.000000 in row 0, column "bright" indicates that the word "bright" does not appear in the first document, hence its TF-IDF score is 0.

- The value 0.409937 in row 0, column "is" indicates the TF-IDF score of the word "is" in the first document. This score is lower than that of "blue" because "is" is a more common word and likely carries less distinctive information.

The TF-IDF matrix provides a numerical representation of the documents that can be used for various NLP tasks, such as calculating document similarity, as demonstrated earlier with cosine similarity.
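
If you want to reproduce this tabular view yourself, a minimal sketch (assuming pandas is installed, and reusing the vectorizer and tfidf_matrix from the example above):

import pandas as pd

# One row per document, one column per vocabulary term
# (on older scikit-learn versions, use vectorizer.get_feature_names() instead)
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)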


Calculate cosine similarity:

cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)        

The cosine_similarity() function calculates the cosine similarity between the first document (tfidf_matrix[0:1], a slice representing the first row of the matrix) and all documents in the matrix, including itself. The result, cosine_sim, is a matrix in which each element is the cosine similarity score between the first document and another document. Since there are only two documents here, the matrix contains two scores: the similarity between the first document and itself (which is 1, because any document is perfectly similar to itself), and the similarity between the first and second documents.


The reason for using TF-IDF is to convert the textual information into a numerical format (the TF-IDF matrix) on which cosine similarity can be computed. Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space and is widely used in NLP to compare documents.
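
To make the connection concrete, the same cosine similarity can be computed by hand from the TF-IDF vectors with NumPy. A minimal sketch (reusing tfidf_matrix from the example above):

import numpy as np

# Dense copies of the two TF-IDF row vectors
v1, v2 = tfidf_matrix.toarray()

# Cosine similarity = dot product divided by the product of the vector norms
manual_cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(manual_cos)  # matches the sklearn result (~0.336)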


2. Jaccard Similarity

Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. In the context of documents, it compares the sets of words present in each document.

Example:

def jaccard_similarity(doc1, doc2):
    words_doc1 = set(doc1.split())
    words_doc2 = set(doc2.split())
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return len(intersection) / len(union)

# Sample documents
doc1 = "The sky is blue."
doc2 = "The sun is bright."

# Calculate Jaccard similarity
jaccard_sim = jaccard_similarity(doc1, doc2)
print(f"Jaccard Similarity: {jaccard_sim}")



3. Euclidean Distance

Euclidean distance is a measure of the straight-line distance between two points in a multi-dimensional space. In document similarity, it is used to measure the distance between the vector representations of the documents. A smaller distance indicates higher similarity.

Example:

from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean

# Sample documents
doc1 = "The sky is blue."
doc2 = "The sun is bright."

# Vectorize the documents
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform([doc1, doc2])

# Convert to dense matrix
dense_matrix = count_matrix.toarray()

# Calculate Euclidean distance
euclidean_dist = euclidean(dense_matrix[0], dense_matrix[1])
print(f"Euclidean Distance: {euclidean_dist}")


Modern and Advanced Methods for Document Similarity

There are more modern and advanced methods for document similarity that have emerged with the advent of deep learning and neural networks. Some of these methods include:

  • Word Embeddings (e.g., Word2Vec, GloVe): Word embeddings are dense vector representations of words, where similar words have similar vectors. Document similarity can be calculated by averaging the embeddings of all words in each document and then computing the cosine similarity between these average vectors.

from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Sample documents
documents = ["The sky is blue", "The sun is bright"]

# Tokenize documents
tokenized_docs = [doc.split() for doc in documents]

# Train Word2Vec model
# (note: two short sentences are far too little data to learn meaningful vectors;
#  in practice you would train on a much larger corpus or use pre-trained embeddings)
model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# Calculate average vector for each document
def average_vector(doc):
    return np.mean([model.wv[word] for word in doc if word in model.wv], axis=0)

doc_vectors = [average_vector(doc) for doc in tokenized_docs]

# Calculate cosine similarity
similarity = cosine_similarity([doc_vectors[0]], [doc_vectors[1]])
print("Cosine similarity (Word2Vec):", similarity[0][0])
        

  • Sentence and Document Embeddings (e.g., Doc2Vec, BERT): Similar to word embeddings, sentence and document embeddings provide dense vector representations for entire sentences or documents. Pre-trained models like BERT (Bidirectional Encoder Representations from Transformers) can be used to generate these embeddings, which can then be used to calculate document similarity.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model for sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentences
sentences = ["The sky is blue", "The sun is bright"]

# Generate embeddings
embeddings = model.encode(sentences)

# Calculate cosine similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print("Cosine similarity (BERT):", similarity[0][0])
        

  • Siamese Networks: Siamese networks are neural network architectures that can be trained to learn a similarity function; they can be used to directly learn the similarity between two documents based on their content. Siamese networks are more complex to implement and typically require a dataset for training. Here's a simplified example of how you might set up a Siamese network using TensorFlow and Keras:

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras import backend as K

def euclidean_distance(vects):
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

# Define inputs
input_a = Input(shape=(100,))
input_b = Input(shape=(100,))

# Shared dense layer
shared_layer = Dense(64, activation='relu')
encoded_a = shared_layer(input_a)
encoded_b = shared_layer(input_b)

# Euclidean distance
distance = Lambda(euclidean_distance)([encoded_a, encoded_b])

# Siamese network
siamese_net = Model(inputs=[input_a, input_b], outputs=distance)

# Compile model
siamese_net.compile(optimizer='adam', loss='mean_squared_error')        

Note: This is just a basic structure. A real implementation would require a dataset for training and a way to generate positive and negative pairs.
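
As a rough illustration only, the missing training and inference steps might look like the following, with random arrays standing in for real document vectors and hypothetical 0/1 labels (0 for similar pairs, 1 for dissimilar pairs) used as regression targets for the distance:

import numpy as np

# Hypothetical stand-in data: 1,000 pairs of 100-dimensional document vectors.
# In a real setting these would come from a vectorizer or embedding model, and the
# labels would reflect known similar/dissimilar pairs so the network learns to output
# small distances for similar pairs and larger distances otherwise.
pairs_a = np.random.rand(1000, 100).astype('float32')
pairs_b = np.random.rand(1000, 100).astype('float32')
labels = np.random.randint(0, 2, size=(1000, 1)).astype('float32')

siamese_net.fit([pairs_a, pairs_b], labels, batch_size=32, epochs=5)

# Lower predicted distance = more similar documents
predicted_distance = siamese_net.predict([pairs_a[:1], pairs_b[:1]])
print(predicted_distance)

In practice, a contrastive or triplet loss is usually preferred over plain mean squared error for this kind of network.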


  • Transformer Models (e.g., BERT, GPT-3): Transformer models have become very popular in NLP for various tasks, including document similarity. These models can capture the context and semantics of words in a document more effectively than traditional methods. By fine-tuning pre-trained transformer models on specific tasks or datasets, one can achieve state-of-the-art results for document similarity. Using GPT-3 for document similarity requires access to the OpenAI API. Here's a basic example of how you might use GPT-3 to encode text and then calculate cosine similarity:

import openai
from sklearn.metrics.pairwise import cosine_similarity

# Set your OpenAI API key
openai.api_key = 'your-api-key'

# Sample documents
documents = ["The sky is blue", "The sun is bright"]

# Encode documents using the OpenAI embeddings endpoint
# (this uses the pre-1.0 openai Python client; in openai>=1.0 the equivalent call is
#  client.embeddings.create(model=..., input=...), and the text-similarity-* engines
#  have been superseded by newer embedding models)
embeddings = []
for doc in documents:
    response = openai.Embedding.create(input=doc, engine="text-similarity-babbage-001")
    embeddings.append(response['data'][0]['embedding'])

# Calculate cosine similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print("Cosine similarity (GPT-3):", similarity[0][0])
        

  • Neural Network-Based Clustering: Advanced clustering techniques based on neural networks can group similar documents together in high-dimensional space, allowing for an efficient way to measure document similarity within and across clusters. Implementing neural network-based clustering from scratch is quite complex. However, you can use pre-trained models and libraries like HDBSCAN for clustering. Here's a simplified example using sentence embeddings and HDBSCAN:

from sentence_transformers import SentenceTransformer
import hdbscan

# Load pre-trained BERT model for sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentences
sentences = ["The sky is blue", "The sun is bright", "The ocean is deep"]

# Generate embeddings
embeddings = model.encode(sentences)

# Cluster the embeddings with HDBSCAN
# (min_cluster_size=2 only because this toy corpus is tiny; on a sample this small,
#  points may simply be labeled as noise, i.e. -1)
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='euclidean')
labels = clusterer.fit_predict(embeddings)
print("Cluster labels:", labels)

These modern methods often provide better results in terms of capturing semantic similarity, handling polysemy (words with multiple meanings), and dealing with different lengths of documents. However, they also require more computational resources and may involve more complex preprocessing and fine-tuning steps.


Conclusion

Document similarity is a fundamental concept in NLP with many applications. In this article, we explored three classical methods for measuring document similarity: cosine similarity, Jaccard similarity, and Euclidean distance, along with more modern embedding-based approaches such as Word2Vec, BERT, Siamese networks, and neural clustering. Each method has its own advantages and use cases, and the choice depends on the specific requirements of the application. By implementing these methods in Python, we can effectively compare and analyze the similarity between documents.
