Feature Extraction with HuggingFace's Sentence Transformers

As machine learning practitioners, we often need to convert raw data into numerical features that our models can consume. This process, known as feature extraction, plays a crucial role in determining the performance and generalization ability of our algorithms. In natural language processing (NLP), one common approach to feature extraction is to represent text documents as dense vectors called embeddings. These embeddings capture the semantic content of the input texts, enabling downstream tasks such as search, clustering, and classification.

In this tutorial, I will show you how to use sentence-transformers, a powerful library built on top of Hugging Face's transformers, to extract high-quality sentence embeddings for feature extraction purposes. We will go through each line of the following code snippet and discuss its functionality:

Notebook link: https://github.com/ArjunAranetaCodes/LangChain-Guides/blob/main/Feature_Extraction_SentenceTransformer_mxbai_embed_large.ipynb

Code:

# Install the library (only needed once per environment)
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the pretrained embedding model; downloaded from the Hugging Face Hub on first use
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# mxbai-embed-large-v1 expects search queries to carry this prompt prefix
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

# The query itself plus the candidate passages to compare it against
docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# Encode all texts in one batch: one fixed-size vector per input
embeddings = model.encode(docs)

# Cosine similarity between the query embedding and each passage embedding
similarities = cos_sim(embeddings[0], embeddings[1:])

print('similarities:', similarities)

Explanation:

  • The first line installs the sentence-transformers package from PyPI. You only need to run it once per environment; after that, all of the library's functionality is available.
  • Next, we import the SentenceTransformer class, which provides an easy interface for generating sentence embeddings with pretrained transformer models, and the cos_sim utility function, which computes the cosine similarity between two sets of embeddings.
  • To create an instance of SentenceTransformer, we load the pretrained checkpoint mixedbread-ai/mxbai-embed-large-v1, which is downloaded from the Hugging Face Hub on first use. This model was trained on large text corpora and produces informative embeddings suitable for a wide range of NLP applications.
  • We then define a query string and a list of document strings, with the query itself as the first element. Note the prefix 'Represent this sentence for searching relevant passages: ' on the query: mxbai-embed-large-v1 expects search queries (but not the passages being searched) to carry this prompt. Our goal is to compute the similarity between the query embedding and each of the other documents' embeddings.
  • Using the encode method of our SentenceTransformer object, we obtain dense vector representations (embeddings) for all the documents in one batch. The result is a 2-D array with one fixed-size row vector per input sentence (1,024 dimensions for this model); see the first sketch after this list.
  • Finally, we compute the cosine similarity between the query embedding and every other document embedding using the previously imported cos_sim utility. Printing the resulting 1×4 tensor shows which documents sit closest to the query in embedding space; the ranking sketch after this list turns those scores into a sorted result.
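
To make the shape of encode's output concrete, here is a minimal sketch, assuming the model and docs variables from the snippet above; the (5, 1024) shape is specific to this checkpoint and worth verifying for whichever model you load:

embeddings = model.encode(docs)
print(type(embeddings))   # <class 'numpy.ndarray'> by default
print(embeddings.shape)   # (5, 1024): five inputs, one 1024-dim vector each
print(embeddings[0][:5])  # first few components of the query embedding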
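
To turn the raw scores into something actionable, this sketch (again assuming the variables from the main snippet) sorts the candidate passages by similarity; cos_sim returns a 1×4 tensor here, so we take its first row:

# Pair each passage with its score and print, highest similarity first
scores = similarities[0].tolist()
for passage, score in sorted(zip(docs[1:], scores), key=lambda p: p[1], reverse=True):
    print(f'{score:.4f}  {passage}')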

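As a wrap-up, here is a hypothetical helper, assuming the imports and model from the main snippet, that folds the pattern above into a reusable search function; the name search and its signature are my own illustration, not a library API:

# The retrieval prompt documented for mxbai-embed-large-v1
PROMPT = 'Represent this sentence for searching relevant passages: '

def search(model, query, passages, top_k=3):
    # Embed the prompted query together with the raw passages in one call
    embeddings = model.encode([PROMPT + query] + passages)
    scores = cos_sim(embeddings[0], embeddings[1:])[0].tolist()
    # Return the top_k passages, highest similarity first
    return sorted(zip(passages, scores), key=lambda p: p[1], reverse=True)[:top_k]

# Example usage with the documents from earlier
print(search(model, 'A man is eating a piece of bread', docs[1:]))
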
By mastering feature extraction using sentence transformers, you can unlock exciting possibilities in your NLP projects, ranging from efficient retrieval systems to accurate sentiment analysis tools. Happy coding!
