Feature Extraction with HuggingFace's Sentence Transformers

As machine learning practitioners, we often need to convert raw data into numerical features that our models can consume. This process, known as feature extraction, plays a crucial role in determining the performance and generalization ability of our algorithms. In natural language processing (NLP), one common approach to feature extraction is to represent text documents as dense vectors called embeddings. These embeddings capture the semantic content of the input texts, enabling downstream tasks such as search, clustering, and classification.

In this tutorial, I will show you how to use sentence-transformers, a powerful library built on top of Hugging Face's transformers, to extract high-quality sentence embeddings for feature extraction purposes. We will go through each line of the following code snippet and discuss its functionality:

Notebook link: https://github.com/ArjunAranetaCodes/LangChain-Guides/blob/main/Feature_Extraction_SentenceTransformer_mxbai_embed_large.ipynb

Code:

# Install the library (only needed once per environment)
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the pretrained embedding model; downloaded from the Hugging Face Hub on first use
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# mxbai-embed-large-v1 expects search queries to carry this prompt prefix
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

# The query itself plus the candidate passages to compare it against
docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# Encode all texts in one batch: one fixed-size vector per input
embeddings = model.encode(docs)

# Cosine similarity between the query embedding and each passage embedding
similarities = cos_sim(embeddings[0], embeddings[1:])

print('similarities:', similarities)

Explanation:

  • The first line installs the sentence-transformers package from PyPI. You only need to run it once per environment; after that, all of the library's functionality is available.
  • Next, we import the SentenceTransformer class, which provides an easy interface for generating sentence embeddings with pretrained transformer models, and the cos_sim utility function, which computes the cosine similarity between two sets of embeddings.
  • To create an instance of SentenceTransformer, we load the pretrained checkpoint mixedbread-ai/mxbai-embed-large-v1, which is downloaded from the Hugging Face Hub on first use. This model was trained on large text corpora and produces informative embeddings suitable for a wide range of NLP applications.
  • We then define a query string and a list of document strings, with the query itself as the first element. Note the prefix 'Represent this sentence for searching relevant passages: ' on the query: mxbai-embed-large-v1 expects search queries (but not the passages being searched) to carry this prompt. Our goal is to compute the similarity between the query embedding and each of the other documents' embeddings.
  • Using the encode method of our SentenceTransformer object, we obtain dense vector representations (embeddings) for all the documents in one batch. The result is a 2-D array with one fixed-size row vector per input sentence (1,024 dimensions for this model); see the first sketch after this list.
  • Finally, we compute the cosine similarity between the query embedding and every other document embedding using the previously imported cos_sim utility. Printing the resulting 1×4 tensor shows which documents sit closest to the query in embedding space; the ranking sketch after this list turns those scores into a sorted result.
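
To make the shape of encode's output concrete, here is a minimal sketch, assuming the model and docs variables from the snippet above; the (5, 1024) shape is specific to this checkpoint and worth verifying for whichever model you load:

embeddings = model.encode(docs)
print(type(embeddings))   # <class 'numpy.ndarray'> by default
print(embeddings.shape)   # (5, 1024): five inputs, one 1024-dim vector each
print(embeddings[0][:5])  # first few components of the query embedding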
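
To turn the raw scores into something actionable, this sketch (again assuming the variables from the main snippet) sorts the candidate passages by similarity; cos_sim returns a 1×4 tensor here, so we take its first row:

# Pair each passage with its score and print, highest similarity first
scores = similarities[0].tolist()
for passage, score in sorted(zip(docs[1:], scores), key=lambda p: p[1], reverse=True):
    print(f'{score:.4f}  {passage}')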

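As a wrap-up, here is a hypothetical helper, assuming the imports and model from the main snippet, that folds the pattern above into a reusable search function; the name search and its signature are my own illustration, not a library API:

# The retrieval prompt documented for mxbai-embed-large-v1
PROMPT = 'Represent this sentence for searching relevant passages: '

def search(model, query, passages, top_k=3):
    # Embed the prompted query together with the raw passages in one call
    embeddings = model.encode([PROMPT + query] + passages)
    scores = cos_sim(embeddings[0], embeddings[1:])[0].tolist()
    # Return the top_k passages, highest similarity first
    return sorted(zip(passages, scores), key=lambda p: p[1], reverse=True)[:top_k]

# Example usage with the documents from earlier
print(search(model, 'A man is eating a piece of bread', docs[1:]))
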
By mastering feature extraction using sentence transformers, you can unlock exciting possibilities in your NLP projects, ranging from efficient retrieval systems to accurate sentiment analysis tools. Happy coding!
