Text Classification with Hugging Face's BERT Model in Langchain

Text classification is the process of categorizing natural language text into categories based on its content. It is widely used in applications such as sentiment analysis, spam detection, and topic labeling, among others. In this tutorial, we will walk through building a simple text classifier in Python using Hugging Face's BERT (Bidirectional Encoder Representations from Transformers) model and its tokenizer. We will also demonstrate how to interpret the classification results, focusing on the probabilities associated with the positive and negative classes.

Notebook link: https://github.com/ArjunAranetaCodes/LangChain-Guides/blob/main/Text_Classification_using_bert_base_uncased.ipynb

Prerequisites: To follow along with this tutorial, ensure you have installed the following dependencies (a quick verification snippet follows the list):

  • Python >= 3.7
  • Transformers library by Hugging Face (Install via pip: !pip install transformers)
  • PyTorch library (Install via pip: !pip install torch)
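
Optionally, you can run a quick sanity check like the one below (not part of the original notebook) to confirm that both libraries import correctly and to see which versions you have:

import transformers
import torch

# Print the installed versions; any reasonably recent releases should work for this tutorial.
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")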

Let's dive right in!


Step 1: Load Pre-Trained BERT Model and Tokenizer

We first need to download and initialize the pre-trained BERT model and tokenizer. The tokenizer converts raw text inputs into numerical representations suitable for feeding into the deep learning model. Here's the code:

from transformers import BertForSequenceClassification, BertTokenizer

model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)        

Explanation:

The above code imports the necessary classes and initializes a BERT model and tokenizer using Hugging Face's Transformers library. We set model_name to 'bert-base-cased', the cased base variant of BERT. We then create a BertTokenizer instance with BertTokenizer.from_pretrained, passing the model name so the library can download the pre-trained vocabulary. Finally, we instantiate a sequence classification model, BertForSequenceClassification, with num_labels=2 because we are performing binary classification. Note that the base BERT checkpoint does not ship with a classification head, so Transformers will warn that the classifier weights are newly (randomly) initialized; the model needs to be fine-tuned on labeled data before its predictions become meaningful.
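
To make the tokenizer's role concrete, here is a small illustrative snippet (not in the original notebook) that shows what the tokenizer produces for a short sentence; the exact token IDs depend on the checkpoint's vocabulary:

encoded = tokenizer("I had a wonderful day at work!", return_tensors='pt')
# Token IDs, including the special [CLS] and [SEP] tokens added by BERT.
print(encoded['input_ids'])
# Attention mask: 1 for real tokens (no padding in this single-sentence example).
print(encoded['attention_mask'])
# Map the IDs back to human-readable tokens.
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist()))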


Step 2: Create Function to Encode Input Text

Next, let's define a helper function encode_text that accepts raw text input and returns the predicted probabilities for positive and negative classes:

import torch  # needed for the softmax call below

def encode_text(text):
    # Tokenize the raw text into input IDs and an attention mask (as PyTorch tensors).
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # Forward pass through the classification model.
    output = model(**encoded_input)
    logits = output.logits
    # Convert raw logits into class probabilities.
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # Detach from the computation graph and return a NumPy array of probabilities.
    return probs.detach().numpy()[0]

Explanation:

Here, the encode_text function takes a string containing the raw text input. First, we tokenize the input using the previously initialized tokenizer object. We then feed the tokenized input to the model and retrieve the logits. Next, we apply a softmax over the last dimension of the logits tensor to convert the logits into probabilities that sum to one. Finally, we detach the resulting tensor from the computation graph and return a NumPy array containing the probabilities for the negative and positive classes.
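
Since we are only running inference here, a common refinement (sketched below as an optional alternative, not taken from the original notebook) is to put the model in evaluation mode and disable gradient tracking, which avoids unnecessary memory use:

import torch

model.eval()  # disable dropout so inference is deterministic

def encode_text(text):
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # no gradients are needed at inference time
        output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output.logits, dim=-1)
    # No detach() needed because gradients were never tracked inside torch.no_grad().
    return probs.numpy()[0]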


Step 3: Make Predictions

Now we can run the model on a custom text input and interpret the outputs:

text = "I had a wonderful day at work!"
probs = encode_text(text)

positive_prob = probs[1]
negative_prob = probs[0]

print(f"Probability of Positive Class: {positive_prob:.4f}")
print(f"Probability of Negative Class: {negative_prob:.4f}")        

Explanation:

Finally, we prepare our text input and call the encode_text function to fetch the predicted probabilities. By convention in this tutorial, index 0 corresponds to the negative class and index 1 to the positive class, so we extract the respective values and display them separately. These numbers indicate how likely the model considers the input to be negative or positive; as noted in Step 1, the untrained classification head makes these values essentially arbitrary until the model is fine-tuned on labeled sentiment data.
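
If you prefer a single predicted label instead of raw probabilities, a small illustrative extension is to take the argmax of the probability array; the index-to-label mapping below is the convention assumed in this tutorial:

import numpy as np

label_names = {0: 'negative', 1: 'positive'}  # assumed index-to-label mapping used in this tutorial
predicted_index = int(np.argmax(probs))
print(f"Predicted class: {label_names[predicted_index]} (probability: {probs[predicted_index]:.4f})")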

Congratulations! You now understand the basics of building a text classifier using Hugging Face's BERT model and tokenizer in Python. You also learned how to interpret the classification results, particularly the probabilities for the positive and negative classes. To build on this foundation, consider fine-tuning the model on a labeled dataset, exploring other NLP techniques, and testing alternative architectures from the vast ecosystem of pre-trained models available through Hugging Face's Transformers library. Happy experimenting!
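
As a point of comparison, the Transformers pipeline API can load a checkpoint that has already been fine-tuned for sentiment analysis, which yields meaningful scores out of the box. Here is a minimal sketch (the default model is downloaded automatically and is not part of the notebook above):

from transformers import pipeline

# The "sentiment-analysis" pipeline downloads a model fine-tuned for binary sentiment.
classifier = pipeline("sentiment-analysis")
result = classifier("I had a wonderful day at work!")
print(result)  # a list of dicts with 'label' and 'score' keys, e.g. [{'label': 'POSITIVE', 'score': ...}]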
