Text Classification with Hugging Face's BERT Model in Langchain

Text classification is the process of categorizing natural language text into categories based on its content. It is widely used in applications such as sentiment analysis, spam detection, and topic labeling, among others. In this tutorial, we will walk through building a simple text classifier in Python using Hugging Face's BERT (Bidirectional Encoder Representations from Transformers) model and its tokenizer. We will also demonstrate how to interpret the classification results, focusing on the probabilities associated with the positive and negative classes.

Notebook link: https://github.com/ArjunAranetaCodes/LangChain-Guides/blob/main/Text_Classification_using_bert_base_uncased.ipynb

Prerequisites: To follow along with this tutorial, ensure you have installed the following dependencies (a quick verification snippet follows the list):

  • Python >= 3.7
  • Transformers library by Hugging Face (Install via pip: !pip install transformers)
  • PyTorch library (Install via pip: !pip install torch)
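
Optionally, you can run a quick sanity check like the one below (not part of the original notebook) to confirm that both libraries import correctly and to see which versions you have:

import transformers
import torch

# Print the installed versions; any reasonably recent releases should work for this tutorial.
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")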

Let's dive right in!


Step 1: Load Pre-Trained BERT Model and Tokenizer

We first need to download and initialize the pre-trained BERT model and tokenizer. The tokenizer converts raw text inputs into numerical representations suitable for feeding into the deep learning model. Here's the code:

from transformers import BertForSequenceClassification, BertTokenizer

model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)        

Explanation:

The above code imports the necessary classes and initializes a BERT model and tokenizer using Hugging Face's Transformers library. We set model_name to 'bert-base-cased', the cased base variant of BERT. We then create a BertTokenizer instance with BertTokenizer.from_pretrained, passing the model name so the library can download the pre-trained vocabulary. Finally, we instantiate a sequence classification model, BertForSequenceClassification, with num_labels=2 because we are performing binary classification. Note that the base BERT checkpoint does not ship with a classification head, so Transformers will warn that the classifier weights are newly (randomly) initialized; the model needs to be fine-tuned on labeled data before its predictions become meaningful.
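
To make the tokenizer's role concrete, here is a small illustrative snippet (not in the original notebook) that shows what the tokenizer produces for a short sentence; the exact token IDs depend on the checkpoint's vocabulary:

encoded = tokenizer("I had a wonderful day at work!", return_tensors='pt')
# Token IDs, including the special [CLS] and [SEP] tokens added by BERT.
print(encoded['input_ids'])
# Attention mask: 1 for real tokens (no padding in this single-sentence example).
print(encoded['attention_mask'])
# Map the IDs back to human-readable tokens.
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist()))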


Step 2: Create Function to Encode Input Text

Next, let's define a helper function encode_text that accepts raw text input and returns the predicted probabilities for positive and negative classes:

import torch  # needed for the softmax call below

def encode_text(text):
    # Tokenize the raw text into input IDs and an attention mask (as PyTorch tensors).
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # Forward pass through the classification model.
    output = model(**encoded_input)
    logits = output.logits
    # Convert raw logits into class probabilities.
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # Detach from the computation graph and return a NumPy array of probabilities.
    return probs.detach().numpy()[0]

Explanation:

Here, the encode_text function takes a string containing the raw text input. First, we tokenize the input using the previously initialized tokenizer object. We then feed the tokenized input to the model and retrieve the logits. Next, we apply a softmax over the last dimension of the logits tensor to convert the logits into probabilities that sum to one. Finally, we detach the resulting tensor from the computation graph and return a NumPy array containing the probabilities for the negative and positive classes.
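
Since we are only running inference here, a common refinement (sketched below as an optional alternative, not taken from the original notebook) is to put the model in evaluation mode and disable gradient tracking, which avoids unnecessary memory use:

import torch

model.eval()  # disable dropout so inference is deterministic

def encode_text(text):
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # no gradients are needed at inference time
        output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output.logits, dim=-1)
    # No detach() needed because gradients were never tracked inside torch.no_grad().
    return probs.numpy()[0]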


Step 3: Make Predictions

Now we can run the model on a custom text input and interpret the outputs:

text = "I had a wonderful day at work!"
probs = encode_text(text)

positive_prob = probs[1]
negative_prob = probs[0]

print(f"Probability of Positive Class: {positive_prob:.4f}")
print(f"Probability of Negative Class: {negative_prob:.4f}")        

Explanation:

Finally, we prepare our text input and call the encode_text function to fetch the predicted probabilities. By convention in this tutorial, index 0 corresponds to the negative class and index 1 to the positive class, so we extract the respective values and display them separately. These numbers indicate how likely the model considers the input to be negative or positive; as noted in Step 1, the untrained classification head makes these values essentially arbitrary until the model is fine-tuned on labeled sentiment data.
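
If you prefer a single predicted label instead of raw probabilities, a small illustrative extension is to take the argmax of the probability array; the index-to-label mapping below is the convention assumed in this tutorial:

import numpy as np

label_names = {0: 'negative', 1: 'positive'}  # assumed index-to-label mapping used in this tutorial
predicted_index = int(np.argmax(probs))
print(f"Predicted class: {label_names[predicted_index]} (probability: {probs[predicted_index]:.4f})")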

Congratulations! You now understand the basics of building a text classifier using Hugging Face's BERT model and tokenizer in Python. You also learned how to interpret the classification results, particularly the probabilities for the positive and negative classes. To build on this foundation, consider fine-tuning the model on a labeled dataset, exploring other NLP techniques, and testing alternative architectures from the vast ecosystem of pre-trained models available through Hugging Face's Transformers library. Happy experimenting!
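
As a point of comparison, the Transformers pipeline API can load a checkpoint that has already been fine-tuned for sentiment analysis, which yields meaningful scores out of the box. Here is a minimal sketch (the default model is downloaded automatically and is not part of the notebook above):

from transformers import pipeline

# The "sentiment-analysis" pipeline downloads a model fine-tuned for binary sentiment.
classifier = pipeline("sentiment-analysis")
result = classifier("I had a wonderful day at work!")
print(result)  # a list of dicts with 'label' and 'score' keys, e.g. [{'label': 'POSITIVE', 'score': ...}]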
