Natural Language Processing (NLP) in Practice: Building a BERT Sentiment Analysis App
Jabo Justin
Technical Support Engineer at Micro Focus and Tek-Experts (Advanced Authentication, Secure Login, and Network Security Products Team); Data Analyst, Data Engineer, BI Analyst, and Team Leader/Manager at Azubi Africa
What is Sentiment Analysis?
Sentiment analysis is a natural language processing technique used to determine the emotional tone of a word, phrase, or sentence. For example, "I love this phone" carries a positive sentiment, while "the battery died after a day" carries a negative one.
In recent years, sentiment analysis has become increasingly common in a number of domains, including social media analysis, brand monitoring, and customer service.
Pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) have made sentiment analysis tasks much easier to carry out.
This article will explore how to use Hugging Face to fine-tune a pre-trained BERT model for sentiment analysis and upload it to the Hugging Face model hub.
The full codebase for this project is available on GitHub.
Why Hugging Face?
Hugging Face is a platform that offers an extensive collection of tools and resources for machine learning and natural language processing (NLP) tasks.
Data analysts, developers, and researchers can take advantage of its large collection of pre-trained models, datasets, and libraries, as well as its intuitive interface.
Hugging Face provides a wide range of pre-trained models that have been trained on sizable datasets and are designed to carry out specific NLP tasks such as machine translation, text classification, sentiment analysis, and named entity recognition.
These models give you a head start on your analysis and spare you the time and trouble of training models from scratch.
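To see how little code such a head start requires, here is a small, hedged illustration using the off-the-shelf sentiment-analysis pipeline from transformers; the default model it downloads and the output shown in the comment are not part of this project.
from transformers import pipeline

# Load a default pre-trained sentiment classifier (downloaded on first use)
classifier = pipeline("sentiment-analysis")

# Classify a sentence; the result is a label plus a confidence score
print(classifier("I love how easy this library is to use!"))
# Example output (model-dependent): [{'label': 'POSITIVE', 'score': 0.99}]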
To gain a comprehensive understanding of natural language processing (NLP) with the Hugging Face ecosystem libraries used in this project, I recommend taking this course.
To use all of the platform's capabilities, go to the Hugging Face website and sign in.
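Once you have an account, one way to authenticate from a notebook is the login helper in huggingface_hub; a minimal sketch, which prompts for an access token created in your account settings:
from huggingface_hub import notebook_login

# Opens a prompt where you paste a Hugging Face access token
notebook_login()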
Using GPU Runtime on Google Colab
It's crucial to understand the advantages of utilizing GPU runtime on Google Colab before we dive into the code.
GPU stands for Graphics Processing Unit, a powerful piece of hardware designed to handle complex graphics and heavy numerical computation.
Because Hugging Face models are built on deep learning, training them requires a significant amount of GPU processing power.
Please use a local computer with an NVIDIA GPU, Colab, or another GPU cloud provider to complete this task.
We used Google Colab's GPU runtime in our project to speed up the training procedure.
When creating a new notebook in Google Colab, all we have to do is select the GPU runtime environment to access a GPU.
This lets us make the most of the GPU's capabilities and finish our training tasks far more quickly.
Modifying GPU runtime
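After switching the runtime, you can confirm that the GPU is actually visible; a minimal check using PyTorch, which comes preinstalled on Colab (the exact device name will vary):
import torch

# Should print True when the GPU runtime is active
print(torch.cuda.is_available())

# Print the name of the GPU assigned to the session (e.g. a Tesla T4)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))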
Setup
Now that we know how important GPUs are, let's get started with the code. First, we install Hugging Face's transformers library, a Python library that provides a variety of pre-trained models and methods for fine-tuning them. We will also install a few other prerequisites.
!pip install transformers
!pip install datasets
!pip install --upgrade accelerate
!pip install sentencepiece
After that, we import the required libraries and load the dataset. This project uses the dataset from the Zindi Challenge, which can be downloaded here.
import huggingface_hub  # Model sharing and versioning on the Hugging Face Hub
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from datasets import DatasetDict, Dataset
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

# Load the dataset from a GitHub link
url = "https://raw.githubusercontent.com/ikoghoemmanuell/Sentiment-Analysis-with-Finetuned-Models/main/data/Train.csv"
df = pd.read_csv(url)

# Eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]
After loading the dataset and removing the NaN values, we split the preprocessed data into training and validation sets.
We also created a PyTorch dataset. PyTorch datasets give our machine-learning workflow a consistent format that is efficient and easy to use.
By following this dataset format, we keep our data handling consistent and integrate smoothly with other PyTorch functionality.
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
# Create train and eval PyTorch datasets using the specified columns from the DataFrame
train_dataset = Dataset.from_pandas(train[['tweet_id', 'safe_text', 'label', 'agreement']])
eval_dataset = Dataset.from_pandas(eval[['tweet_id', 'safe_text', 'label', 'agreement']])
# Combine the train and eval datasets into a DatasetDict
dataset = DatasetDict({'train': train_dataset, 'eval': eval_dataset})
# Remove the '__index_level_0__' column from the dataset
dataset = dataset.remove_columns('__index_level_0__')
Preprocessing
We then tokenize and clean the text data. Machine learning models only understand numerical data.
Tokenization is required to generate word embeddings, which are numerical representations of text.
Word embeddings are dense vector representations of words that capture their semantic meaning and relationships.
These representations allow machines to understand word similarities and contextual information, which makes higher-level NLP tasks easier.
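To make this concrete, here is a small, hedged illustration of what the tokenizer produces for a single sentence, using the same checkpoint that is loaded in the next step (the example sentence is arbitrary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment")

# Convert a sentence into the numerical token ids the model consumes
encoded = tokenizer("Vaccines save lives!")
print(encoded["input_ids"])

# Map the ids back to the subword tokens they represent
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))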
In the preprocessing phase we use two functions that are common in this kind of workflow: one for text tokenization and one for label transformation (the latter is sketched after the tokenization code below).
Tokenization
checkpoint = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
# define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')
# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)
# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
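The transform_labels function used above is not shown in this snippet. A minimal sketch, assuming the Zindi labels are -1 (negative), 0 (neutral), and 1 (positive) and simply need to be mapped onto the 0-2 range expected by a three-class model:
def transform_labels(example):
    # Assumed mapping: -1 (negative) -> 0, 0 (neutral) -> 1, 1 (positive) -> 2
    label = example['label']
    if label == -1:
        num = 0
    elif label == 0:
        num = 1
    else:
        num = 2
    return {'labels': num}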
To prepare the dataset for training and evaluation with the sentiment analysis model, we preprocess it by tokenizing the text data, transforming the labels, and removing the unneeded columns.
Training
Now that the data is preprocessed, we can fine-tune the pre-trained model for sentiment analysis. First, let's define our training parameters.
# Configure the training parameters, such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=10,
    load_best_model_at_end=True,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    logging_steps=100,
    per_device_train_batch_size=16,
)
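The arguments above leave the learning rate at the Trainer's default. If you want to control it explicitly, TrainingArguments also accepts a learning_rate parameter; a hedged variation (2e-5 is a common starting point for fine-tuning transformers, not necessarily the value used in this project):
training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # explicit learning rate instead of the default
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)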
The number of epochs, the batch size, and the learning rate are among the hyperparameters we can set for the model's training.
After loading the pre-trained model and shuffling the data, we specify the evaluation metric; we use RMSE in this instance.
# Load a pretrained model while specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Shuffle the train and eval splits
train_dataset = dataset['train'].shuffle(seed=24)
eval_dataset = dataset['eval'].shuffle(seed=24)

from sklearn.metrics import mean_squared_error  # needed for the RMSE metric below

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"rmse": mean_squared_error(labels, predictions, squared=False)}
By initializing a Trainer object with the following parameters, we can quickly train and evaluate our model on the provided training and evaluation datasets.
The Trainer class takes care of the training loop, optimization, logging, and evaluation, which lets us concentrate on building and analyzing the model.
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Lastly, we can train our model with:
trainer.train()
And launch the final evaluation with:
trainer.evaluate()
Here is a basic example that just uses 10 fine-tuning epochs.
Go here to learn more about the fine-tuning idea.
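The introduction mentioned uploading the fine-tuned model to the Hugging Face model hub. One hedged way to do this, assuming you are already logged in; "my-sentiment-model" is purely a placeholder repository name:
# Push the fine-tuned model and its tokenizer to your Hugging Face account
# ("my-sentiment-model" is a placeholder repository name)
model.push_to_hub("my-sentiment-model")
tokenizer.push_to_hub("my-sentiment-model")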
Next Steps
Are you unsure about your next steps? The next step would be to deploy your model, for example with Streamlit or Gradio.
This gives you a web app that your users can use to make predictions. These screenshots show two web apps built with the newly fine-tuned model.
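As a starting point, here is a minimal, hedged Gradio sketch; "your-username/my-sentiment-model" is a placeholder for the repository you pushed to the hub, and the label formatting is just one possible choice:
import gradio as gr
from transformers import pipeline

# Load the fine-tuned model from the hub (placeholder repository name)
classifier = pipeline("sentiment-analysis", model="your-username/my-sentiment-model")

def predict(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

# A simple text-in, text-out interface
demo = gr.Interface(fn=predict, inputs="text", outputs="text", title="Tweet Sentiment")
demo.launch()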
Conclusion
In conclusion, we fine-tuned a pre-trained model for sentiment analysis on our dataset using the Hugging Face libraries. After ten training epochs, the model achieved an RMSE score of 0.7 on the validation set.
Here is the complete code for this project.
Resources