Natural Language Processing (NLP) in Practice: Building a BERT Sentiment Analysis App
Jabo Justin
Technical Support Engineer at Micro Focus and Tek-Experts (Advanced Authentication, Secure Login, and Network Security Products Team); Data Analyst, Data Engineer, BI Analyst, and Team Leader/Manager at Azubi Africa
What is Sentiment Analysis?
Sentiment analysis is a natural language processing technique used to determine the emotional tone of a word, phrase, or sentence. For example, "I love this phone" carries a positive sentiment, while "the battery died after a day" carries a negative one.
In recent years, sentiment analysis has become increasingly common in a number of domains, including social media analysis, brand monitoring, and customer service.
Pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) have made sentiment analysis tasks much easier to carry out.
This article will explore how to use Hugging Face to fine-tune a pre-trained BERT model for sentiment analysis and upload it to the Hugging Face model hub.
The full codebase for this project is available on GitHub.
Why Hugging Face?
Hugging Face is a platform that offers an extensive collection of tools and resources for machine learning and natural language processing (NLP) tasks.
Data analysts, developers, and researchers can take advantage of its large collection of pre-trained models, datasets, and libraries, as well as its intuitive interface.
Hugging Face provides a wide range of pre-trained models that have been trained on sizable datasets and are designed to carry out specific NLP tasks such as machine translation, text classification, sentiment analysis, and named entity recognition.
These models give you a head start on your analysis and spare you the time and trouble of training models from scratch.
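To see how little code such a head start requires, here is a small, hedged illustration using the off-the-shelf sentiment-analysis pipeline from transformers; the default model it downloads and the output shown in the comment are not part of this project.
from transformers import pipeline

# Load a default pre-trained sentiment classifier (downloaded on first use)
classifier = pipeline("sentiment-analysis")

# Classify a sentence; the result is a label plus a confidence score
print(classifier("I love how easy this library is to use!"))
# Example output (model-dependent): [{'label': 'POSITIVE', 'score': 0.99}]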
To gain a comprehensive understanding of natural language processing (NLP) with the Hugging Face ecosystem libraries used in this project, I recommend taking this course.
To use all of the platform's capabilities, go to the Hugging Face website and sign in.
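Once you have an account, one way to authenticate from a notebook is the login helper in huggingface_hub; a minimal sketch, which prompts for an access token created in your account settings:
from huggingface_hub import notebook_login

# Opens a prompt where you paste a Hugging Face access token
notebook_login()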
Using GPU Runtime on Google Colab
It's crucial to understand the advantages of utilizing GPU runtime on Google Colab before we dive into the code.
GPU stands for Graphics Processing Unit, a powerful piece of hardware designed to handle complex graphics and heavy numerical computation.
Because Hugging Face models are built on deep learning, training them requires a significant amount of GPU processing power.
Please use a local computer with an NVIDIA GPU, Colab, or another GPU cloud provider to complete this task.
We used Google Colab's GPU runtime in our project to speed up the training procedure.
When creating a new notebook in Google Colab, all we have to do is select the GPU runtime environment to access a GPU.
This lets us make the most of the GPU's capabilities and finish our training tasks far more quickly.
Modifying GPU runtime
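After switching the runtime, you can confirm that the GPU is actually visible; a minimal check using PyTorch, which comes preinstalled on Colab (the exact device name will vary):
import torch

# Should print True when the GPU runtime is active
print(torch.cuda.is_available())

# Print the name of the GPU assigned to the session (e.g. a Tesla T4)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))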
Setup
Now that we know how important GPUs are, let's get started with the code. First, we install Hugging Face's transformers library, a Python library that provides a variety of pre-trained models and methods for fine-tuning them. We will also install a few other prerequisites.
!pip install transformers
!pip install datasets
!pip install --upgrade accelerate
!pip install sentencepiece
After that, we import the required libraries and load the dataset. This project uses the dataset from the Zindi Challenge, which can be downloaded here.
import huggingface_hub  # Model sharing and versioning on the Hugging Face Hub
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from datasets import DatasetDict, Dataset
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

# Load the dataset from a GitHub link
url = "https://raw.githubusercontent.com/ikoghoemmanuell/Sentiment-Analysis-with-Finetuned-Models/main/data/Train.csv"
df = pd.read_csv(url)

# Eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]
After loading the dataset and removing the NaN values, we split the preprocessed data into training and validation sets.
We also created a PyTorch dataset. PyTorch datasets give our machine-learning workflow a consistent format that is efficient and easy to use.
By following this dataset format, we keep our data handling consistent and integrate smoothly with other PyTorch functionality.
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
# Create train and eval PyTorch datasets using the specified columns from the DataFrame
train_dataset = Dataset.from_pandas(train[['tweet_id', 'safe_text', 'label', 'agreement']])
eval_dataset = Dataset.from_pandas(eval[['tweet_id', 'safe_text', 'label', 'agreement']])
# Combine the train and eval datasets into a DatasetDict
dataset = DatasetDict({'train': train_dataset, 'eval': eval_dataset})
# Remove the '__index_level_0__' column from the dataset
dataset = dataset.remove_columns('__index_level_0__')
Preprocessing
We then tokenize and clean the text data. Machine learning models only understand numerical data.
Tokenization is required to generate word embeddings, which are numerical representations of text.
Word embeddings are dense vector representations of words that capture their semantic meaning and relationships.
These representations allow machines to understand word similarities and contextual information, which makes higher-level NLP tasks easier.
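To make this concrete, here is a small, hedged illustration of what the tokenizer produces for a single sentence, using the same checkpoint that is loaded in the next step (the example sentence is arbitrary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment")

# Convert a sentence into the numerical token ids the model consumes
encoded = tokenizer("Vaccines save lives!")
print(encoded["input_ids"])

# Map the ids back to the subword tokens they represent
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))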
In the preprocessing phase we use two functions that are common in this kind of workflow: one for text tokenization and one for label transformation (the latter is sketched after the tokenization code below).
Tokenization
checkpoint = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
# define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')
# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)
# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
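The transform_labels function used above is not shown in this snippet. A minimal sketch, assuming the Zindi labels are -1 (negative), 0 (neutral), and 1 (positive) and simply need to be mapped onto the 0-2 range expected by a three-class model:
def transform_labels(example):
    # Assumed mapping: -1 (negative) -> 0, 0 (neutral) -> 1, 1 (positive) -> 2
    label = example['label']
    if label == -1:
        num = 0
    elif label == 0:
        num = 1
    else:
        num = 2
    return {'labels': num}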
To prepare the dataset for training and evaluation with the sentiment analysis model, we preprocess it by tokenizing the text data, transforming the labels, and removing the unneeded columns.
Training
Now that the data is preprocessed, we can fine-tune the pre-trained model for sentiment analysis. First, let's define our training parameters.
# Configure the training parameters, such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=10,
    load_best_model_at_end=True,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    logging_steps=100,
    per_device_train_batch_size=16,
)
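The arguments above leave the learning rate at the Trainer's default. If you want to control it explicitly, TrainingArguments also accepts a learning_rate parameter; a hedged variation (2e-5 is a common starting point for fine-tuning transformers, not necessarily the value used in this project):
training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # explicit learning rate instead of the default
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)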
The number of epochs, the batch size, and the learning rate are among the hyperparameters we can set for the model's training.
After loading the pre-trained model and shuffling the data, we specify the evaluation metric; we use RMSE in this instance.
# Load a pretrained model while specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Shuffle the train and eval splits
train_dataset = dataset['train'].shuffle(seed=24)
eval_dataset = dataset['eval'].shuffle(seed=24)

from sklearn.metrics import mean_squared_error  # needed for the RMSE metric below

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"rmse": mean_squared_error(labels, predictions, squared=False)}
By initializing a Trainer object with the following parameters, we can quickly train and evaluate our model on the provided training and evaluation datasets.
The Trainer class takes care of the training loop, optimization, logging, and evaluation, which lets us concentrate on building and analyzing the model.
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Lastly, we can train our model with:
trainer.train()
And launch the final evaluation with:
trainer.evaluate()
Here is a basic example that just uses 10 fine-tuning epochs.
Go here to learn more about the fine-tuning idea.
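The introduction mentioned uploading the fine-tuned model to the Hugging Face model hub. One hedged way to do this, assuming you are already logged in; "my-sentiment-model" is purely a placeholder repository name:
# Push the fine-tuned model and its tokenizer to your Hugging Face account
# ("my-sentiment-model" is a placeholder repository name)
model.push_to_hub("my-sentiment-model")
tokenizer.push_to_hub("my-sentiment-model")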
Next Steps
Are you unsure about your next steps? The next step would be to deploy your model, for example with Streamlit or Gradio.
This gives you a web app that your users can use to make predictions. These screenshots show two web apps built with the newly fine-tuned model.
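As a starting point, here is a minimal, hedged Gradio sketch; "your-username/my-sentiment-model" is a placeholder for the repository you pushed to the hub, and the label formatting is just one possible choice:
import gradio as gr
from transformers import pipeline

# Load the fine-tuned model from the hub (placeholder repository name)
classifier = pipeline("sentiment-analysis", model="your-username/my-sentiment-model")

def predict(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

# A simple text-in, text-out interface
demo = gr.Interface(fn=predict, inputs="text", outputs="text", title="Tweet Sentiment")
demo.launch()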
Conclusion
In conclusion, we fine-tuned a pre-trained model for sentiment analysis on our dataset using the Hugging Face libraries. After ten training epochs, the model achieved an RMSE score of 0.7 on the validation set.
Here is the complete code for this project.
Resources