Battle of the Transformers: Fine-Tune BERT for State-of-the-Art Sentiment Analysis Using Hugging Face
Courtlin Holt-Nguyen
Head of Data @ QIMA - AI, BI, Data Engineering and Smart Productivity | ex- Head of Enterprise Analytics for a Fortune 500 FMCG company in Vietnam | Data Strategy, Analytics, ML, Data Scientist
What is a Transformer?
In the context of machine learning and NLP, a transformer is a deep learning model introduced in a paper titled “Attention is All You Need” by Vaswani et al. in 2017. The model was proposed as a way to improve the performance of translation systems.
The name “transformer” stems from its ability to transform one sequence (input text) into another sequence (output text) while incorporating the context of the input sequence at multiple levels. It was a groundbreaking model because it introduced the concept of ‘attention’, which allows the model to focus on relevant parts of the input sequence when producing the output.
A Transformer model is composed of an encoder to read the text input and a decoder to produce a prediction for the task. The magic of Transformers is that they manage to maintain a high level of accuracy even for very long input sequences.
One of the most significant advantages of Transformers over their RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) counterparts is their ability to process the entire input sequence concurrently, whereas RNNs must process the input one token at a time, which is slow and makes it hard to relate distant tokens. This concurrent processing allows Transformers to easily capture long-distance dependencies in text, making them highly effective for NLP tasks.
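To make the idea of attention concrete, here is a minimal sketch of scaled dot-product attention, the core operation from the "Attention is All You Need" paper (illustrative PyTorch, not the exact BERT implementation):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Similarity between every query position and every key position
    scores = query @ key.transpose(-2, -1) / (query.size(-1) ** 0.5)
    # Normalize the scores into attention weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # Each output position is a weighted mix of all value vectors
    return weights @ value

# Toy example: a "sentence" of 5 tokens with 8-dimensional embeddings
x = torch.randn(1, 5, 8)
output = scaled_dot_product_attention(x, x, x)  # self-attention
print(output.shape)  # torch.Size([1, 5, 8])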
Transformers are the foundation of many modern NLP models, including BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pretrained Transformer), and others that we will discuss later in this article. These models have set new records across a variety of NLP tasks, including text classification, sentiment analysis, and language translation. Keep reading to learn about the major Transformers that have been developed.
What are the different types of transformer models?
BERT Transformer
Bidirectional Encoder Representations from Transformers, better known as BERT, is a transformer-based machine learning model for natural language processing (NLP) that was developed and open-sourced by Google.
BERT considers the full context of a word by looking at the words that come before and after it, which is called bidirectional training. This is very different from previous NLP models, which primarily operated in a unidirectional manner or independently considered each word in a sentence.
The BERT architecture consists of a stack of Transformer encoder layers. This design enables it to understand the meaning of a word based on its context within a sentence, which is important when dealing with ambiguous language.
BERT’s training process involves masking random words in a sentence and having the model predict the original values. The model is also trained to predict whether one sentence follows another, which helps it understand the relationships between sentences.
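You can see the masked-word objective in action with Hugging Face’s fill-mask pipeline (a quick illustration; the exact predictions and scores will vary):

from transformers import pipeline

# BERT proposes the most likely tokens for the [MASK] position
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The movie was absolutely [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))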
BERT can be used for various NLP tasks, including text classification, named entity recognition, question answering, and sentiment analysis. BERT also serves as the base for derivative models such as RoBERTa and DistilBERT.
Competing Transformers
Other popular transformers include RoBERTa, DistilBERT, XLNet and ELECTRA, all of which I will cover in the following article in this series.
Fine-tuning a Pre-Trained NLP Model for Sentiment Analysis With a Labeled Dataset
Transformer models are general-purpose language models trained on a large amount of text data. They learn the structure of language and can generate or process text in a way that makes sense. However, a pre-trained model doesn’t inherently understand the specific task you want it to perform, such as sentiment analysis, named entity recognition or question answering. This is where fine-tuning comes in.
Fine-tuning is the process of training the model on a smaller, task-specific dataset after pre-training, adapting its general-purpose knowledge to a specific task. In the case of sentiment analysis, the model needs to understand not just the text, but the sentiment behind it, which can be highly context-specific and nuanced.
By fine-tuning BERT on a sentiment analysis task, you’re adapting the model to understand these nuances and make accurate predictions about sentiment. Without fine-tuning, BERT would simply understand the language but would struggle to correctly identify and classify sentiments. Fine-tuning essentially “specializes” the model for a given task, leveraging the general-purpose language understanding capabilities of a Transformer to excel at that specific task.
Think of it as a general medical practitioner (BERT) who then goes on to specialize in cardiology (sentiment analysis) — they can’t become a cardiologist without additional, specific training, even though their general medical training forms a strong base for their specialization.
What is HuggingFace?
Hugging Face is a technology company that has made substantial contributions to the field of Natural Language Processing (NLP) and Machine Learning, particularly through its popular open-source library, Transformers, which has revolutionized NLP research and development. The library provides thousands of pre-trained models, enabling anyone to leverage sophisticated machine learning models with just a few lines of code. These models include BERT, GPT-2, RoBERTa, and many others, each of which can be fine-tuned for specific tasks such as text classification, sentiment analysis and question answering.
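To give a sense of how little code that takes, here is a minimal example using the library’s sentiment-analysis pipeline (this downloads a default pre-fine-tuned model, not the one we train below):

from transformers import pipeline

# Downloads a default model already fine-tuned for sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("I loved this movie!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]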
Fine-Tuning BERT Model for Sentiment Analysis using the Hugging Face Transformers Library
BERT Model for Sentiment Analysis
In the following code tutorial, we will fine-tune a state-of-the-art BERT model to perform sentiment analysis of movie reviews.
The Dataset for Fine-Tuning
The dataset I will be using for this experiment is the IMDB Large Movie Review Dataset of 50,000 labeled reviews provided by Andrew Maas et al. at Stanford in their paper, Learning Word Vectors for Sentiment Analysis. You can download the full dataset of 50,000 labeled movie reviews from the dataset website at https://ai.stanford.edu/~amaas/data/sentiment/.
The validation dataset (IMDB_Dataset_VALIDATE.csv) is a subset of this training set consisting of 200 unseen reviews.
WARNING: I strongly recommend using a machine with a good GPU when fine-tuning a transformer model. If you try to use a machine with only a CPU (like the standard Google Colab runtime), training a transformer will take forever and will almost certainly crash unless you have a high-RAM (25+ GB) machine. For this experiment, I’m using Google Colab Pro with a high-RAM machine (24 GB) and an Nvidia V100 GPU (16 GB). Using the V100 GPU on Google Colab, fine-tuning BERT took only 167 seconds. Using only the CPU, the fine-tuning process was still running after 2 hours. If you’re going to be doing NLP work with transformers, it makes sense to upgrade to Google Colab Pro.
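If you’re not sure what hardware your runtime has, a quick check like the one below lets your code fall back to the CPU gracefully (the tutorial code that follows hard-codes 'cuda' for brevity):

import torch

# Fall back to the CPU if no GPU is available (expect much slower training)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')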
Ok, let’s get into the code. To start with, let’s install the required packages and load the necessary libraries for this experiment.
Setup
We will be using PyTorch and the Hugging Face Transformers library to fine-tune our base BERT model, so we need to install both in Google Colab first.
!pip install torch transformers
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd
Now, load the dataset. I am using a file named IMDB_Dataset_TRAIN_only2K.csv that consists of two columns: the review text (Reviews) and the sentiment label (Sentiments).
# Load your DataFrame
df = pd.read_csv('/content/IMDB_Dataset_TRAIN_only2K.csv')
df.info()
# The dataset has positive and negative labels so we need to convert them to 0 and 1 values.
df['Sentiments'] = df['Sentiments'].replace({'positive': 1, 'negative': 0})
texts = df['Reviews'].tolist() # extract the reviews
y_true = df['Sentiments'].tolist() # extract the actual sentiments
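Before going further, it’s worth a quick sanity check that the labels mapped cleanly and the two classes are roughly balanced (a small check I’d add, not part of the original walkthrough):

# Sanity check: label distribution after mapping to 0/1
print(df['Sentiments'].value_counts())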
Baseline Performance
First, let’s establish a baseline for BERT’s sentiment analysis capabilities out of the box, i.e. without any fine-tuning.
from transformers import BertTokenizerFast, BertForSequenceClassification
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import pandas as pd
import torch
# Load the model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model = model.to('cuda') # if GPU is available
# Load validation data
val_data = pd.read_csv('/content/IMDB_Dataset_VALIDATE.csv')
val_texts = val_data['Reviews'].tolist()
val_labels = val_data['Sentiments'].map({'positive': 1, 'negative': 0}).tolist() # convert sentiment to numeric
val_data.info()
# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Tokenize data
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
# Create torch dataset for validation
class ReviewDataset(Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is not None:  # labels are optional at inference time
            item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.encodings['input_ids'])
val_dataset = ReviewDataset(val_encodings, val_labels)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# Predict with the model
model.eval()
predictions = []
true_labels = []
for batch in val_loader:
    input_ids = batch['input_ids'].to('cuda')
    attention_mask = batch['attention_mask'].to('cuda')
    labels = batch['labels'].to('cuda')
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predicted_labels = torch.argmax(logits, dim=1).cpu().numpy()
    predictions.extend(predicted_labels)
    true_labels.extend(labels.cpu().numpy())
# Calculate metrics
accuracy = accuracy_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)
conf_matrix = confusion_matrix(true_labels, predictions)
print(f'Accuracy: {accuracy}')
print(f'F1-score: {f1}')
print(f'Confusion matrix:\n {conf_matrix}')
As expected, without fine-tuning, BERT’s performance is terrible: 49.5% accuracy on a binary classification task, essentially a coin flip. That’s no surprise, because the classification head that BertForSequenceClassification attaches on top of the base model is randomly initialized until we train it. The BERT base model understands the structure of human language but has not been specifically taught how to perform sentiment analysis. Despite a valiant effort, it fails miserably.
Train the Model (i.e. Fine-Tune the Pre-Trained Model) for Sentiment Analysis
Now, let’s use 2,000 labeled training examples to teach BERT how to perform sentiment analysis on movie reviews (i.e. fine-tune it). This fine-tuning file is named IMDB_Dataset_TRAIN_only2K.csv and consists of the same two columns, Reviews and Sentiments.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertForSequenceClassification
from sklearn.model_selection import train_test_split
import pandas as pd
from torch.optim import AdamW
import time
# Record start time
start_time = time.time()
# Load data
data = pd.read_csv('/content/IMDB_Dataset_TRAIN_only2K.csv')
data['Sentiments'] = data['Sentiments'].map({'positive': 1, 'negative': 0})
reviews = data['Reviews'].tolist()
labels = data['Sentiments'].tolist() # assuming sentiment is encoded as 0 (negative) and 1 (positive)
# Split data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(reviews, labels, test_size=0.2)
# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Tokenize data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
# Create torch dataset
class ReviewDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
# Create dataloaders
train_dataset = ReviewDataset(train_encodings, train_labels)
val_dataset = ReviewDataset(val_encodings, val_labels)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# Initialize model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model = model.to('cuda') # if GPU is available
# Initialize optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)
# Training loop
for epoch in range(3):  # number of epochs
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to('cuda')
        attention_mask = batch['attention_mask'].to('cuda')
        labels = batch['labels'].to('cuda')
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
# Save the model
model.save_pretrained('sentiment_model_BERT')
# Record end time
end_time = time.time()
print("Time required to fine-tune: ", end_time - start_time)
Using a powerful GPU from Google Colab (the Nvidia V100), BERT’s fine-tuning was completed in 137 seconds.
Evaluate the Classification Accuracy of the BERT Model
Let’s see the effect of our fine-tuning on BERT’s ability to classify movie sentiment. Here’s the code to evaluate BERT’s performance using Accuracy and F1-Score.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from transformers import BertTokenizerFast, BertForSequenceClassification
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import torch
# Load the model
model = BertForSequenceClassification.from_pretrained('sentiment_model_BERT')
model = model.to('cuda') # if GPU is available
# Load validation data
val_data = pd.read_csv('/content/IMDB_Dataset_VALIDATE.csv')
val_texts = val_data['Reviews'].tolist()
val_labels = val_data['Sentiments'].map({'positive': 1, 'negative': 0}).tolist() # convert sentiment to numeric
# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Tokenize data
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
# Create torch dataset for validation
class ReviewDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
val_dataset = ReviewDataset(val_encodings, val_labels)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# Evaluate the model
model.eval()
predictions = []
true_labels = []
for batch in val_loader:
    input_ids = batch['input_ids'].to('cuda')
    attention_mask = batch['attention_mask'].to('cuda')
    labels = batch['labels'].to('cuda')
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predicted_labels = torch.argmax(logits, dim=1).cpu().numpy()
    predictions.extend(predicted_labels)
    true_labels.extend(labels.cpu().numpy())
# Calculate metrics
accuracy = accuracy_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)
conf_matrix = confusion_matrix(true_labels, predictions)
print(f'Accuracy: {accuracy}')
print(f'F1-score: {f1}')
print(f'Confusion matrix:\n {conf_matrix}')
Wow, the performance improvement is impressive for the newly trained model! With a relatively small, labeled training dataset, BERT is able to correctly classify 91.5% of the reviews in our unseen validation dataset.
Note: Beyond 1,000 training examples, the incremental accuracy improvement was marginal. In my tests, going from 1,000 to 2,000 training examples doubled the training time but improved accuracy on the unseen holdout validation dataset by only 1.5 percentage points. In fact, with as few as 500 fine-tuning examples, BERT and most comparable models learn enough about sentiment analysis to reach impressive levels of accuracy (80%–90%+).
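If you want to reproduce this learning-curve experiment yourself, a simple approach is to subsample the training CSV before fine-tuning (a sketch using the same file and column names as above; the output file names are my own invention):

import pandas as pd

df = pd.read_csv('/content/IMDB_Dataset_TRAIN_only2K.csv')

# Fine-tune on progressively larger subsets to trace the learning curve
for n in [500, 1000, 2000]:
    subset = df.sample(n=n, random_state=42)
    subset.to_csv(f'/content/IMDB_train_{n}.csv', index=False)
    # ...then rerun the fine-tuning and evaluation code on each file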
Conclusion
And there you have it. You now understand what a BERT transformer model is and how to fine-tune it for a natural language processing task using Python. Although the base model of BERT performs terribly on a sentiment analysis task, a little fine-tuning is all that’s required to supercharge its performance. As long as you have access to a decent GPU, the process should only take a few minutes to achieve state-of-the-art levels of sentiment analysis accuracy, which would have been unthinkable just a few years ago.