Fine-Tuning BERT for Named Entity Recognition - Bidirectional Encoder Representations from Transformers - Part 4
In this article we will work through the implementation of fine-tuning BERT with PyTorch, end to end in code.
Before you move ahead, it is advisable to read Fundamentals of BERT - Part 1 and Fundamentals of BERT - Part 2 to get an idea of the core principles of BERT. Further, you can read Building BERT from Scratch - Part 3 for an overview of the pre-training phase.
In order to fine-tune BERT, we will solve a concrete problem: Named Entity Recognition (NER).
The objective of Named Entity Recognition (NER) is to identify and categorize named entities within a given sequence. These named entities fall into pre-established categories that are selected based on the specific use case, including but not limited to names of individuals, organizations, locations, codes, temporal expressions, and monetary amounts. In essence, NER seeks to assign a class to each token, which is typically a single word, within the sequence. Consequently, NER is often described as a form of token classification.
To fine-tune BERT for the task of Named Entity Recognition (NER), we will adopt a systematic approach that leverages the strengths of the BERT architecture while tailoring it to effectively identify and classify named entities within text. Named Entity Recognition is a crucial component of natural language processing, as it involves the identification of proper nouns and specific entities such as names of people, organizations, locations, dates, and other relevant terms within a given text.
To achieve this, we will utilize a Kaggle dataset that has been specifically curated for NER tasks. This dataset typically contains annotated text samples where named entities are labeled according to their respective categories. By employing this dataset, we can ensure that our model is trained on high-quality, relevant examples that reflect the complexities and nuances of real-world language use.
We will use the Kaggle dataset - Name Entity Recognition (NER) Dataset
Download Link - https://www.kaggle.com/datasets/debasisdotcom/name-entity-recognition-ner-dataset?select=NER+dataset.csv
About Kaggle Dataset
Context
This is a very clean dataset for anyone who wants to try their hand at the NER (Named Entity Recognition) task of NLP.
Content
The dataset has roughly 1M rows and 4 columns, ['Sentence #', 'Word', 'POS', 'Tag'], and is grouped by sentence number.
Columns
Word: This column contains English dictionary words from the sentence it is taken from.
POS: Parts of speech tag
Tag: Standard named entity recognition tags as follows:
'O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat'
There are 8 category tags, each with a "beginning" (B) and "inside" (I) variant, and the "outside" (O) tag.
The dataset does not spell out what each tag means, but they are easy to infer: "geo" stands for geographical entity, "gpe" for geopolitical entity, "per" for person, "org" for organization, "tim" for time, and so on.
Example
Consider the following sample sentence:
Samsung was founded in 1938 by Lee Byung-chul as a trading company.
And the mapping of the NER tag associated with each word is as follows:
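An illustrative mapping, following the BIO scheme above:

Samsung → B-org
was → O
founded → O
in → O
1938 → B-tim
by → O
Lee → B-per
Byung-chul → I-per
as → O
a → O
trading → O
company → O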
Note: We only need the Sentence, Word and Tag columns for Named Entity Recognition.
What is NER?
NER, or Named Entity Recognition, is a component of Natural Language Processing (NLP) that focuses on identifying and categorizing named entities within a given text.
What does it do?
NER locates and classifies named entities in text, such as people, places, organizations, time expressions, and quantities.
How does it work?
Named Entity Recognition (NER) detects specific words or phrases within a document and subsequently categorizes them into established classifications.
Why is it useful?
Named Entity Recognition (NER) enhances document translation by offering contextual insights that aid in comprehending sentences. For instance, in the sentence "Apple released a new iPhone today," NER can recognize "Apple" as an organization.
Named Entity Recognition (NER) can be utilized in various practical scenarios where the analysis of extensive text data proves beneficial. Typical applications include information extraction, search and recommendation, and customer-support analytics.
Modeling Strategy
In the code, we will utilize BertForTokenClassification, a model provided by the Transformers library from HuggingFace. This model is built on the BERT architecture and features a token classification head, enabling it to perform predictions at the token level instead of the sequence level.
Named entity recognition is generally approached as a token classification task, and that is exactly how we implement it here.
This article explores the concept of transfer learning, which involves initially pre-training a large neural network in an unsupervised manner, followed by fine-tuning the network for a specific task. In this instance, BERT serves as the pre-trained neural network, having been trained on two tasks: masked language modeling and next sentence prediction.
Fine-tuning involves supervised learning, which indicates that a labeled dataset is required.
Note:
In summary, this article is intended for readers who are already comfortable with these foundational topics; it builds on that knowledge to cover fine-tuning BERT for token classification with PyTorch.
This article is divided into following sections:
A. Setting Environment
B. Set the Device (cpu or mps)
C. Reading Dataset
D. Pre-Processing
E. Explore Distribution of Labels
F. Stratified Splitting
G. Explore Distribution of Labels - Training, Validation and Testing Dataset
H. Modelling (includes training loop)
I. Evaluation
J. Testing
K. Helper Function to make clean Predictions
L. Inference
M. Saving and Loading Utilities for the Model
A. Setting up the environment and importing Python libraries
Please install the following libraries to set up the environment:
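One possible setup, assuming the usual PyTorch and HuggingFace stack (the exact package list and versions in the original notebook may differ):

```python
# pip install torch transformers pandas numpy scikit-learn matplotlib seaborn

import numpy as np
import pandas as pd
import torch
from transformers import BertTokenizerFast, BertForTokenClassification
```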
B. Set Device - Check if MPS (Metal Performance Shaders) is available
PyTorch uses the new Metal Performance Shaders (MPS) backend for GPU training acceleration. This MPS backend extends the PyTorch framework, providing scripts and capabilities to set up and run operations on Mac.
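A small device-selection snippet along these lines:

```python
# Prefer Apple's MPS backend when available, otherwise fall back to CUDA or CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
```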
C. Reading Dataset
Load the NER dataset from its path into a pandas DataFrame.
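A minimal sketch (the file path is an assumption; this Kaggle file is often not UTF-8 encoded, hence the explicit encoding):

```python
# Read the Kaggle NER dataset; the encoding handles the escaped characters it contains
df = pd.read_csv("NER dataset.csv", encoding="unicode_escape")
df.head()
```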
Peek into raw data:
Let us now explore the pre-processing steps.
D. Preprocessing Data
1. Fill the NaN values in the 'Sentence #' column
The sentence number column, 'Sentence #', has a value only in the first row of each sentence; the remaining rows are NaN. We will assign the sentence number to every word using the forward-fill method.
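For example:

```python
# Propagate each sentence number down to every word of that sentence
df["Sentence #"] = df["Sentence #"].ffill()
df.head(10)
```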
Peek into transformed dataset.
1.1 Check the shape
2. Drop the unwanted 'POS' column
We don't need the POS column, so we will drop it (handled in the combined sketch after step 3 below).
3. Remove all rows that contain NaN
We need to remove all rows that have a NaN entry.
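A sketch covering steps 2 and 3:

```python
# Drop the POS column, then remove any rows that still contain NaN values
df = df.drop(columns=["POS"])
df = df.dropna().reset_index(drop=True)
```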
4. Verify if there are any NaN values in Word and Tag Columns
Let us verify that we have successfully removed all NaN entries.
4.1 Tag columns
4.2 Word columns
5. Checking Improper Words
We need to remove all words that contain escaped characters like \x94, \x85, etc.
These words appear blank, but they still have a tag mapped to them, so they can cause alignment problems during tokenization.
5.1 Helper Function
A helper function to check whether a word contains escaped Unicode characters. We will remove every word that has length less than or equal to 2 and contains an escaped Unicode character.
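A sketch of such a check, treating any non-printable character as an escaped or control character:

```python
def has_escaped_unicode(word: str) -> bool:
    # True if the word contains non-printable characters such as '\x94' or '\x85'
    return any(not ch.isprintable() for ch in str(word))
```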
5.2 Create a bool column - 'is_proper_word'
5.3 Check the data which has improper words
5.4 Retain the data which has only proper words
5.5 Drop the newly created column "is_proper_word"
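Putting steps 5.2 through 5.5 together, one possible implementation:

```python
# 5.2 A word is improper if it is very short and contains escaped characters
df["is_proper_word"] = ~(df["Word"].str.len().le(2) & df["Word"].map(has_escaped_unicode))

# 5.3 Inspect the improper rows
print(df[~df["is_proper_word"]])

# 5.4 Retain only the proper words
df = df[df["is_proper_word"]].reset_index(drop=True)

# 5.5 Drop the helper column
df = df.drop(columns=["is_proper_word"])
```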
6. Frequency of each tag
Compute the frequency of each tag.
7. Count the frequencies of sub-tags in decreasing order
We have sub-tags like geo, org, eve, per, etc. Let's compute the frequency of each sub-tag in decreasing order.
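A sketch covering steps 6 and 7:

```python
# Frequency of each full tag, then of each sub-tag (the part after 'B-' / 'I-')
print(df["Tag"].value_counts())
print(df["Tag"].str.split("-").str[-1].value_counts())
```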
8. Dealing with Tags that could potentially act as Noise
From the previous section we can see that the sub-tags art, eve and nat are relatively rare. Hence the tags "B-art", "I-art", "B-eve", "I-eve", "B-nat" and "I-nat" could potentially act as noise. Note that the tag I-gpe is also relatively rare, so it could act as noise as well. Therefore we will replace all such tags.
8.1. Convert such sub-tags to "O", the most frequent tag
For every tag that is removed, we substitute "O" in its place, so that the associated word still contributes its interactions with the other words.
8.2 Compute the updated frequency of each tag
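A sketch:

```python
# Replace the rare tags with 'O' so they do not act as label noise
noisy_tags = ["B-art", "I-art", "B-eve", "I-eve", "B-nat", "I-nat", "I-gpe"]
df["Tag"] = df["Tag"].replace(noisy_tags, "O")
```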
9. Compute Label Maps
Create two maps: label2id, which maps each tag to an integer id, and id2label, the inverse map from id back to tag.
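A minimal version (the label ordering here is an assumption; the original notebook may order labels differently):

```python
# label2id: tag -> integer id; id2label: integer id -> tag
labels = ["O"] + sorted(t for t in df["Tag"].unique() if t != "O")
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
```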
10. Create complete sentences with their labels as one string
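Something along these lines (column names are illustrative):

```python
# Collapse word-level rows into one row per sentence, keeping words and tags aligned
sentences = df.groupby("Sentence #").agg(
    sentence=("Word", " ".join),
    labels=("Tag", " ".join),
).reset_index()
```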
10.1 Drop duplicate sentences + labels and reset index
11. Remove the sentences that contain only O tags and don't contain any other target labels
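Continuing with the same illustrative names:

```python
# Keep only sentences that contain at least one non-'O' tag
mask = sentences["labels"].str.split().apply(lambda tags: any(t != "O" for t in tags))
sentences = sentences[mask].reset_index(drop=True)
```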
12. Create additional columns to get an idea of sentence length and BERT token length
First we will set the tokenizer to be used, which is the bert-base-uncased tokenizer. We will get an idea of the number of words in each sentence, the number of tokens produced for every sentence after tokenization, and the difference between the two.
BERT uses the WordPiece algorithm to tokenize a sentence. WordPiece is a subword-based tokenization algorithm, so it can split a word into multiple tokens. Hence the number of tokens will always be greater than or equal to the number of words in a sentence.
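A sketch of these columns (names are illustrative):

```python
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

sentences["word_len"] = sentences["sentence"].str.split().str.len()
sentences["bert_len"] = sentences["sentence"].apply(
    lambda s: len(tokenizer.tokenize(s))   # number of WordPiece tokens
)
sentences["len_diff"] = sentences["bert_len"] - sentences["word_len"]
```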
Peek into the data
13. Let's see the histograms of tokenized sentence length and the length differences
13.1. Histogram of difference in sentence length and label length (# of label)
13.2. Histogram of length of tokenized sentence (# of tokens created by tokenizer)
13.3 Histogram of difference in length between sentence length and bert tokenized sentence length
14. Setting a meaningful Max Len
We need to set the maximum sequence length so that every sequence has equal length.
14.1. Compute the number of sentences where tokenized length is greater than a certain threshold
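For example:

```python
# How many sentences would be truncated at a given maximum length?
for threshold in (64, 96, 128):
    n = (sentences["bert_len"] > threshold).sum()
    print(f"tokenized length > {threshold}: {n} sentences")
```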
Conclusion:
So we can safely set the maximum sequence length (max length) to 128.
14.2 Dropping the unwanted columns and resetting the index
14.3. Let's verify that a random sentence and its corresponding tags are correct:
Peek into data
E. Explore distributions of Labels
This section holds significant importance. Given that our dataset is characterized by both multi-label attributes and an imbalance, it is crucial for us to understand the distribution of each label.
1. Create a new dataframe to explore distributions of labels
2. Set the labels column to contain a list of labels and set the "s_id" column as the index
3. Explode the labels column to have only one label per row
4. Create the dummies for labels column
5. Count the frequency of each label for every sentence
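A consolidated sketch of steps 1 through 5 (names are illustrative):

```python
# One row per (sentence, label), then one-hot encode and count per sentence
label_df = sentences[["labels"]].copy()
label_df["labels"] = label_df["labels"].str.split()
label_df.index.name = "s_id"

exploded = label_df.explode("labels")
dummies = pd.get_dummies(exploded["labels"])
label_counts = dummies.groupby(level="s_id").sum()
label_counts.head()
```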
6. Descriptive Statistics of each label
7. Plot the kernel density plots for each label
8. Frequency of each label
F. Stratified (or Balanced) Split
In statistics, stratified sampling is a method used to ensure that specific subgroups within a population are adequately represented in a sample. This technique is particularly useful when the population is heterogeneous, meaning it consists of diverse groups that may have different characteristics or behaviors. By dividing the population into distinct subpopulations, known as strata, researchers can obtain more accurate and reliable estimates of the overall population parameters.
1. Helper Functions to Create Balanced Split function
2. Splitting the Dataset into training, validation and test dataset
The dataset will be divided into training, validation, and testing subsets while preserving the distribution of labels through stratified sampling.
The initial step involves separating the data into two categories: seen and unseen, with the unseen data designated as the testing set. Subsequently, the seen data will be further divided into training and validation sets.
It is important to consider the seen data as the information that the model is exposed to prior to its deployment in a production environment.
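The original article builds its own balanced-split helper functions; as a sketch of the same idea, the scikit-multilearn package provides iterative stratification for multi-label data (split fractions below are assumptions):

```python
from skmultilearn.model_selection import iterative_train_test_split

X = sentences.index.to_numpy().reshape(-1, 1)     # sentence indices
y = (label_counts.to_numpy() > 0).astype(int)     # binary multi-label matrix

# First carve out the unseen test set, then split the seen data into train / validation
X_seen, y_seen, X_test, y_test = iterative_train_test_split(X, y, test_size=0.15)
X_train, y_train, X_val, y_val = iterative_train_test_split(X_seen, y_seen, test_size=0.15)
```

The returned index arrays can then be used to slice the sentences dataframe into the three subsets.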
G. Explore Distribution of Labels - Training, Validation and Testing Data
We should confirm whether the data has been divided in a stratified manner. To do so we will compute the distribution of each label for every dataset.
Important- We should also confirm that the distribution of combination of labels for each dataset is also maintained.
1. Check distributions of 1st order
Here 1st order means that we are only looking at a single label, not combinations.
Here we can see that the distribution is almost the same for each label across all three datasets.
2. Check distributions of 2nd order
Let us now see the distribution of two labels combined.
Here (0, 3) means the label pair (O, B-per). It has roughly the same value, 0.33, for the training, validation and testing datasets. The value 0.33 implies that 33% of the sentences in each dataset contain at least one O and one B-per label together.
Similarly (0, 0), i.e. (O, O), has entry 0.9999 (approx. 1) for the training data and 1.0 for the validation and testing data, which implies that essentially every sentence has at least two O labels.
3. Check distributions of 3rd order
Here we are considering three labels at a time.
(0, 3, 3) is (O, B-per, B-per) and has roughly the same value, 0.33, for the training, validation and testing datasets. The value 0.33 implies that 33% of the sentences in each dataset contain at least one O and two B-per labels.
H. Modelling
Let us now set the parameters to build the model.
1. Modelling Parameters
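Representative values (MAX_LEN follows from section D.14; the remaining values are illustrative assumptions):

```python
MAX_LEN = 128
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 64
EPOCHS = 3
LEARNING_RATE = 2e-5
ADAM_EPSILON = 1e-8
NUM_LABELS = len(label2id)
```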
2. Preparing the dataset and data-loader
WordPiece tokenization may divide a word into several tokens; therefore, we will assign to each of these tokens the same label as the original word that was split.
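A sketch of such a dataset, assuming each sentence and its tags have been split back into lists of words and labels (positions ignored by the loss are marked with -100):

```python
from torch.utils.data import Dataset, DataLoader

class NERDataset(Dataset):
    """Tokenizes sentences and propagates each word's label to all of its word pieces."""

    def __init__(self, sentences, labels, tokenizer, label2id, max_len=MAX_LEN):
        self.sentences = sentences   # list of lists of words
        self.labels = labels         # list of lists of tags
        self.tokenizer = tokenizer
        self.label2id = label2id
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.sentences[idx],
            is_split_into_words=True,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
        )
        # word_ids() maps every token back to its source word (None for special tokens)
        label_ids = [
            -100 if wid is None else self.label2id[self.labels[idx][wid]]
            for wid in enc.word_ids()
        ]
        return {
            "input_ids": torch.tensor(enc["input_ids"]),
            "attention_mask": torch.tensor(enc["attention_mask"]),
            "labels": torch.tensor(label_ids),
        }
```

A DataLoader then wraps each split, e.g. DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True).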
3. Create Model
BERT stands for Bidirectional Encoder Representations from Transformers. In contrast to earlier models, BERT is engineered to pre-train deep bidirectional representations from unlabeled text by simultaneously considering both left and right context across all layers. Consequently, the pre-trained BERT model can be fine-tuned by adding a single output layer, enabling state-of-the-art models for various tasks, including token classification, question answering and language inference, without significant alterations to the task-specific architecture.
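Creating the token classification model is then a one-liner:

```python
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    id2label=id2label,
    label2id=label2id,
)
model.to(device)
```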
4. Set the Optimizer
Optimizer - AdamW:
AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update.
Learning Rate:
The learning rate is a tuning parameter that controls how much a model's parameters adjust during each iteration of an optimization algorithm. It is a small floating-point number; for fine-tuning BERT, values in the range of 2e-5 to 5e-5 are typical.
Adam Epsilon:
The parameter epsilon shows up in the update step.
It is primarily used as a guard against a near-zero second moment causing a division by zero. If it is too large, it will bias the moment estimate.
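Putting the pieces together (the weight decay value is an illustrative assumption):

```python
# AdamW: Adam with decoupled weight decay
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    eps=ADAM_EPSILON,
    weight_decay=0.01,
)
```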
5. Set the Scheduler
A scheduler adapts the learning rate over the course of the gradient descent optimization procedure, which can increase performance and reduce training time.
In PyTorch, a model is updated by an optimizer and learning rate is a parameter of the optimizer.
A learning rate schedule is an algorithm that updates the learning rate in the optimizer.
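One common choice is a linear schedule with warmup; since this article steps the scheduler once per epoch, the schedule length equals the number of epochs:

```python
from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,           # illustrative: no warmup
    num_training_steps=EPOCHS,    # one scheduler step per epoch (see step 6 below)
)
```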
6. Create the training loop
A typical training loop in PyTorch iterates over the batches for a given number of epochs.
Since we are using a learning-rate scheduler for an adaptive learning rate, we call scheduler.step() to update the learning rate (as per the schedule) after every epoch.
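A sketch of such a loop (gradient clipping is an assumption, not something the original text mandates):

```python
def train(model, loader, optimizer, scheduler, device, epochs):
    model.train()
    epoch_losses = []
    for epoch in range(epochs):
        total_loss = 0.0
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()
            outputs = model(**batch)   # the 'labels' key makes the model return a loss
            outputs.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            total_loss += outputs.loss.item()
        scheduler.step()               # update the learning rate once per epoch
        avg_loss = total_loss / len(loader)
        epoch_losses.append(avg_loss)
        print(f"epoch {epoch + 1}: training loss = {avg_loss:.4f}")
    return epoch_losses
```

The returned per-epoch losses feed the training-loss plot in step 8.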
7. Training Phase
Let us now train the model.
8. Draw the graph for training loss to see its progress after every epoch
I. Evaluation
Let us evaluate the performance of the model on the validation data.
1. Create the Evaluation Function
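A minimal evaluation function in the same style:

```python
def evaluate(model, loader, device):
    model.eval()
    total_loss, true_ids, pred_ids = 0.0, [], []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            total_loss += outputs.loss.item()
            preds = outputs.logits.argmax(dim=-1)
            mask = batch["labels"] != -100   # skip special / padding positions
            true_ids.extend(batch["labels"][mask].tolist())
            pred_ids.extend(preds[mask].tolist())
    return total_loss / len(loader), true_ids, pred_ids
```

The flattened true/predicted id lists can then be fed to, for example, sklearn.metrics.classification_report and confusion_matrix for steps 3 to 5.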
2. Check Performance of model on Validation Dataset
3. Compute the classification report
4. Compute Classification report at label level
5. Confusion Matrix
----------------------------- Model Training and Evaluation Ends Here ---------------------
J. Testing - Test Dataset
Let us now test the performance of the model on the testing dataset.
1. Check Performance of Model on Testing Dataset
Note: the validation loss reported here is actually the testing loss.
2. Compute the full classification report
3. Compute the label-wise classification report
4. Confusion Matrix
-------------------------------- Testing Phase Ends Here ----------------------------------
K. Helper Functions to make predictions
In order to make clean predictions for unseen sentences, we will create the following functions:
1. create_input_ids - Function to create Input Ids for a given sentence
2. make_raw_prediction - Function to compute raw predictions
3. make_prediction - Function to compute CLEAN predictions
And then we will check the final prediction on the test sentence that we used in Testing Phase defined above.
1. Create a function to produce input id sample for a sentence
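A sketch, reusing the tokenizer set up earlier:

```python
def create_input_ids(sentence, tokenizer, max_len=MAX_LEN):
    # Accept either a raw string or a pre-split list of words
    words = sentence.split() if isinstance(sentence, str) else sentence
    return tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=max_len,
        return_tensors="pt",
    )
```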
2. Create a function to produce raw prediction for a sentence
This function computes the prediction and then clubs tokens starting with ## back into a single token while maintaining the label.
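One possible implementation:

```python
def make_raw_prediction(sentence, model, tokenizer, id2label, device, max_len=MAX_LEN):
    enc = {k: v.to(device) for k, v in create_input_ids(sentence, tokenizer, max_len).items()}
    with torch.no_grad():
        logits = model(**enc).logits[0]
    pred_ids = logits.argmax(dim=-1).tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    pairs = []
    for tok, pid in zip(tokens, pred_ids):
        if tok in ("[CLS]", "[SEP]", "[PAD]"):
            continue                   # drop special and padding tokens
        if tok.startswith("##") and pairs:
            # Glue the word piece back onto the previous token, keeping its label
            pairs[-1] = (pairs[-1][0] + tok[2:], pairs[-1][1])
        else:
            pairs.append((tok, id2label[pid]))
    return pairs
```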
3. Processing Raw Prediction
As we can see in the raw output, the total number of tokens can be greater than the number of words in the test sentence. We also need to ensure that individual word pieces are merged back into single words (as present in the original test sentence), so that the number of clean tokens equals the number of words. Hence we will create a function to achieve this.
4. Create a function to compute clean prediction for a sentence
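A sketch that aligns predictions back to the original words via the tokenizer's word_ids(), using the first word piece's prediction for each word (as noted in the Inference section below):

```python
def make_prediction(sentence, model, tokenizer, id2label, device, max_len=MAX_LEN):
    words = sentence.split() if isinstance(sentence, str) else sentence
    enc = create_input_ids(words, tokenizer, max_len)
    inputs = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        pred_ids = model(**inputs).logits[0].argmax(dim=-1).tolist()
    # Use the prediction of the first word piece of every original word
    labels, seen = [], set()
    for pos, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            labels.append(id2label[pred_ids[pos]])
    return list(zip(words, labels))
```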
L. Inference
The most advantageous aspect is the ability to quickly evaluate the model on new, previously unseen sentences. In this context, we use the prediction of the first word piece for each word.
Note: you could also use the label predicted for the majority of a word's sub-tokens.
Alternate Way - Giving Input as List
Here we will give the input sentence as a list of words.
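For example, reusing the sample sentence from earlier:

```python
# Raw string input
print(make_prediction(
    "Samsung was founded in 1938 by Lee Byung-chul as a trading company.",
    model, tokenizer, id2label, device,
))

# Pre-split list input
print(make_prediction(
    ["Samsung", "was", "founded", "in", "1938"],
    model, tokenizer, id2label, device,
))
```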
M. Saving and Loading Utilities for the Model and Tokenizer
In order to serve the model, for example via a REST API built with FastAPI, we save the trained model so that we can reuse it later.
Let's check the loaded model and tokenizer.
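A sketch of the save/load round trip (the directory name is an assumption):

```python
# Save the fine-tuned model and tokenizer
model.save_pretrained("bert-ner-model")
tokenizer.save_pretrained("bert-ner-model")

# Load them back later, e.g. inside a FastAPI service
loaded_model = BertForTokenClassification.from_pretrained("bert-ner-model").to(device)
loaded_tokenizer = BertTokenizerFast.from_pretrained("bert-ner-model")
```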
Again, with our loaded model and tokenizer, the results are very promising.
Thank you for taking the time to read this article. I hope you have enjoyed the code.
The link to complete Jupyter Notebook for BERT (Fine-tuning) can be found here:
To be continued in Part 5, where we will explore fine-tuning BERT for sequence classification in detail.
References:
Stratified Sampling - On the Stratification of Multi-label data