Fine-Tuning BERT - Named Entity Recognition - Bidirectional Encoder Representations from Transformers - Part 4

In this article we will delve into the implementation of fine-tuning BERT with PyTorch through code.

Before you move ahead, it is advisable to read Fundamentals of BERT - Part 1 and Fundamentals of BERT - Part 2 to get an idea of the core principles of BERT. Further, you can read Building BERT from Scratch - Part 3 to get an idea of the pre-training phase.

To fine-tune BERT, we will solve a concrete problem: Named Entity Recognition.

The objective of Named Entity Recognition (NER) is to identify and categorize named entities within a given sequence. These named entities fall into pre-established categories that are selected based on the specific use case, including but not limited to names of individuals, organizations, locations, codes, temporal expressions, and monetary amounts. In essence, NER seeks to designate a class to each token, which is typically a single word, within the sequence. Consequently, NER is often described as a form of token classification.

To fine-tune BERT for Named Entity Recognition (NER), we will adopt a systematic approach that leverages the strengths of the BERT architecture while tailoring it to identify and classify named entities within text.

To achieve this, we will use a Kaggle dataset curated for NER tasks. It contains annotated text samples in which named entities are labeled according to their respective categories, so the model is trained on examples that reflect real-world language use.

We will use the Kaggle dataset: Name Entity Recognition (NER) Dataset

Source - Kaggle Dataset

Download Link - https://www.kaggle.com/datasets/debasisdotcom/name-entity-recognition-ner-dataset?select=NER+dataset.csv

About Kaggle Dataset

Context

This is a very clean dataset and is for anyone who wants to try his/her hand on the NER ( Named Entity recognition ) task of NLP.

Content

The dataset, with 1M x 4 dimensions, contains the columns ['Sentence #', 'Word', 'POS', 'Tag'] and is grouped by 'Sentence #'.

Columns

Word: This column contains English dictionary words from the sentence they are taken from.

POS: Parts of speech tag

Tag: Standard named entity recognition tags as follows:

'O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat'

There are 8 category tags, each with a "beginning" (B) and "inside" (I) variant, and the "outside" (O) tag.

It is not really clear what these tags mean: "geo" probably stands for geographical entity, "gpe" for geopolitical entity, "per" for person, "tim" for time, and so on.

Example

Consider the following sample sentence:

Samsung was founded in 1938 by Lee Byung-chul as a trading company.

The NER tag associated with each word is as follows:

  • 'Samsung' - B-org
  • 'was' - O
  • 'founded' - O
  • 'in' - O
  • '1938' - B-tim
  • 'by' - O
  • 'Lee' - B-per
  • 'Byung-chul' - I-per
  • 'as' - O
  • 'a' - O
  • 'trading' - O
  • 'company' - O
  • '.' - O

Note: We only need Sentence, Word and Tag columns for Named Entity Recognition

What is NER?

NER, or Named Entity Recognition, is a component of Natural Language Processing (NLP) that focuses on identifying and categorizing named entities within a given text.

What does it do?

NER locates and classifies named entities in text, such as people, places, organizations, time expressions, and quantities.

How does it work?

Named Entity Recognition (NER) detects specific words or phrases within a document and subsequently categorizes them into established classifications.

Why is it useful?

Named Entity Recognition (NER) enhances document translation by offering contextual insights that aid in comprehending sentences. For instance, in the sentence "Apple released a new iPhone today," NER can recognize "Apple" as an organization.

Named Entity Recognition (NER) can be utilized in various practical scenarios where the analysis of extensive text data proves beneficial. Some applications of NER include:

  • Enhancing customer support by organizing and prioritizing user inquiries, complaints, and questions, thereby providing businesses with valuable insights into their clientele.
  • Assisting in the categorization of applicants' resumes, thereby expediting the recruitment process.
  • Enhancing search and recommendation systems through the identification of recognized entities.
  • Facilitating the search and extraction of pertinent information from documents and blog articles.

Modeling Strategy

In the code, we will utilize BertForTokenClassification, a model provided by the Transformers library from HuggingFace. This model is built on the BERT architecture and features a token classification head, enabling it to perform predictions at the token level instead of the sequence level.

Named entity recognition is generally approached as a token classification task, which is the approach our implementation takes.

This article explores the concept of transfer learning, which involves initially pre-training a large neural network in an unsupervised manner, followed by fine-tuning the network for a specific task. In this instance, BERT serves as the pre-trained neural network, having been trained on two tasks: masked language modeling and next sentence prediction.

Fine-tuning involves supervised learning, which indicates that a labeled dataset is required.

Note:

  1. This article presupposes that the reader possesses a fundamental understanding of several key concepts and technologies that are essential for effectively engaging with the content presented herein. Specifically, it assumes familiarity with deep learning, which is a subset of machine learning that focuses on algorithms inspired by the structure and function of the brain, particularly artificial neural networks.
  2. Additionally, the reader should have a working knowledge of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing model developed by Google. BERT is designed to understand the context of words in a sentence by considering the words that come before and after them, making it particularly effective for tasks such as text classification, question answering, and language inference.
  3. Furthermore, proficiency in the PyTorch framework is also assumed. PyTorch is an open-source machine learning library widely used for applications in deep learning and artificial intelligence. It provides a flexible and dynamic computational graph, which allows for easy experimentation and debugging, making it a popular choice among researchers and practitioners in the field.

In summary, this article is intended for individuals who are already well-versed in these foundational topics, as it will build upon this knowledge to explore more advanced concepts and applications related to deep learning, BERT, and the use of PyTorch for implementing various models and techniques.

This article is divided into the following sections:

A. Setting Environment

B. Set the Device (cpu or mps)

C. Reading Dataset

D. Pre-Processing

E. Explore Distribution of Labels

F. Stratified Splitting

G. Explore Distribution of Labels - Training, Validation and Testing Dataset

H. Modeling (includes training Loop)

I. Evaluation

J. Testing

K. Helper Function to make clean Predictions

L. Inference

M. Saving and Loading Utilities for Model

A. Setting up the environment and importing Python libraries

Please install the following libraries to set up the environment:

  • pandas
  • numpy
  • scikit-learn (imported as sklearn)
  • PyTorch (imported as torch)
  • transformers
  • matplotlib
  • seaborn
  • scikit-multilearn

Setting up the environment

B. Set Device - Check if MPS (Metal Performance Shaders) is available

PyTorch's Metal Performance Shaders (MPS) backend provides GPU training acceleration on Mac. It extends the PyTorch framework so that operations can be set up and run on the Mac's GPU. If MPS is not available, we fall back to the CPU.

Function to set device
Get the device to be used
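The device check described above could look roughly like the sketch below. The function name `get_device` is illustrative, and the CUDA fallback is an addition that the article (which targets cpu or mps) does not mention.

```python
import torch


def get_device() -> torch.device:
    """Pick the best available device: Apple MPS first, then CUDA, then CPU."""
    # torch.backends.mps is available in PyTorch 1.12 and later.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    # Assumption: fall back to CUDA when present (not part of the original article).
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = get_device()
print(f"Using device: {device}")
```

The model and every batch of tensors then need to be moved to this device with `.to(device)` before training.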

C. Reading Dataset

Load the NER dataset provided from a specified path into a pandas dataframe.

Reading Dataset

Peek into raw data:

Snapshot of Raw Data

Let us now explore the pre-processing steps.

D. Preprocessing Data

1. Replacing the NaN values in 'Sentence #'

The sentence number column, 'Sentence #', is populated only on the first word of each sentence; the remaining rows are NaN. We will assign the sentence number to every word using the forward fill method.

Forward Fill

Peek into transformed dataset.

Every word has sentence number associated with it
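As a minimal sketch of the forward fill step, the snippet below uses a tiny inline sample laid out like the Kaggle CSV instead of the real file. The sample rows are illustrative, and the real file may need an explicit `encoding` argument in `pd.read_csv`.

```python
import io

import pandas as pd

# Tiny inline sample in the same layout as the Kaggle CSV (values are illustrative).
raw_csv = io.StringIO(
    "Sentence #,Word,POS,Tag\n"
    "Sentence: 1,Samsung,NNP,B-org\n"
    ",was,VBD,O\n"
    ",founded,VBN,O\n"
    "Sentence: 2,Lee,NNP,B-per\n"
    ",Byung-chul,NNP,I-per\n"
)
df = pd.read_csv(raw_csv)

# Empty 'Sentence #' cells are read as NaN; forward fill copies the last seen
# sentence number down to every following word of that sentence.
df["Sentence #"] = df["Sentence #"].ffill()
print(df)
```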

1.1 Check the shape

Shape of dataframe

2. Drop the unwanted 'POS' column

We don't need POS column hence we will drop it.

Remaining columns

3. Remove all rows that contain NaN

We need to remove all rows that have a NaN entry.

Remove NaN entries

4. Verify if there are any NaN values in Word and Tag Columns

Let us verify that we have successfully removed all NaN entries.

4.1 Tag columns

No NaN entry in Tag Column

4.2 Word columns

No NaN entry in Word Column

5. Checking Improper Words

We need to remove all words that contain escaped characters such as \x94 and \x85.

These words appear blank but still have a tag mapped to them, so they can cause a mapping problem during tokenization.

5.1 Helper Function

A function to check whether a word contains escaped unicode characters. We will remove every word that has a length less than or equal to 2 and contains an escaped unicode character.

Helper Function to get proper words
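The original helper is not reproduced here, so the following is only a sketch of the rule stated above. It assumes that "escaped unicode character" means a Unicode control character (category "Cc", which covers \x94 and \x85); the name `is_proper_word` mirrors the column name used in the next step.

```python
import unicodedata


def is_proper_word(word: str) -> bool:
    """Return False for short words (length <= 2) that contain a control character."""
    # Assumption: "escaped" characters are Unicode control characters (category Cc).
    has_control_char = any(unicodedata.category(ch) == "Cc" for ch in word)
    return not (len(word) <= 2 and has_control_char)


words = ["Samsung", "\x94", "in", "\x85", "."]
flags = [is_proper_word(w) for w in words]
print(flags)  # [True, False, True, False, True]
```

On a DataFrame, the same function could be applied with `df["Word"].apply(is_proper_word)` to build the boolean column.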

5.2 Create a bool column - 'is_proper_word'

Bool Column indicating if a word is proper (free from escaped characters)

5.3 Check the data which has improper words

Get the improper words
The total number of

5.4 Retain the data which has only proper words

Keep the proper words only

5.5 Drop the newly created column "is_proper_word"

Drop the "is_proper_word" column

6. Frequency of each tag

Compute the frequency of each tag.

Frequency of each Tag (or Label)

7. Count the frequencies of sub-tags in decreasing order

We have sub-tags such as geo, org, eve, per, etc. Let's compute the frequency of each sub-tag in decreasing order.

Sub-Tag Frequency

8. Dealing with Tags that could potentially act as Noise

From the previous section, we can see that the sub-tags art, eve and nat are relatively rare. Hence the tags "B-art", "I-art", "B-eve", "I-eve", "B-nat" and "I-nat" could potentially act as noise. The tag I-gpe also has a relatively low count, so it could act as noise as well. Therefore we will replace all tags that have the potential to act as noise.

Labels to Remove

8.1. Convert such subtags to "O" which is most frequent

For every tag that is removed, we will substitute it with "O" to keep the interaction information of the associated word with the other words.

Substitute the tags to be removed by "O"
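A minimal sketch of this substitution on a toy DataFrame is shown below. The list of tags comes from the text above; the variable name `labels_to_remove` and the sample rows are illustrative.

```python
import pandas as pd

labels_to_remove = ["B-art", "I-art", "B-eve", "I-eve", "B-nat", "I-nat", "I-gpe"]

df = pd.DataFrame({
    "Word": ["World", "War", "II", "ended", "in", "1945"],
    "Tag": ["B-eve", "I-eve", "I-eve", "O", "O", "B-tim"],
})

# Relabel rare tags as "O"; the words themselves stay in the sentence,
# so the context they provide to neighbouring words is preserved.
df["Tag"] = df["Tag"].where(~df["Tag"].isin(labels_to_remove), "O")
print(df["Tag"].tolist())  # ['O', 'O', 'O', 'O', 'O', 'B-tim']
```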

8.2 Compute updated frequency of each tags

Frequency of remaining tag

9. Compute label Maps

Create two maps:

  1. label2id - maps string labels to its integer index
  2. id2label - maps integer index of labels to strings

Label Maps - id to string and string to id
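The two maps can be built with a pair of dictionary comprehensions, as sketched below. The tag order is an assumption: it is the order of the original tag list with the removed tags dropped, which is consistent with the ids used later in the article (0 for O, 3 for B-per).

```python
# Remaining tags after the noisy ones are mapped to "O" (order is an assumption).
tags = ["O", "B-geo", "B-gpe", "B-per", "I-geo", "B-org", "I-org", "B-tim", "I-per", "I-tim"]

label2id = {label: idx for idx, label in enumerate(tags)}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id["B-per"])  # 3
print(id2label[0])        # O
```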

10. Create complete sentences with their labels as one string

Snapshot of transformed data
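One way to collapse the word-level rows into one row per sentence is a groupby with string joins, sketched below. The output column names `sentence` and `labels` and the comma separator for labels are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence #": ["Sentence: 1"] * 3 + ["Sentence: 2"] * 2,
    "Word": ["Samsung", "was", "founded", "Lee", "Byung-chul"],
    "Tag": ["B-org", "O", "O", "B-per", "I-per"],
})

# Join the words with spaces and the tags with commas, one row per sentence.
sentences = (
    df.groupby("Sentence #", sort=False)
      .agg(sentence=("Word", " ".join), labels=("Tag", ",".join))
      .reset_index()
)
print(sentences)
```

Keeping words and labels as aligned strings makes it easy to drop duplicate sentence and label pairs in the next step.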

10.1 Drop duplicate sentences + labels and reset index

Unique Sentences

11. Remove the sentences that contain only O tags and don't contain any other target labels

Removing sentences that do not contain target entities

12. Create additional columns to get an idea of sentence length, bert tokens length

First we set the tokenizer to be used, which is the bert-base-uncased tokenizer. We then look at the number of words in each sentence, the number of tokens produced for each sentence after tokenization, and the difference between them.

BERT uses the WordPiece algorithm to tokenize a sentence. WordPiece is a subword-based tokenization algorithm, so it can split a word into multiple tokens. Hence the number of tokens is always greater than or equal to the number of words in a sentence.

Length of Sentence and Tokenized Sentence

peek into data
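To illustrate why the token count can exceed the word count, here is a toy version of WordPiece's greedy longest-match-first splitting. The vocabulary `toy_vocab` is made up for this example; the real bert-base-uncased tokenizer uses a vocabulary of roughly 30,000 entries and also handles punctuation and casing.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"sam", "##sung", "was", "found", "##ed", "in"}
sentence = "samsung was founded in"
tokens = [t for w in sentence.split() for t in wordpiece(w, toy_vocab)]
print(tokens)  # ['sam', '##sung', 'was', 'found', '##ed', 'in']
```

Here four words become six tokens, which is the kind of difference the histograms below summarize.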

13. Let's see the histograms of tokenized sentence length and difference of length

13.1. Histogram of difference in sentence length and label length (# of label)

Computing Histogram data
Histogram of difference between number of words and number of labels

13.2. Histogram of length of tokenized sentence (# of tokens created by tokenizer)

Computing Histogram data
Histogram of length of tokenized sentence

13.3 Histogram of difference in length between sentence length and bert tokenized sentence length

Computing Histogram data
Histogram of difference between number of words and number of tokens

14. Setting meaningful Max Len

We need to set a maximum sequence length so that every sequence has equal length.

14.1. Compute the number of sentences where tokenized length is greater than a certain threshold

Only 4 sentences have length > 78
Only 3 sentences have length > 80
Only 1 sentence has length > 128

Conclusion:

  1. Only 1 sentence has a tokenized length greater than 128.
  2. The sentence with the largest difference in length, 48, is also present.

So we can safely keep the max sequence length or max length to be 128

14.2 Dropping the unwanted columns and resetting the index

DataFrame of Clean Sentences along with their labels

14.3. Let's verify that a random sentence and its corresponding tags are correct:

Verifying a random sentence and its labels

Peek into data

Data Snapshot of clean data

E. Explore distributions of Labels

This section is important. Because our dataset is both multi-label (each sentence carries several labels) and imbalanced, it is crucial to understand the distribution of each label.

1. Create a new dataframe to explore distributions of labels

Create a new dataframe - balance dataframe (bal_df)

2. Set the label column to contain list of labels and "s_id" column as index

Set the index to be sentence id and convert string of labels into list of labels

3. Explode the labels column to have only one label per row

Snapshot of transformed balance dataframe with one label per row

4. Create the dummies for labels column

Create dummies for each label

5. Count the frequency of each label for every sentence

Count the number of each label for every sentence
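Steps 2 to 5 can be sketched on a two-sentence toy frame as follows. The names `bal_df` and `s_id` follow the captions above; the column name `labels`, the comma separator, and the sample rows are assumptions.

```python
import pandas as pd

bal_df = pd.DataFrame({
    "s_id": [0, 1],
    "labels": ["B-org,O,O", "B-per,I-per,O"],
}).set_index("s_id")

# Step 2: turn the label string into a list of labels.
bal_df["labels"] = bal_df["labels"].str.split(",")
# Step 3: one label per row (the s_id index is repeated).
exploded = bal_df.explode("labels")
# Step 4: one indicator column per label.
dummies = pd.get_dummies(exploded["labels"], dtype=int)
# Step 5: sum the indicators per sentence to get label counts.
label_counts = dummies.groupby(level=0).sum()
print(label_counts)
```

Calling `label_counts.describe()` on the result gives per-label descriptive statistics of the kind shown in the next step.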

6. Descriptive Statistics of each label

Descriptive Statistics of the balance dataframe

7. Plot the kernel density plots for each label

Compute the kernel density
Violin Plot displaying Kernel Density Information

8. Frequency of each label

Updated Frequency of each label

F. Stratified (or Balanced) Split

In statistics, stratified sampling is a method used to ensure that specific subgroups within a population are adequately represented in a sample. This technique is particularly useful when the population is heterogeneous, meaning it consists of diverse groups that may have different characteristics or behaviors. By dividing the population into distinct subpopulations, known as strata, researchers can obtain more accurate and reliable estimates of the overall population parameters.

1. Helper Functions to Create Balanced Split function

Helper Function

2. Splitting the Dataset into training, validation and test dataset

The dataset will be divided into training, validation, and testing subsets while preserving the distribution of labels through stratified sampling.

The initial step involves separating the data into two categories: seen and unseen, with the unseen data designated as the testing set. Subsequently, the seen data will be further divided into training and validation sets.

It is important to consider the seen data as the information that the model is exposed to prior to its deployment in a production environment.

Flowchart for Splitting of Dataset
All Data = Seen Data + Unseen Data/ Test Data
Seen Data = Training Data + Validation Data
All Data = Training + Validation + Testing
Reset the indices

G. Explore Distribution of Labels - Training, Validation and Testing Data

We should confirm whether the data has been divided in a stratified manner. To do so we will compute the distribution of each label for every dataset.

Important- We should also confirm that the distribution of combination of labels for each dataset is also maintained.

Compute one hot encoding for unique labels present in a sentence

1. Check distributions of 1st order

Here 1st order means that we are only looking at a single label and not at combinations.

Compute the distribution of single label
Proportion of each label for every dataset
Proportion dataframe
Distribution of labels across different datasets.

Here we can see that the distribution is almost the same for each label in all three datasets.

2. Check distributions of 2nd order

Let us now see the distribution of two labels combined.

Distribution of higher 2nd order

Here (0,3) means the label pair (O, B-per). It has roughly the same value, 0.33, for the training, validation and testing datasets. The value 0.33 implies that 33% of the sentences in each dataset contain at least one O and one B-per label together.

Similarly, (0,0), i.e. (O, O), has a value of 0.9999 (approximately 1) for the training data and 1.0 for both the validation and testing data, which implies that nearly all sentences have at least two O labels.

3. Check distributions of 3rd order

Distribution of higher 3rd order

Here we are considering three labels at a time.

(0, 3, 3) is (O, B-per, B-per) and has roughly the same value, 0.33, for the training, validation and testing datasets. The value 0.33 implies that 33% of the sentences in each dataset contain at least one O and two B-per labels.
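The way these higher-order proportions are read can be made concrete with a small sketch. The function `combo_proportion` is a hypothetical helper, not the article's code: it counts a sentence as a hit when it contains every label in the combination at least as many times as the label appears in it.

```python
from collections import Counter


def combo_proportion(sentences, combo):
    """Fraction of sentences containing every label in `combo`, counted with multiplicity."""
    need = Counter(combo)
    hits = sum(
        all(Counter(labels)[lab] >= n for lab, n in need.items())
        for labels in sentences
    )
    return hits / len(sentences)


# Label ids follow the article: 0 is "O", 3 is "B-per". The sentences are made up.
train = [[0, 0, 3], [0, 3, 3, 0], [0, 0, 1], [0, 5]]
print(combo_proportion(train, (0, 3)))     # 0.5  -> at least one O and one B-per
print(combo_proportion(train, (0, 0)))     # 0.75 -> at least two O labels
print(combo_proportion(train, (0, 3, 3)))  # 0.25 -> at least one O and two B-per
```

Computing the same proportions on the training, validation and testing sets and comparing them side by side is what confirms that the split preserved label combinations.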

H. Modelling

Let us now set the parameters to build the model.

1. Modelling Parameters

2. Preparing the dataset and data-loader

WordPiece tokenization may divide a word into several tokens; therefore, we will assign each of these tokens the same label as the original word that was split.

Tokenization of one word at a time
NERDataset class partial content
Remaining NERDataset class content
Create custom NERDataset for training, validation and testing data
Create the custom DataLoader for training, validation and testing data
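The label propagation rule can be sketched independently of the dataset class. In the snippet below, `toy_tokenize` is a stand-in for the BERT tokenizer (it simply cuts words into chunks of up to four characters), and `tokenize_with_labels` is a hypothetical name; in the real pipeline the tokenizer's own `tokenize` method would be passed in instead.

```python
def toy_tokenize(word):
    """Stand-in for WordPiece: chunks of up to 4 characters, '##' on continuations."""
    chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def tokenize_with_labels(words, labels, tokenize=toy_tokenize):
    """Tokenize one word at a time and repeat the word's label for each piece."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


words = ["Samsung", "was", "founded"]
labels = ["B-org", "O", "O"]
tokens, token_labels = tokenize_with_labels(words, labels)
print(tokens)        # ['Sams', '##ung', 'was', 'foun', '##ded']
print(token_labels)  # ['B-org', 'B-org', 'O', 'O', 'O']
```

Tokenizing word by word keeps tokens and labels the same length, which is what the padded `input_ids` and label tensors in the dataset class rely on.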

3. Create Model

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model. In contrast to earlier models, it is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Consequently, the pre-trained BERT model can be fine-tuned by adding a single output layer to create strong models for a wide range of tasks, including question answering and language inference, without substantial task-specific architecture changes. For NER, that output layer is the token classification head of BertForTokenClassification.

4. Set the Optimizer

Optimizer - AdamW:

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update.

Learning Rate:

The learning rate is a tuning parameter that controls how much a model's parameters change during each iteration of an optimization algorithm. It is a small positive floating point number. Values between 0.01 and 0.1 are often quoted as a general starting range, but fine-tuning BERT typically uses much smaller values, on the order of 2e-5 to 5e-5.

Adam Epsilon:

The parameter epsilon shows up in the update step.

theta update equation

It is primarily used as a guard against a zero second moment causing a division by zero. If it is too large, it will bias the moment estimation.
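For reference, the standard AdamW update is written out below, since the equation image from the original article is not available here. The symbols are the usual ones and are assumptions with respect to the article's notation: g_t is the gradient, \eta the learning rate, \lambda the weight decay coefficient, and \beta_1, \beta_2 the moment decay rates.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
```

The epsilon term sits in the denominator next to the square root of the second moment estimate, which is why it protects against division by zero when that estimate is zero.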

Set learning rate and epsilon

5. Set the Scheduler

A scheduler adjusts the learning rate over the course of training, which can improve performance and reduce training time.

In PyTorch, a model is updated by an optimizer and learning rate is a parameter of the optimizer.

Learning rate schedule is an algorithm to update the learning rate in an optimizer.

Learning Scheduler

6. Create the training loop

A typical training loop in PyTorch iterates over the batches for a given number of epochs.

  • In each batch iteration, we first compute the forward pass to obtain the neural network outputs.
  • Then, we reset the gradient from the previous iteration and perform backpropagation to obtain the gradient of the loss with respect to the model weights.
  • Finally, we update the weights based on the loss gradients using the optimizer (AdamW in this article).

Since we are using learning scheduler for the adaptive learning rate, we call scheduler.step() to update the learning rate (as per the learning scheduler) after every epoch.

Function to train the model - 1st part
Function to train the model - 2nd part
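The loop structure described above can be sketched with toy stand-ins so that it runs in seconds. A small linear classifier and random tensors replace BERT and the NER batches, and `StepLR` is an assumed scheduler chosen only because it is stepped once per epoch; the article's actual model, data, hyperparameters and scheduler may differ.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins: a linear classifier and random data replace BERT and the NER batches.
model = nn.Linear(8, 3)
batches = [(torch.randn(4, 8), torch.randint(0, 3, (4,))) for _ in range(5)]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, eps=1e-8)
# Assumption: StepLR halves the learning rate after every epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(3):
    model.train()
    running = 0.0
    for inputs, targets in batches:
        logits = model(inputs)           # forward pass
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()            # reset gradients from the previous iteration
        loss.backward()                  # backpropagation
        optimizer.step()                 # weight update
        running += loss.item()
    scheduler.step()                     # update the learning rate once per epoch
    epoch_losses.append(running / len(batches))
    print(f"epoch {epoch + 1}: loss={epoch_losses[-1]:.4f}, "
          f"lr={scheduler.get_last_lr()[0]:.2e}")
```

With BertForTokenClassification, the forward pass would instead take `input_ids`, `attention_mask` and `labels`, and the model output would already contain the loss.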

7. Training Phase

Let us now train the model.

Training intermediate results
Training full results for every epoch

8. Draw the graph for training loss to see its progress after every epoch

Training Loss
Plot Training Loss for every Epoch

I. Evaluation

Let us evaluate the performance of model on validation data.

1. Create the Evaluation Function

Function to evaluate the Model - Part 1
Function to evaluate the Model - Part 2
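The evaluation function itself is not reproduced here. As a rough sketch of one common way to compute token-level accuracy, the helper below ignores padding positions by using the attention mask; whether the article's function masks padding in the same way is an assumption, and `token_accuracy` is a hypothetical name.

```python
def token_accuracy(predictions, targets, attention_mask):
    """Accuracy over tokens whose attention mask is 1; padding positions are ignored."""
    correct = total = 0
    for pred_row, true_row, mask_row in zip(predictions, targets, attention_mask):
        for pred, true, keep in zip(pred_row, true_row, mask_row):
            if keep:
                total += 1
                correct += int(pred == true)
    return correct / total if total else 0.0


# Made-up label ids for two padded sequences of length 5.
preds = [[0, 3, 8, 0, 0], [5, 0, 0, 0, 0]]
golds = [[0, 3, 8, 0, 1], [5, 6, 0, 0, 0]]
mask = [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]

acc = token_accuracy(preds, golds, mask)
print(f"accuracy = {acc:.4f}")
```

Because "O" dominates the labels, accuracy alone can look high, which is why the classification report and confusion matrices below are also computed.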

2. Check Performance of model on Validation Dataset

Validation Accuracy is 94.16 %

3. Compute the classification report

Function to compute classification report
Classification Report for Validation Dataset

4. Compute Classification report at label level

Classification Report For each Label

5. Confusion Matrix

Compute Confusion Matrix
Confusion Matrix - Validation Data
Normalized Confusion Matrix
Confusion Matrix - Validation Data
Compute confusion matrix for each label
Confusion Matrix for each label

----------------------------- Model Training and Evaluation Ends Here ---------------------

J. Testing - Test Dataset

Let us now test the performance of model on Testing Dataset.

1. Check Performance of Model on Testing Dataset

Note: Here Validation loss is actually testing loss

Validation loss for testing data - partial result
Validation loss for testing data - full result

3. Compute FULL classification report

Classification Report for Testing Data

4. LABEL-WISE Classification report

Classification Report of each label for testing data

5. Confusion Matrix

Compute Multi-label Confusion Matrix
Multi-label Confusion Matrix for Testing Data
Normalize Multi-label Confusion Matrix for Testing Data
Multi-label Confusion Matrix for Testing Data
Compute confusion matrix for each label
Confusion Matrix for each label - Testing Data

-------------------------------- Testing Phase Ends Here ----------------------------------

K. Helper Functions to make predictions

In order to make clean predictions for unseen sentences, we will create the following functions:

1. create_input_ids - Function to create Input Ids for a given sentence

2. make_raw_prediction - Function to compute raw predictions

3. make_prediction - Function to compute CLEAN predictions

And then we will check the final prediction on the test sentence that we used in Testing Phase defined above.

1. Create a function to produce input id sample for a sentence

Function to create input ids

2. Create a function to produce raw prediction for a sentence

This function will compute the prediction and then will club tokens starting with ## into single token by maintaining the label.

Function to produce raw prediction

3. Processing Raw Prediction

As we can see in the raw output, the total number of tokens can be greater than the number of words in the test sentence. We also need to ensure that individual word pieces are merged back into single words (as present in the original test sentence) so that the number of clean tokens equals the number of words in the sentence. Hence we will create a function to achieve this.

Function to process raw prediction
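A minimal sketch of the merging step is shown below. It joins every piece that starts with ## onto the preceding token and keeps the label of the first piece; `merge_word_pieces` is a hypothetical name, and the tokens and labels are made up.

```python
def merge_word_pieces(tokens, labels):
    """Join '##' continuation pieces onto the preceding token, keeping the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            words[-1] += token[2:]   # strip the ## prefix and append to the current word
        else:
            words.append(token)      # a new word starts here
            word_labels.append(label)
    return words, word_labels


tokens = ["sam", "##sung", "was", "found", "##ed"]
labels = ["B-org", "I-org", "O", "O", "O"]
words, word_labels = merge_word_pieces(tokens, labels)
print(words)        # ['samsung', 'was', 'founded']
print(word_labels)  # ['B-org', 'O', 'O']
```

Note that the BERT tokenizer also splits on punctuation without adding a ## prefix, so a hyphenated word may still need extra handling to line up with the original sentence.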

4. Create a function to compute clean prediction for a sentence

Function to make clean end-to-end prediction

L. Inference

The most advantageous aspect is the ability to swiftly evaluate the model using novel, previously un-encountered sentences. In this context, we utilize the prediction of the initial word piece for each word.

Note: You can also use the label predicted for the majority of a word's subwords.

Input Sentence
Results of the Model

Alternate Way - Giving Input as List

Here we will give input sentence as list of words

Results of the Model

M. Saving And loading Utilities for the Model and Tokenizer

To serve the model through a REST API framework such as FastAPI, we save the trained model so that we can reuse it later.

  1. Function to save and load the model & tokenizer

Function to save the model and tokenizer
Save the trained model
Save the tokenizer
Load the saved model and saved tokenizer

Let's check the loaded model and tokenizer

Prediction via loaded model
Results via loaded Model

Again, with our loaded model and tokenizer, the results are very promising.


Thank you for taking the time to read this article. I hope you have enjoyed the code.

The link to complete Jupyter Notebook for BERT (Fine-tuning) can be found here:

BERT-Named Entity Recognition Jupyter Notebook


To be continued in Part-5, where we will explore another fine-tuning of BERT via sequence classification in detail.


References:

Dataset - Name Entity Recognition (NER) Dataset (KAGGLE)

Stratified Sampling - On the Stratification of Multi-label data
