BERT for Topic Modeling - Bidirectional Encoder Representations from Transformers - Part 5
In this article we will walk through fine-tuning BERT for Topic Modeling using PyTorch and the Hugging Face Transformers library.
Before you move ahead, it is advisable to read Fundamentals of BERT - Part 1 and Fundamentals of BERT - Part 2 to get an idea of the core principles of BERT. Further, you can read Building BERT from Scratch - Part 3 to get an idea of the pre-training phase.
In order to fine-tune BERT, we will solve a concrete problem: Topic Modeling.
Topic Modeling can be framed as a sequence classification problem, which may be binary, multi-class, or multi-label. In this article we will focus on the multi-label sequence classification setting.
In multi-label classification, each instance (or data point) can be associated with multiple labels. For example, in text categorization a single document might be tagged with several topics (e.g., "sports," "health," "politics"). This contrasts with traditional single-label classification, where each instance is assigned exactly one label.
Topic Modeling for Research Articles
Topic Modeling for Research Articles is a kind of multi-label sequence classification, where each instance is a text sequence.
In the digital age, researchers are inundated with a vast array of scientific literature available through numerous online repositories. This wealth of information, while beneficial, has also made the task of locating relevant articles increasingly complex and time-consuming. As the volume of published research continues to grow exponentially, traditional search methods often fall short in efficiently connecting researchers with the specific information they need. To address this challenge, innovative techniques such as tagging and topic modeling have emerged as effective solutions for organizing and retrieving research articles.
The Need for Enhanced Search Capabilities
The sheer volume of research articles available online can overwhelm even the most diligent researcher. With thousands of new papers published daily across various disciplines, the ability to quickly identify pertinent studies is crucial. Researchers often rely on keywords, abstracts, and titles to filter through this vast sea of information. However, these methods can be limited by the variability in terminology, the specificity of search queries, and the potential for missing relevant articles that may not contain the exact keywords being searched for.
What is Topic Modeling?
Topic modeling is a statistical technique used to uncover the underlying themes or topics present within a collection of documents. By analyzing the text of research articles—specifically the abstract and title—topic modeling algorithms can identify patterns and group articles based on shared themes. This process involves the use of natural language processing (NLP) and machine learning techniques to extract meaningful insights from unstructured text data.
Implementation of Topic Modeling
The implementation of topic modeling typically involves several key steps, which we will walk through in sections A to L below.
Problem Statement
Topic Modeling for Research Articles
Researchers now have access to extensive online repositories of scientific literature. This abundance of information has made it increasingly challenging to locate pertinent articles. Implementing tagging or topic modeling offers a method to assign identifiers to research articles, thereby enhancing the recommendation and search processes.
By analyzing the abstract and title of a collection of research articles, the objective is to predict the topics associated with each article in the test set.
It is important to recognize that a research article may encompass multiple topics. The abstracts and titles of the research articles are drawn from the following six areas of study: Computer Science, Physics, Mathematics, Statistics, Quantitative Biology and Quantitative Finance.
To implement Topic Modeling, we will use a Kaggle dataset that has been curated specifically for this task.
We will use KAGGLE DATASET - https://www.kaggle.com/datasets/vin1234/janatahack-independence-day-2020-ml-hackathon
Download Link - https://www.kaggle.com/datasets/vin1234/janatahack-independence-day-2020-ml-hackathon?select=train.csv
About Kaggle Dataset
The dataset has the following columns: ID, TITLE, ABSTRACT, and (in the training split) one binary indicator column for each of the six topics.
Data Snapshot
Modeling Strategy
In the code, we will utilize BertForSequenceClassification, a model provided by the Transformers library from HuggingFace. This model is built on the BERT architecture and features a classification head, enabling it to perform predictions at the sequence level.
This article explores the concept of transfer learning, which involves initially pre-training a large neural network in an unsupervised manner, followed by fine-tuning the network for a specific task. In this instance, BERT serves as the pre-trained neural network, having been trained on two tasks: masked language modeling and next sentence prediction.
Fine-tuning involves supervised learning, which indicates that a labeled dataset is required.
Note:
In summary, this article is intended for individuals who are already well-versed in these foundational topics, as it will build upon this knowledge to explore more advanced concepts and applications related to deep learning, BERT, and the use of PyTorch for implementing various models and techniques.
This article is divided into the following sections:
A. Setting Environment
B. Set the Device (cpu or mps)
C. Reading Dataset
D. Pre-Processing
E. Explore Distribution of Labels
F. Stratified Splitting
G. Explore Distribution of Labels - Training, Validation and Testing Dataset
H. Modeling (includes training Loop)
I. Evaluation
J. Helper Function to make Predictions
K. Inference
L. Saving and Loading Utilities for Model
A. Setting up the environment and importing Python libraries
Please install the following libraries to set up the environment:
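The original package list is not reproduced here; the snippet below is a minimal, assumed setup that covers the libraries used in the rest of the walkthrough (the pip command is shown as a comment).

# Assumed minimal environment; the original package list and versions are not shown.
#   pip install torch transformers pandas numpy scikit-learn matplotlib pylatexenc scikit-multilearn

import torch
import numpy as np
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification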
B. Set Device - Check if MPS (Metal Performance Shaders) is available
PyTorch uses the new Metal Performance Shaders (MPS) backend for GPU training acceleration. This MPS backend extends the PyTorch framework, providing scripts and capabilities to set up and run operations on Mac.
Get the device
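A minimal sketch of the device selection, assuming a recent PyTorch build that ships the MPS backend:

import torch

# Use the Apple MPS backend when available, otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")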
C. Reading Dataset
Load the Topic Modeling dataset from the specified path into pandas DataFrames.
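A minimal sketch, assuming the Kaggle files train.csv and test.csv were downloaded into a local data/ folder (the paths are assumptions):

import pandas as pd

train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
print(train_df.shape, test_df.shape)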
Peek into training data
Peek into testing data
Let us now explore the pre-processing steps.
D. Preprocessing Data
1. Raw Statistics of the Training Data + Testing Data
2. Analyze the raw training dataset to perform some basic necessary checks to validate the data
2.1 Check for duplicate IDs
Check whether the ID column has unique values in the raw training dataset. We do this step just to double-check that the IDs do not contain any duplicate values.
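A short sanity check along these lines, assuming the column is named ID as in the Kaggle files:

# The number of unique IDs should equal the number of rows.
assert train_df["ID"].is_unique, "Duplicate IDs found in the raw training data"
print(train_df["ID"].nunique(), "unique IDs out of", len(train_df), "rows")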
3. Set the target Columns
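For reference, a plausible definition of the target columns, assuming the variable name target_cols and the topic column names from the Kaggle dataset:

# The six topic columns of the Kaggle dataset (names assumed from the competition data).
target_cols = [
    "Computer Science",
    "Physics",
    "Mathematics",
    "Statistics",
    "Quantitative Biology",
    "Quantitative Finance",
]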
4. Convert LaTeX-encoded text into plain text + replace newline characters with a space
Research papers generally include LaTeX-encoded text for equations, so we need to decode this LaTeX into plain text. For example, $\theta$ should resolve to θ. It can also be seen that newline characters '\n' are present inside the text, so we replace them with a space ' '.
4.1. Let's peek at a random text
4.2. Convert latex text to plain text
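One way to do this conversion is with the pylatexenc package; this is a sketch, and the article's exact implementation may differ.

from pylatexenc.latex2text import LatexNodes2Text

# Decode LaTeX fragments (e.g. $\theta$ -> θ) and replace newlines with a space.
def latex_to_plain(text: str) -> str:
    text = LatexNodes2Text().latex_to_text(text)  # resolve LaTeX macros to unicode
    return text.replace("\n", " ")

for df in (train_df, test_df):
    df["TITLE"] = df["TITLE"].apply(latex_to_plain)
    df["ABSTRACT"] = df["ABSTRACT"].apply(latex_to_plain)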
5. Set the feature 'text' column
In our research paper, we incorporate both a title and an abstract, each presented in a textual format. The title serves as a concise representation of the main topic or focus of the research, while the abstract provides a brief summary of the key objectives, methodologies, findings, and implications of the study. To enhance our analysis and improve the effectiveness of our feature extraction process, we will merge these two elements into a single, cohesive set of textual data.
This unified dataset will allow us to leverage the combined information from both the title and the abstract, facilitating a more comprehensive understanding of the research content. By treating the merged text as a singular feature, we aim to capture the essence of the research more effectively, which can be beneficial for various applications such as text classification, information retrieval, and machine learning models. This approach not only streamlines our data processing but also enriches the contextual information available for further analysis.
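A minimal sketch of the merge, assuming a simple period-separated concatenation of TITLE and ABSTRACT into a new text column:

# Merge TITLE and ABSTRACT into a single 'text' feature; the separator is an assumption.
for df in (train_df, test_df):
    df["text"] = df["TITLE"].str.strip() + ". " + df["ABSTRACT"].str.strip()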
5.1. Peek into the text column
6. Further preprocess the text using the following function to make the text cleaner
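The original function is not reproduced here; the sketch below is a representative cleaning function (lower-casing, URL removal, whitespace normalization), and the exact rules may differ from the article's.

import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)               # drop URLs
    text = re.sub(r"[^a-z0-9.,;:()\-\s]", " ", text)    # keep letters, digits, basic punctuation
    text = re.sub(r"\s+", " ", text)                    # collapse repeated whitespace
    return text.strip()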
6.1. Apply the preprocessing function
6.2. Peek into a random row
7. Drop the unwanted columns
Let us now drop the columns 'TITLE' and 'ABSTRACT' as they are no longer needed.
7.1. Verify if the columns are indeed dropped
8. Create additional columns to get an idea of sentence length, Bert tokens length
First we will set the tokenizer to be used, which is the bert-base-uncased tokenizer. We will get an idea of the number of words in each sentence, the number of tokens produced for every sentence after tokenization, and the difference between the two.
BERT uses the WordPiece algorithm to tokenize a sentence. WordPiece is a subword-based tokenization algorithm, so it can split a word into multiple tokens. Hence the number of tokens will always be greater than or equal to the number of words for every sentence.
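A sketch of this step, assuming the bert-base-uncased tokenizer and hypothetical names for the new count columns:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Word count, WordPiece token count, and their difference for every text.
train_df["num_words"] = train_df["text"].apply(lambda t: len(t.split()))
train_df["num_tokens"] = train_df["text"].apply(lambda t: len(tokenizer.tokenize(t)))
train_df["token_word_diff"] = train_df["num_tokens"] - train_df["num_words"]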
8.5. Peek at a random row
9. Let's see the distributions of the number of words, the number of tokens and their difference
9.1. Histogram of the number of words (a.k.a. sentence length)
Let's see the histogram plot.
9.3. Check the training data where the number of words is less than 10
9.4. Histogram of the length of the tokenized sentence (# of tokens created by the tokenizer)
Let's see the histogram plot.
9.5. Histogram of the difference between sentence length and BERT tokenized sentence length
Let's see the histogram plot.
10. Drop the newly created columns for lengths
11. Compute label Maps
Create two maps: one from each label name to an integer index, and the inverse map from index back to label name.
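A minimal sketch of the two maps, assuming target_cols is the list of topic columns set in step D.3:

label2id = {label: idx for idx, label in enumerate(target_cols)}
id2label = {idx: label for label, idx in label2id.items()}
print(label2id)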
E. Explore distributions of Labels
This section holds significant importance. Given that our dataset is characterized by both multi-label attributes and an imbalance, it is crucial for us to understand the distribution of each label.
1. Create a new dataframe to explore distributions of labels
2. Descriptive Statistics of each label
3. Frequency of each label
Let's see the frequency bar plot.
F. Stratified (or Balanced) Split
Stratified sampling is a statistical technique employed to guarantee that particular subgroups within a population are sufficiently represented in a sample. This method is especially beneficial when dealing with a heterogeneous population, which comprises various groups that may exhibit differing characteristics or behaviors. By segmenting the population into separate subpopulations, referred to as strata, researchers can achieve more precise and dependable estimates of the overall parameters of the population.
1. Helper Functions to Create a Balanced Split
2. Splitting the train data into training + validation data
The dataset will be partitioned into training and validation sets, ensuring that the distribution of labels is maintained through stratified sampling. In general, the first phase entails separating the data into two groups, seen and unseen, with the unseen data allocated as the testing set. Following this, the seen data is subdivided into training and validation sets. It is crucial to recognize that the seen data represents the information to which the model is exposed before it is deployed in a production setting.
Note: In our case, we already have testing dataset.
2. a) All Data - Train Data(90%) + Validation Data (10%)
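One possible realization of this 90/10 balanced split uses iterative stratification from scikit-multilearn (the approach described in the stratification reference at the end of the article); the article's helper functions may differ in detail.

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

X = train_df["text"].values.reshape(-1, 1)   # texts as a column vector
y = train_df[target_cols].values             # multi-hot label matrix

# 90% training, 10% validation, with label combinations kept balanced.
X_train, y_train, X_valid, y_valid = iterative_train_test_split(X, y, test_size=0.10)
print(X_train.shape, X_valid.shape)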
G. Explore Distribution of Labels - Training, Validation
It is essential to verify if the data has been stratified appropriately. To achieve this, we will analyze the distribution of each label across all datasets. Additionally, it is crucial to ensure that the distribution of label combinations for each dataset is preserved.
1. Check distributions of 1st order
2. Check distributions of 2nd order
Let us now see the distribution of two labels combined.
3. Check distribution of 3rd order
Let us now see the distribution of three labels combined.
H. Modeling
Let us now set the parameters to build the model.
1. Modeling Parameters
2. Preparing the dataset and dataloader
2.1. Create Custom Topic Dataset class
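A minimal sketch of such a Dataset class; the field names, the 256-token maximum length and other details are assumptions.

import torch
from torch.utils.data import Dataset

class TopicDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            # float labels because the multi-label head uses BCEWithLogitsLoss
            "labels": torch.tensor(self.labels[idx], dtype=torch.float),
        }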
2.3. Create Train Dataset, Valid Dataset
2.4. Create Train DataLoader and Valid DataLoader
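A sketch of the dataset and dataloader creation, with an assumed batch size of 16:

from torch.utils.data import DataLoader

train_dataset = TopicDataset(X_train.ravel().tolist(), y_train, tokenizer)
valid_dataset = TopicDataset(X_valid.ravel().tolist(), y_valid, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False)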
3. Create Model
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture changes.
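A sketch of the model creation; passing problem_type="multi_label_classification" makes the Transformers classification head use BCEWithLogitsLoss, which is what a multi-label problem needs.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label2id),
    problem_type="multi_label_classification",
    id2label=id2label,
    label2id=label2id,
)
model = model.to(device)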
4. Set the Optimizer
Optimizer - AdamW:
AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update.
Learning Rate:
The learning rate is a tuning parameter that controls how much the model's parameters are adjusted at each step of the optimization algorithm. It is a small positive floating-point number; for fine-tuning BERT, values on the order of 2e-5 to 5e-5 are typical.
Adam Epsilon:
The epsilon parameter appears in the Adam update step. It is primarily a guard against a near-zero second-moment estimate causing a division by zero; if it is set too large, it distorts the updates.
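A sketch of the optimizer setup; the hyperparameter values are typical BERT fine-tuning choices rather than the article's exact ones.

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8, weight_decay=0.01)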
5. Set the Scheduler
A scheduler makes the learning rate adaptive over the course of the gradient descent optimization procedure, which can improve performance and reduce training time.
In PyTorch, a model is updated by an optimizer, and the learning rate is a parameter of that optimizer.
A learning rate schedule is an algorithm that updates the learning rate used by the optimizer.
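A sketch of the scheduler setup using get_linear_schedule_with_warmup from Transformers; since the article steps the scheduler once per epoch, the schedule length here is the epoch count, and both the epoch count and the warm-up steps are assumptions.

from transformers import get_linear_schedule_with_warmup

EPOCHS = 5
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=EPOCHS
)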
6. Create the training loop
A typical training loop in PyTorch iterates over the batches for a given number of epochs.
Since we are using a learning rate scheduler for the adaptive learning rate, we call scheduler.step() to update the learning rate (as per the schedule) after every epoch.
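A minimal version of such a loop; the original may track additional metrics and checkpoints.

train_losses = []

for epoch in range(EPOCHS):
    model.train()
    epoch_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels=batch["labels"].to(device),
        )
        loss = outputs.loss              # BCEWithLogitsLoss for the multi-label setup
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    scheduler.step()                     # update the learning rate once per epoch
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    print(f"Epoch {epoch + 1}/{EPOCHS} - training loss: {avg_loss:.4f}")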
7. Training Phase
8. Draw the graph for training loss to see its progress after every epoch
I. Evaluation
1. Create the Evaluation Function
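A sketch of an evaluation pass: apply a sigmoid to the logits and threshold at 0.5 (the threshold value is an assumption).

import numpy as np
import torch

def evaluate(model, data_loader, threshold=0.5):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in data_loader:
            logits = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
            ).logits
            probs = torch.sigmoid(logits).cpu().numpy()
            all_preds.append((probs >= threshold).astype(int))
            all_labels.append(batch["labels"].numpy())
    return np.vstack(all_preds), np.vstack(all_labels)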
2. Check Performance of model on Validation Dataset
3. Classification report
3.1. Helper function to compute classification report
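A sketch of such a helper built on scikit-learn's classification_report:

from sklearn.metrics import classification_report

def compute_classification_report(y_true, y_pred, label_names):
    # Per-label precision, recall and F1 for the multi-label predictions.
    return classification_report(y_true, y_pred, target_names=label_names, zero_division=0)

y_pred, y_true = evaluate(model, valid_loader)
print(compute_classification_report(y_true, y_pred, target_cols))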
3.2. Compute the classification report
4. Compute Classification report at label level
5. Confusion Matrix
Let's see the confusion matrix for each individual label.
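One way to obtain per-label confusion matrices is scikit-learn's multilabel_confusion_matrix; a sketch:

from sklearn.metrics import multilabel_confusion_matrix

# One 2x2 matrix per label (rows: actual negative/positive, columns: predicted negative/positive).
for name, cm in zip(target_cols, multilabel_confusion_matrix(y_true, y_pred)):
    print(name)
    print(cm)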
J. Helper Functions to make predictions
We will create the following functions:
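The original list of helpers is not reproduced here; the sketch below shows one representative helper that predicts topics for a single piece of text (the threshold and maximum length are assumptions).

import torch

def predict_topics(text, model, tokenizer, threshold=0.5, max_len=256):
    model.eval()
    enc = tokenizer(
        text, truncation=True, padding="max_length",
        max_length=max_len, return_tensors="pt",
    ).to(device)
    with torch.no_grad():
        probs = torch.sigmoid(model(**enc).logits).squeeze(0).cpu().numpy()
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]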
K. Inference
The best part is that we can quickly test the model on new, previously unseen sentences.
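For illustration, a hypothetical unseen sentence run through the helper defined above:

sample = "We propose a graph neural network approach for predicting protein interactions."
print(predict_topics(sample, model, tokenizer))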
L. Saving And loading Utilities for the Model and Tokenizer
Function to save and load the model & tokenizer
In order to serve the model, for example via a REST API framework such as FastAPI, we save the trained model so that we can reuse it later.
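A sketch of such save/load utilities using the save_pretrained / from_pretrained round-trip; the output directory is an assumption.

from transformers import BertForSequenceClassification, BertTokenizer

SAVE_DIR = "saved_model/bert-topic-model"

def save_model_and_tokenizer(model, tokenizer, save_dir=SAVE_DIR):
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)

def load_model_and_tokenizer(save_dir=SAVE_DIR):
    loaded_model = BertForSequenceClassification.from_pretrained(save_dir).to(device)
    loaded_tokenizer = BertTokenizer.from_pretrained(save_dir)
    return loaded_model, loaded_tokenizer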
2. Save the trained model
3. Save the Tokenizer
4. Load the Model and Tokenizer (for future use)
5. Let's check the loaded model and tokenizer
Again, with our loaded model and tokenizer, the results are very promising.
Thank you for taking the time to read this article. I hope you have enjoyed the code.
The link to the complete Jupyter Notebook for BERT (Fine-tuning) can be found here:
References:
Stratified Sampling - On the Stratification of Multi-label data