BERT for Topic Modeling - Bidirectional Encoder Representations from Transformers - Part 5

In this article we will walk through fine-tuning BERT for Topic Modeling using PyTorch and Hugging Face's Transformers library.

Before moving ahead, it is advisable to read Fundamentals of BERT - Part 1 and Fundamentals of BERT - Part 2 to understand the core principles of BERT. You can also read Building BERT from Scratch - Part 3 to get an idea of the pre-training phase.

To fine-tune BERT, we will solve a concrete problem: Topic Modeling.

Topic Modeling can be framed as a sequence classification problem, which may be binary, multi-class, or multi-label.

In this article we will focus on the multi-label sequence classification problem.

In multi-label classification, each instance (or data point) can be associated with multiple labels. For example, in text categorization, a single document might be tagged with several topics (e.g., "sports," "health," "politics"). This contrasts with traditional single-label classification, where each instance is assigned only one label.

Topic Modeling for Research Articles

Topic Modeling for Research Articles is a multi-label sequence classification task, where each sequence is a piece of text (here, an article's title and abstract).

In the digital age, researchers are inundated with a vast array of scientific literature available through numerous online repositories. This wealth of information, while beneficial, has also made the task of locating relevant articles increasingly complex and time-consuming. As the volume of published research continues to grow exponentially, traditional search methods often fall short in efficiently connecting researchers with the specific information they need. To address this challenge, innovative techniques such as tagging and topic modeling have emerged as effective solutions for organizing and retrieving research articles.

The Need for Enhanced Search Capabilities

The sheer volume of research articles available online can overwhelm even the most diligent researcher. With thousands of new papers published daily across various disciplines, the ability to quickly identify pertinent studies is crucial. Researchers often rely on keywords, abstracts, and titles to filter through this vast sea of information. However, these methods can be limited by the variability in terminology, the specificity of search queries, and the potential for missing relevant articles that may not contain the exact keywords being searched for.

What is Topic Modeling?

Topic modeling is a statistical technique used to uncover the underlying themes or topics present within a collection of documents. By analyzing the text of research articles—specifically the abstract and title—topic modeling algorithms can identify patterns and group articles based on shared themes. This process involves the use of natural language processing (NLP) and machine learning techniques to extract meaningful insights from unstructured text data.

Implementation of Topic Modeling

The implementation of topic modeling typically involves several key steps:

  1. Data Collection: Researchers gather a substantial dataset of research articles, which may include abstracts, titles, and other relevant metadata.
  2. Preprocessing: The text data undergoes preprocessing to clean and prepare it for analysis. This may include removing stop words, stemming or lemmatization, and tokenization.
  3. Model Selection: Various topic modeling algorithms can be employed, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), or more advanced neural network-based approaches like BERT. The choice of model depends on the specific requirements of the research and the nature of the dataset.
  4. Training the Model: The selected model is trained on the dataset to identify topics based on the co-occurrence of words and phrases within the articles. Each article is then assigned a probability distribution over the identified topics.
  5. Evaluation and Refinement: The results are evaluated for coherence and relevance. Researchers may refine the model by adjusting parameters or incorporating additional data.

Problem Statement

Researchers now have access to extensive online repositories of scientific literature. This abundance of information has made it increasingly challenging to locate pertinent articles. Implementing tagging or topic modeling offers a way to assign identifiers to research articles, thereby enhancing recommendation and search.

By analyzing the abstract and title of a collection of research articles, the objective is to predict the topics associated with each article in the test set.

It is important to recognize that a research article may encompass multiple topics. The abstracts and titles of the research articles are derived from the following six areas of study:

  1. Computer Science
  2. Physics
  3. Mathematics
  4. Statistics
  5. Quantitative Biology
  6. Quantitative Finance

To implement Topic Modeling, we will use a Kaggle dataset that has been specifically curated for this task.

We will use KAGGLE DATASET - https://www.kaggle.com/datasets/vin1234/janatahack-independence-day-2020-ml-hackathon

Download Link - https://www.kaggle.com/datasets/vin1234/janatahack-independence-day-2020-ml-hackathon?select=train.csv

About Kaggle Dataset

The dataset has the following columns:

  • ID - Unique ID for each article
  • TITLE - Title of the research article
  • ABSTRACT - Abstract of the research article
  • Computer Science - Whether article belongs to topic computer science (1/0)
  • Physics - Whether article belongs to topic physics (1/0)
  • Mathematics - Whether article belongs to topic Mathematics (1/0)
  • Statistics - Whether article belongs to topic Statistics (1/0)
  • Quantitative Biology - Whether article belongs to topic Quantitative Biology (1/0)
  • Quantitative Finance - Whether article belongs to topic Quantitative Finance (1/0)

Data Snapshot

Source - Kaggle Dataset

Modeling Strategy

In the code, we will utilize BertForSequenceClassification, a model provided by the Transformers library from HuggingFace. This model is built on the BERT architecture and features a classification head, enabling it to perform predictions at the sequence level.

This article explores the concept of transfer learning, which involves initially pre-training a large neural network in an unsupervised manner, followed by fine-tuning the network for a specific task. In this instance, BERT serves as the pre-trained neural network, having been trained on two tasks: masked language modeling and next sentence prediction.

Fine-tuning involves supervised learning, which indicates that a labeled dataset is required.

Note:

  1. This article presupposes that the reader possesses a fundamental understanding of several key concepts and technologies that are essential for effectively engaging with the content presented herein. Specifically, it assumes familiarity with deep learning, which is a subset of machine learning that focuses on algorithms inspired by the structure and function of the brain, particularly artificial neural networks.
  2. Additionally, the reader should have a working knowledge of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing model developed by Google. BERT is designed to understand the context of words in a sentence by considering the words that come before and after them, making it particularly effective for tasks such as text classification, question answering, and language inference.
  3. Furthermore, proficiency in the PyTorch framework is also assumed. PyTorch is an open-source machine learning library widely used for applications in deep learning and artificial intelligence. It provides a flexible and dynamic computational graph, which allows for easy experimentation and debugging, making it a popular choice among researchers and practitioners in the field.

In summary, this article is intended for individuals who are already well-versed in these foundational topics, as it will build upon this knowledge to explore more advanced concepts and applications related to deep learning, BERT, and the use of PyTorch for implementing various models and techniques.

This article is divided into the following sections:

A. Setting Environment

B. Set the Device (cpu or mps)

C. Reading Dataset

D. Pre-Processing

E. Explore Distribution of Labels

F. Stratified Splitting

G. Explore Distribution of Labels - Training and Validation Dataset

H. Modeling (includes training Loop)

I. Evaluation

J. Helper Function to make Predictions

K. Inference

L. Saving and Loading Utilities for Model

A. Setting up the environment and importing Python libraries

Please install the following libraries to set up the environment:

  • pandas
  • numpy
  • sklearn
  • pytorch
  • transformers
  • matplotlib
  • seaborn
  • scikit-multilearn
  • pylatexenc


Setting up the Environment
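Below is a minimal sketch of the imports used throughout the rest of the article, assuming the libraries listed above have already been installed (for example with pip). The exact import list in the original notebook may differ.

```python
# Minimal import sketch; assumes pandas, numpy, torch, transformers, scikit-learn,
# scikit-multilearn, pylatexenc, matplotlib and seaborn are installed.
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import seaborn as sns

from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import classification_report, multilabel_confusion_matrix
from skmultilearn.model_selection import iterative_train_test_split
from pylatexenc.latex2text import LatexNodes2Text
```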

B. Set Device - Check if MPS (Metal Performance Shaders) is available

PyTorch uses the new Metal Performance Shaders (MPS) backend for GPU training acceleration. This MPS backend extends the PyTorch framework, providing scripts and capabilities to set up and run operations on Mac.

Function to set the device

Get the device

Set the device
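A minimal sketch of the device-selection helper is shown below; it prefers MPS, falls back to CUDA if present, and otherwise uses the CPU. The function name is illustrative.

```python
import torch

def get_device() -> torch.device:
    """Return the best available device: MPS (Apple GPU), CUDA, or CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = get_device()
print(f"Using device: {device}")
```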

C. Reading Dataset

Load the Topic Modeling dataset from the specified path into a pandas dataframe.

Reading Dataset - Training and Testing
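A short sketch of loading the CSV files with pandas; the file paths below are placeholders for wherever train.csv and test.csv from the Kaggle dataset were downloaded.

```python
import pandas as pd

# Paths are placeholders; point them at the downloaded Kaggle files.
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

print(train_df.shape, test_df.shape)
train_df.head()
```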

Peek into training data

training data

Peek into testing data

testing data (without labels)

Let us now explore the pre-processing steps.

D. Preprocessing Data

1. Raw Statistics of the Training Data + Testing Data

2. Analyze the raw training dataset to perform some basic necessary checks to validate the data

2.1 Check for duplicate IDs

Check whether the ID column has unique values in the raw training dataset. This step simply double-checks that the IDs contain no duplicates.

Check Duplicate IDs
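A sanity check along these lines is enough; it assumes the dataframe is named train_df as in the reading step above.

```python
# The ID column should contain one unique value per article.
print("Duplicate IDs:", train_df["ID"].duplicated().sum())
assert train_df["ID"].is_unique, "Duplicate IDs found in the training data"
```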

3. Set the target Columns

Target Variables

4. Convert LaTeX-encoded text into plain text + replace newline characters with a space

Research papers generally include LaTeX-encoded text for equations, so we need to decode this LaTeX into plain text. For example, $\\theta$ should resolve into θ. The text also contains newline characters '\n', which we replace with a space ' '.

4.1. Let's peek into a random text

4.2. Convert latex text to plain text

convert latex text into plain text
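One possible implementation with pylatexenc is sketched below; it decodes LaTeX fragments to plain text and replaces newlines with spaces. The helper name is an assumption.

```python
from pylatexenc.latex2text import LatexNodes2Text

latex_converter = LatexNodes2Text()

def latex_to_plain(text: str) -> str:
    """Decode LaTeX (e.g. '$\\theta$' -> 'θ') and drop newline characters."""
    return latex_converter.latex_to_text(text).replace("\n", " ")

for col in ["TITLE", "ABSTRACT"]:
    train_df[col] = train_df[col].apply(latex_to_plain)
    test_df[col] = test_df[col].apply(latex_to_plain)
```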

5. Set the feature 'text' column

Each research article in our dataset has both a title and an abstract, presented in textual form. The title is a concise statement of the main topic or focus of the research, while the abstract briefly summarizes the key objectives, methodologies, findings, and implications of the study. To improve our feature extraction, we will merge these two elements into a single, cohesive piece of text.

This unified dataset will allow us to leverage the combined information from both the title and the abstract, facilitating a more comprehensive understanding of the research content. By treating the merged text as a singular feature, we aim to capture the essence of the research more effectively, which can be beneficial for various applications such as text classification, information retrieval, and machine learning models. This approach not only streamlines our data processing but also enriches the contextual information available for further analysis.

Create the 'text' feature column
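A simple way to build the merged feature is shown below; the column name "text" and the ". " separator are assumptions, chosen so that the sentence boundary between title and abstract survives.

```python
# Merge the title and abstract into a single 'text' feature column.
train_df["text"] = train_df["TITLE"] + ". " + train_df["ABSTRACT"]
test_df["text"] = test_df["TITLE"] + ". " + test_df["ABSTRACT"]
```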

5.1. Peek into the text column

Latex Text converted into plain text

6. Further preprocess the text using the following function to make the text cleaner

Advanced Preprocessing and Cleaning Function
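The exact cleaning rules used in the original notebook are not reproduced here; the sketch below shows one reasonable set of steps (lowercasing, stripping URLs, removing unusual characters, collapsing whitespace).

```python
import re

def clean_text(text: str) -> str:
    """Basic text cleanup; the rules here are illustrative, not exhaustive."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # drop URLs
    text = re.sub(r"[^a-z0-9.,;:()\-\s]", " ", text)      # keep letters, digits, basic punctuation
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    return text
```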

6.1. Apply the preprocessing function

Further preprocess the text

6.2. Peek into a random row

Random row of training data

7. Drop the unwanted columns

Let us now drop the columns 'TITLE' and 'ABSTRACT' as they are no longer needed.

7.1. Verify if the columns are indeed dropped

8. Create additional columns to get an idea of sentence length and BERT token length

First we will set the tokenizer to be used, which is the bert-base-uncased tokenizer. We will get an idea of the number of words in each sentence, the number of tokens produced for every sentence after tokenization, and the difference between the two.

BERT uses the WordPiece algorithm to tokenize a sentence. WordPiece is a subword-based tokenization algorithm, so it can split a word into multiple tokens. Hence the number of tokens will always be greater than or equal to the number of words for every sentence.

New columns to store various lengths
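A sketch of the length features follows, assuming the bert-base-uncased tokenizer and the illustrative column names num_words, num_tokens and len_diff.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

train_df["num_words"] = train_df["text"].apply(lambda t: len(t.split()))
train_df["num_tokens"] = train_df["text"].apply(lambda t: len(tokenizer.tokenize(t)))
train_df["len_diff"] = train_df["num_tokens"] - train_df["num_words"]
```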

8.5. Peek at a random row

Peek at a Random Row

9. Let's see the distributions of the number of words, the number of tokens, and their difference

9.1. Histogram of number of words (a.k.a sentence length)

Compute Histogram for number of words in a text

Let's see the histogram plot.

Histogram Plot for Sentence Length

9.3. Check the training data where the number of words is less than 10

Training data where the number of words is less than 10

9.4. Histogram of length of tokenized sentence (# of tokens created by tokenizer)

Compute Histogram Plot for number of tokens

Let's see the histogram plot.

Histogram Plot for tokenized sentence length

9.5. Histogram of the difference between sentence length and BERT tokenized sentence length

Compute Histogram for difference in length

Let's see the histogram plot.

Histogram Plot for difference in length

10. Drop the newly created columns for lengths

Training Data with necessary columns required for modeling

11. Compute label Maps

Create two maps:

  1. label2id - maps each string label to its integer index
  2. id2label - maps each integer index back to its string label

Label2Id Map
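The two maps can be built directly from the six target columns; target_cols is assumed to hold the label column names in a fixed order.

```python
target_cols = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

label2id = {label: idx for idx, label in enumerate(target_cols)}
id2label = {idx: label for label, idx in label2id.items()}
print(label2id)
```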

E. Explore distributions of Labels

This section holds significant importance. Given that our dataset is characterized by both multi-label attributes and an imbalance, it is crucial for us to understand the distribution of each label.

1. Create a new dataframe to explore distributions of labels

2. Descriptive Statistics of each label

Descriptive Statistics (Note: this gives some information about the spread using quantiles)

3. Frequency of each label

Compute Frequency Plot

Let's see the bar frequency plot.

Frequency Plot

F. Stratified (or Balanced) Split

Stratified sampling is a statistical technique employed to guarantee that particular subgroups within a population are sufficiently represented in a sample. This method is especially beneficial when dealing with a heterogeneous population, which comprises various groups that may exhibit differing characteristics or behaviors. By segmenting the population into separate subpopulations, referred to as strata, researchers can achieve more precise and dependable estimates of the overall parameters of the population.

1. Helper Function to Create a Balanced Split

Helper Function for Stratified Split
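One way to obtain a balanced multi-label split is scikit-multilearn's iterative stratification, sketched below; the wrapper name stratified_split is an assumption, and the original notebook's helper may be implemented differently.

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

def stratified_split(df, target_cols, test_size=0.1):
    """Split texts and multi-hot labels while preserving label proportions."""
    X = df["text"].to_numpy().reshape(-1, 1)   # the iterative split expects 2-D X
    y = df[target_cols].to_numpy()
    X_train, y_train, X_valid, y_valid = iterative_train_test_split(X, y, test_size=test_size)
    return X_train.ravel(), y_train, X_valid.ravel(), y_valid
```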

2. Splitting the train data into training + validation data

The dataset will be partitioned into training and validation sets, ensuring that the distribution of labels is maintained through stratified sampling. In general, the first phase categorizes the data into two groups, seen and unseen, with the unseen data allocated as the testing set. The seen data is then subdivided into training and validation sets. It is important to recognize that the seen data represents the information the model is exposed to before it is deployed in production.

Note: In our case, we already have a separate testing dataset.

Data Splitting Strategy

2. a) All Data = Train Data (90%) + Validation Data (10%)

Seen Data (= the original training data) = Train Data + Validation Data

G. Explore Distribution of Labels - Training and Validation

It is essential to verify if the data has been stratified appropriately. To achieve this, we will analyze the distribution of each label across all datasets. Additionally, it is crucial to ensure that the distribution of label combinations for each dataset is preserved.

Compute one hot encoding for unique labels present in a sentence

1. Check distributions of 1st order

Proportion of each label for train + valid dataset
Proportion dataframe
Computation
Distribution of labels across different datasets

2. Check distributions of 2nd order

Let us now see the distribution of two labels combined.

Distribution of 2nd order

3. Check distribution of 3rd order

Let us now see the distribution of three labels combined.

Distribution of 3rd order

H. Modeling

Let us now set the parameters to build the model.

1. Modeling Parameters

Modeling Parameters + Tokenizer
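Typical fine-tuning settings for BERT-base are sketched below; the exact values used in the original notebook may differ.

```python
from transformers import BertTokenizer

MODEL_NAME = "bert-base-uncased"
MAX_LEN = 256            # maximum sequence length after tokenization
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 32
EPOCHS = 4
LEARNING_RATE = 2e-5

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
```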

2. Preparing the dataset and dataloader

2.1. Create Custom Topic Dataset class

TopicDataset - Custom Dataset class for Topic Modeling
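A minimal version of the custom dataset class might look like the following; it returns exactly the tensors BertForSequenceClassification expects, with float labels because the multi-label loss is binary cross-entropy.

```python
import torch
from torch.utils.data import Dataset

class TopicDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts, self.labels = texts, labels
        self.tokenizer, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float),
        }
```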

2.3. Create Train Dataset, Valid Dataset

Train and Valid Dataset

2.4. Create Train DataLoader and Valid DataLoader

Train and Valid DataLoader
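The datasets and loaders can then be wired up as below, assuming X_train, y_train, X_valid and y_valid came out of the stratified split from section F; only the training loader is shuffled.

```python
from torch.utils.data import DataLoader

# X_train, y_train, X_valid, y_valid are assumed to come from stratified_split(...) in section F.
train_dataset = TopicDataset(X_train, y_train, tokenizer, MAX_LEN)
valid_dataset = TopicDataset(X_valid, y_valid, tokenizer, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=VALID_BATCH_SIZE, shuffle=False)
```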

3. Create Model

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, including question answering and language inference, without substantial modifications to the task-specific architecture.

Create BERT Model
Architecture of BERT model
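Creating the model itself is a one-liner with the Transformers library. Setting problem_type to "multi_label_classification" makes the model use BCEWithLogitsLoss internally, which is what a multi-label head needs; the variables target_cols, id2label, label2id and device come from the earlier sketches.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(target_cols),
    problem_type="multi_label_classification",  # switches the loss to BCEWithLogitsLoss
    id2label=id2label,
    label2id=label2id,
)
model.to(device)
```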

4. Set the Optimizer

Optimizer - AdamW:

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update.

Learning Rate:

The learning rate is a tuning parameter that controls how much a model's parameters are adjusted during each iteration of an optimization algorithm. For fine-tuning BERT, small values in the range of about 2e-5 to 5e-5 are typical.

Adam Epsilon:

The parameter epsilon appears in the parameter update step of Adam:

θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)

It is primarily a guard against a near-zero second-moment estimate causing a division by zero. If it is set too large, it dampens the adaptive scaling of the updates.

Set learning rate and epsilon
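With the learning rate and epsilon above, the optimizer can be created as follows; PyTorch's own AdamW implementation is used here.

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
```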

5. Set the Scheduler

A scheduler makes the learning rate adaptive over the course of the gradient descent optimization procedure, which can improve performance and reduce training time.

In PyTorch, a model is updated by an optimizer, and the learning rate is a parameter of the optimizer.

A learning rate schedule is an algorithm that updates the learning rate in an optimizer.

Learning Scheduler
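A common choice with BERT is the linear warmup-then-decay schedule from the Transformers library, sketched below; the zero warmup steps are an assumption, and the original notebook may use a different scheduler.

```python
from transformers import get_linear_schedule_with_warmup

total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,          # assumption: no warmup
    num_training_steps=total_steps,
)
```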

6. Create the training loop

A typical training loop in PyTorch iterates over the batches for a given number of epochs.

  • In each batch iteration, we first compute the forward pass to obtain the neural network outputs.
  • Then, we reset the gradient from the previous iteration and perform backpropagation to obtain the gradient of the loss with respect to the model weights.
  • Finally, we update the weights based on the loss gradients using the optimizer (AdamW in our case).

Since we are using a learning-rate scheduler, we also call scheduler.step() so that the learning rate is updated as the schedule dictates.

Function to train the model - 1st part
Function to train the model - 2nd part
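A condensed sketch of such a training loop is shown below. With the linear warmup schedule sketched earlier, scheduler.step() is called after every optimizer step; an epoch-level scheduler would instead be stepped once per epoch. The loss comes straight out of the model's forward pass because the labels are passed in.

```python
def train_model(model, loader, optimizer, scheduler, device, epochs):
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0.0
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)        # forward pass; returns loss and logits
            loss = outputs.loss

            optimizer.zero_grad()           # reset gradients from the previous step
            loss.backward()                 # backpropagation
            optimizer.step()                # weight update
            scheduler.step()                # learning-rate update (per step here)

            epoch_loss += loss.item()
        print(f"Epoch {epoch + 1}/{epochs} - training loss: {epoch_loss / len(loader):.4f}")
```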

7. Training Phase

Training intermediate results
Training full results for every epoch

8. Draw the graph for training loss to see its progress after every epoch

Training Loss
Training Loss for every epoch

I. Evaluation

1. Create the Evaluation Function

Function to evaluate the Model - Part 1
Function to evaluate the Model - Part 2
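The evaluation loop mirrors the training loop without the backward pass; a sketch is shown below. A sigmoid turns the logits into per-label probabilities, and a 0.5 threshold (an assumption) turns those into 0/1 predictions.

```python
import numpy as np
import torch

@torch.no_grad()
def evaluate_model(model, loader, device, threshold=0.5):
    model.eval()
    all_preds, all_labels, total_loss = [], [], 0.0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        total_loss += outputs.loss.item()
        probs = torch.sigmoid(outputs.logits)
        all_preds.append((probs >= threshold).long().cpu().numpy())
        all_labels.append(batch["labels"].long().cpu().numpy())
    return total_loss / len(loader), np.vstack(all_preds), np.vstack(all_labels)
```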

2. Check Performance of model on Validation Dataset

Validation Loss + Accuracy

3. Classification report

3.1. Helper function to compute classification report

Function to compute classification report

3.2. Compute the classification report

Classification Report for Validation Dataset

4. Compute Classification report at label level

Multi-label Classification Report For each Label

5. Confusion Matrix

Compute Confusion Matrix

Let's see the confusion matrix for each individual label.

Confusion Matrix for each label

J. Helper Functions to make predictions

We will create the following functions:

  1. make_prediction - Function to compute CLEAN predictions

Function to make clean prediction
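A sketch of such a helper is given below; the function name, the 0.5 threshold, and the reuse of the earlier tokenizer, model, device and id2label objects are all assumptions.

```python
import torch

@torch.no_grad()
def make_prediction(text, model, tokenizer, device, threshold=0.5, max_len=256):
    """Return the list of predicted topic names for a single piece of text."""
    model.eval()
    enc = tokenizer(
        text, truncation=True, padding="max_length",
        max_length=max_len, return_tensors="pt",
    ).to(device)
    probs = torch.sigmoid(model(**enc).logits).squeeze(0)
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

# Example usage (hypothetical input):
# make_prediction("A deep learning approach to galaxy image classification", model, tokenizer, device)
```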

K. Inference

The best part is the ability to quickly test the model on new, previously unseen sentences.

Predicting Labels on unseen text

L. Saving and Loading Utilities for the Model and Tokenizer


In order to serve the model, for example via a REST API framework such as FastAPI, we save the trained model so that we can reuse it later.

  1. Function to save and load the model & tokenizer

Functions to save and load model + tokenizer
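Both the model and the tokenizer can be persisted with the save_pretrained / from_pretrained pair, as sketched below; the directory path is a placeholder.

```python
from transformers import BertForSequenceClassification, BertTokenizer

SAVE_DIR = "saved_model/bert-topic-model"   # placeholder path

def save_model(model, tokenizer, path=SAVE_DIR):
    model.save_pretrained(path)
    tokenizer.save_pretrained(path)

def load_model(path=SAVE_DIR):
    model = BertForSequenceClassification.from_pretrained(path)
    tokenizer = BertTokenizer.from_pretrained(path)
    return model, tokenizer
```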

2. Save the trained model

Saving the model

3. Save the Tokenizer

Saving the tokenizer

4. Load the Model and Tokenizer (for future use)

Load the Saved Model + Tokenizer

5. Let's check the loaded model and tokenizer

Results via loaded Model

Again, with our loaded model and tokenizer, the results are very promising.


Thank you for taking the time to read this article. I hope you have enjoyed the code.

The link to complete Jupyter Notebook for BERT (Fine-tuning) can be found here:

BERT for Topic Modelling - Jupyter Notebook


References:

Dataset - Janatahack: Independence Day 2020 ML Hackathon (Kaggle)

Stratified Sampling - On the Stratification of Multi-label data
