登录查看更多内容

BERT for Topic Modeling - Bidirectional Encoders Representation of Transformers - Part 5

Akash K.

发布日期: 2024年10月2日

In this article we will delve into the implementation of fine-tuning BERT by performing Topic Modeling using PyTorch and Hugging Face Transformer's Library.

Before you move ahead it is advisable to read Fundamentals of BERT - Part 1, Fundamentals of BERT - Part 2 to get an idea of the core principles of BERT. Further, you can read Building BERT from Scratch - Part 3 to get an idea about the pre-training phase.

In order to fine-tune BERT we will try to solve a problem- TOPIC MODELING.

Topic Modeling is a kind of sequence classification problem. It can be binary or multi-class or multi-label sequence classification.

In this article we will focus on multi-label sequence classification problem.

In Multi-label Classification, each instance (or data point) can be associated with multiple labels. For example, in text categorization, a single document might be tagged with multiple topics (e.g., "sports," "health," "politics"). - This contrasts with traditional single-label classification, where each instance is assigned only one label.

Topic Modeling for Research Articles

Topic Modeling for Research Articles is a kind of Multi-label Sequence Classification, where each sequence is a textual information.

In the digital age, researchers are inundated with a vast array of scientific literature available through numerous online repositories. This wealth of information, while beneficial, has also made the task of locating relevant articles increasingly complex and time-consuming. As the volume of published research continues to grow exponentially, traditional search methods often fall short in efficiently connecting researchers with the specific information they need. To address this challenge, innovative techniques such as tagging and topic modeling have emerged as effective solutions for organizing and retrieving research articles.

The Need for Enhanced Search Capabilities

The sheer volume of research articles available online can overwhelm even the most diligent researcher. With thousands of new papers published daily across various disciplines, the ability to quickly identify pertinent studies is crucial. Researchers often rely on keywords, abstracts, and titles to filter through this vast sea of information. However, these methods can be limited by the variability in terminology, the specificity of search queries, and the potential for missing relevant articles that may not contain the exact keywords being searched for.

What is Topic Modeling?

Topic modeling is a statistical technique used to uncover the underlying themes or topics present within a collection of documents. By analyzing the text of research articles—specifically the abstract and title—topic modeling algorithms can identify patterns and group articles based on shared themes. This process involves the use of natural language processing (NLP) and machine learning techniques to extract meaningful insights from unstructured text data.

Implementation of Topic Modeling

The implementation of topic modeling typically involves several key steps:

Data Collection: Researchers gather a substantial dataset of research articles, which may include abstracts, titles, and other relevant metadata.
Preprocessing: The text data undergoes preprocessing to clean and prepare it for analysis. This may include removing stop words, stemming or lemmatization, and tokenization.
Model Selection: Various topic modeling algorithms can be employed, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), or more advanced neural network-based approaches like BERT. The choice of model depends on the specific requirements of the research and the nature of the dataset.
Training the Model: The selected model is trained on the dataset to identify topics based on the co-occurrence of words and phrases within the articles. Each article is then assigned a probability distribution over the identified topics.
Evaluation and Refinement: The results are evaluated for coherence and relevance. Researchers may refine the model by adjusting parameters or incorporating additional.

Problem Statement

Topic Modeling for Research Articles Researchers now have access to extensive online repositories of scientific literature. This abundance of information has made it increasingly challenging to locate pertinent articles. Implementing tagging or topic modeling offers a method to assign identifiers to research articles, thereby enhancing the recommendation and search processes.

By analyzing the abstract and title of a collection of research articles, the objective is to predict the topics associated with each article in the test set.

It is important to recognize that a research article may encompass multiple topics. The abstracts and titles of the research articles are derived from the following six areas of study:

Computer Science
Physics
Mathematics
Statistics
Quantitative Biology
Quantitative Finance

To implement Topic Modeling , we will utilize a Kaggle dataset that has been specifically curated for Topic Modeling task.

We will use KAGGLE DATASET - https://www.kaggle.com/datasets/vin1234/janatahack-independence-day-2020-ml-hackathon

Download Link - https://www.kaggle.com/datasets/vin1234/janatahack-independence-day-2020-ml-hackathon?select=train.csv

About Kaggle Dataset

The dataset have following columns:

ID - Unique ID for each article
TITLE - Title of the research article
ABSTRACT - Abstract of the research article
Computer Science - Whether article belongs to topic computer science (1/0)
Physics - Whether article belongs to topic physics (1/0)
Mathematics - Whether article belongs to topic Mathematics (1/0)
Statistics - Whether article belongs to topic Statistics (1/0)
Quantitative Biology - Whether article belongs to topic Quantitative Biology (1/0)
Quantitative Finance - Whether article belongs to topic Quantitative Finance (1/0)

Data Snapshot

Modeling Strategy

In the code, we will utilize BertForSequenceClassification, a model provided by the Transformers library from HuggingFace. This model is built on the BERT architecture and features a classification head, enabling it to perform predictions at the sequence level.

This article explores the concept of transfer learning, which involves initially pre-training a large neural network in an unsupervised manner, followed by fine-tuning the network for a specific task. In this instance, BERT serves as the pre-trained neural network, having been trained on two tasks: masked language modeling and next sentence prediction.

Fine-tuning involves supervised learning, which indicates that a labeled dataset is required.

Note:

This article presupposes that the reader possesses a fundamental understanding of several key concepts and technologies that are essential for effectively engaging with the content presented herein. Specifically, it assumes familiarity with deep learning, which is a subset of machine learning that focuses on algorithms inspired by the structure and function of the brain, particularly artificial neural networks.
Additionally, the reader should have a working knowledge of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing model developed by Google. BERT is designed to understand the context of words in a sentence by considering the words that come before and after them, making it particularly effective for tasks such as text classification, question answering, and language inference.
Furthermore, proficiency in the PyTorch framework is also assumed. PyTorch is an open-source machine learning library widely used for applications in deep learning and artificial intelligence. It provides a flexible and dynamic computational graph, which allows for easy experimentation and debugging, making it a popular choice among researchers and practitioners in the field.

In summary, this article is intended for individuals who are already well-versed in these foundational topics, as it will build upon this knowledge to explore more advanced concepts and applications related to deep learning, BERT, and the use of PyTorch for implementing various models and techniques.

This article is divided into following sections:

A. Setting Environment

B. Set the Device (cpu or mps)

C. Reading Dataset

D. Pre-Processing

E. Explore Distribution of Labels

F. Stratified Splitting

G. Explore Distribution of Labels - Training, Validation and Testing Dataset

H. Modeling (includes training Loop)

I. Evaluation

J. Helper Function to make Predictions

K. Inference

L. Saving and Loading Utilities for Model

A. Setting up the environment and importing Python libraries

Please install the following libraries to set up the environment:

pandas
numpy
sklearn
pytorch
transformers
matplotlib
seaborn
scikit-multilearn
pylatexenc

B. Set Device - Check if MPS (Metal Performance Shaders) is available

PyTorch uses the new Metal Performance Shaders (MPS) backend for GPU training acceleration. This MPS backend extends the PyTorch framework, providing scripts and capabilities to set up and run operations on Mac.

Get the device

C. Reading Dataset

Load the Topic Modeling dataset provided from a specified path into a pandas dataframe.

Peek into training data

Peek into testing data

Let us now explore the pre-processing steps.

D. Preprocessing Data

1. Raw Statistics of the Training Data + Testing Data

2. Analyze the raw training dataset to perform some basic necessary checks to validate the data

2.1 Check for duplicate IDs

Check if the ID column have unique values in raw training dataset. We are doing this step just to make sure and double check that ID do not have any duplicate values.

3. Set the target Columns

4. Convert Latex encoded text into plain text + replace newline character by a space

Generally, Research paper includes latex encoded text to create equations. So we need to decode to this encoded latex into plain text. For example $\\theta$ should get resolve into θ. Also, it can be see that there is a newline character '\n' present inside the text. So we need to remove this by a space ' '.

4.1. Lets peek into a random text

4.2. Convert latex text to plain text

5. Set the feature 'text' column

In our research paper, we incorporate both a title and an abstract, each presented in a textual format. The title serves as a concise representation of the main topic or focus of the research, while the abstract provides a brief summary of the key objectives, methodologies, findings, and implications of the study. To enhance our analysis and improve the effectiveness of our feature extraction process, we will merge these two elements into a single, cohesive set of textual data.

This unified dataset will allow us to leverage the combined information from both the title and the abstract, facilitating a more comprehensive understanding of the research content. By treating the merged text as a singular feature, we aim to capture the essence of the research more effectively, which can be beneficial for various applications such as text classification, information retrieval, and machine learning models. This approach not only streamlines our data processing but also enriches the contextual information available for further analysis.

5.1. Peek into the text column

6. Further preprocess the text using the following function to make the text cleaner

Advanced Preprocessing and Cleaning Function

6.1. Apply the preprocessing function

6.2. Peek into a random row

7. Drop the unwanted columns

Let us now drop the columns 'TITLE' and 'ABSTRACT' as they are no longer needed.

7.1. Verify if the columns are indeed dropped

8. Create additional columns to get an idea of sentence length, Bert tokens length

First we will set the tokenizer to be used which is a Bert-Base-Uncased tokenizer. We will try to get an idea of the number of words in sentence, number of tokens produce for every sentence after tokenization and the differences between them.

BERT uses Word-piece algorithm to tokenize a sentence. WordPiece is a subword-based tokenization algorithm and so it can split a word into multiple tokens. Hence number of tokens will always be greater than or equal to number of words for every sentence.

8.5. Peek at a random row

9. Lets see the distributions of number of words, number of tokens and their difference

9.1. Histogram of number of words (a.k.a sentence length)

Compute Histogram for number of words in a text

Lets see the histogram plot.

9.3. Check the training data where number of words are less than 10

9.4. Histogram of length of tokenized sentence (# of tokens created by tokenizer)

Compute Histogram Plot for number of tokens

Lets see the histogram plot.

Histogram Plot for tokenized sentence length

9.5. Histogram of difference in length between sentence length and bert tokenized sentence length

Compute Histogram for difference in length

Lets see the histogram plot.

10. Drop the newly created columns for lengths

Training Data with necessary columns required for modeling

11. Compute label Maps

Create two maps:

领英推荐

10 Best AI Frameworks for Developers

TETRAHED INC 1 年前

The Indispensable NumPy:

Africa Data School 2 个月前

BASIC MATHEMATICS FOR DEEP LEARNING (AI) PART 2 -…

Saify Technologies 2 年前

label2id - maps string labels to its integer index
id2label - maps integer index of labels to strings

E. Explore distributions of Labels

This section holds significant importance. Given that our dataset is characterized by both multi-label attributes and an imbalance, it is crucial for us to understand the distribution of each label.

1. Create a new dataframe to explore distributions of labels

2. Descriptive Statistics of each label

Descriptive Statistics (Note: This gives some information about the spread using qunatiles)

3. Frequency of each label

Lets see the bar frequency plot.

F. Stratified (or Balanced) Split

Stratified sampling is a statistical technique employed to guarantee that particular subgroups within a population are sufficiently represented in a sample. This method is especially beneficial when dealing with a heterogeneous population, which comprises various groups that may exhibit differing characteristics or behaviors. By segmenting the population into separate subpopulations, referred to as strata, researchers can achieve more precise and dependable estimates of the overall parameters of the population.

1. Helper Functions to Create Balanced Split function

2. Spliting the train data into training + validation data

The dataset will be partitioned into training, validation, ensuring that the distribution of labels is maintained through the use of stratified sampling. The first phase entails categorizing the data into two groups: seen and unseen, with the unseen data allocated as the testing set. Following this, the seen data will be subdivided into training and validation sets. It is crucial to recognize that the seen data represents the information to which the model is exposed before it is implemented in a production setting.

Note: In our case, we already have testing dataset.

2. a) All Data - Train Data(90%) + Validation Data (10%)

(Original Training Data =)Seen Data = Train Data + Validation Data

G. Explore Distribution of Labels - Training, Validation

It is essential to verify if the data has been stratified appropriately. To achieve this, we will analyze the distribution of each label across all datasets. Additionally, it is crucial to ensure that the distribution of label combinations for each dataset is preserved.

Compute one hot encoding for unique labels present in a sentence

1. Check distributions of 1st order

Proportion of each label for train + valid dataset

Distribution of labels across different datasets

2. Check distributions of 2nd order

Let us now see the distribution of two labels combined.

3. Check distribution of 3rd order

Let us now see the distribution of three labels combined.

H. Modelling

Let us now set the parameters to build the model.

1. Modelling Parameters

2. Preparing the dataset and dataloader

2.1. Create Custom Topic Dataset class

TopicDataset - Custom Dataset class for Topic Modeling

2.3. Create Train Dataset, Valid Dataset

2.4. Create Train DataLoader and Valid DataLoader

3. Create Model

We present a novel language representation model known as BERT, which stands for Bidirectional Encoder Representations from Transformers. In contrast to other recent models, BERT is specifically engineered to pre-train deep bidirectional representations from unlabeled text by simultaneously considering both left and right contexts across all layers. Consequently, the pre-trained BERT model can be easily fine-tuned by adding a single output layer, enabling the development of cutting-edge models for various tasks, including question answering and language inference, without the need for significant alterations to the task-specific architecture.

4. Set the Optimizer

Optimizer - AdamW:

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam, by decoupling weight decay from the gradient update.

Learning Rate:

Learning rate is a tuning parameter in machine learning and statistics that controls how much a model's parameters adjust during each iteration of an optimization algorithm. It's a floating point number that's usually between 0.01 and 0.1

Adam Epsilon:

The parameter epsilon shows up in the update step.

It is primarily used as a guard against a zero second second moment causing a division by zero case. If it is too large it will bias the moment estimation.

5. Set the Scheduler

A scheduler is to make learning rate adaptive to the gradient descent optimization procedure, so you can increase performance and reduce training time.

In PyTorch, a model is updated by an optimizer and learning rate is a parameter of the optimizer.

Learning rate schedule is an algorithm to update the learning rate in an optimizer.

6. Create the training loop

A typical training loop in PyTorch iterates over the batches for a given number of epochs.

In each batch iteration, we first compute the forward pass to obtain the neural network outputs.
Then, we reset the gradient from the previous iteration and perform backpropagation to obtain the gradient of the loss with respect to the model weights.
Finally, we update the weights based on the loss gradients using stochastic gradient descent.

Since we are using learning scheduler for the adaptive learning rate, we call scheduler.step() to update the learning rate (as per the learning scheduler) after every epoch.

7. Training Phase

8. Draw the graph for training loss to see its progress after every epoch

I. Evaluation

1. Create the Evaluation Function

2. Check Performance of model on Validation Dataset

3. Classification report

3.1. Helper function to compute classification report

3.2. Compute the classification report

Classification Report for Validation Dataset

4. Compute Classification report at label level

Multi-label Classification Report For each Label

5. Confusion Matrix

Lets see the confusion matrix for each individual label.

J. Helper Functions to make predictions

We will create the following functions:

make_prediction - Function to compute CLEAN predictions

K. Inference

The most advantageous aspect is the ability to swiftly evaluate the model using novel, previously unencountered sentences.

L. Saving And loading Utilities for the Model and Tokenizer

Function to save and load the model & tokenizer

n order to serve the model generally via rest-api like FASTApi we save the trained model so that we can reuse it later.

Function to save and load the model & tokenizer

Functions to save and load model + tokenizer

2. Save the trained model

3. Save the Tokenizer

4. Load the Model and Tokenizer (for future use)

5. Lets check the Loaded model and tokenizer

Again, with our loaded model and tokenizer, the results are very promising ??

Thank you, for taking out time and reading this article. I hope you have enjoyed the code.

The link to complete Jupyter Notebook for BERT (Fine-tuning) can be found here:

BERT for Topic Modelling - Jupyter Notebook

References:

Dataset - Name Entity Recognition (NER) Dataset (KAGGLE)

Stratified Sampling - On the Stratification of Multi-label data

Ashutosh Shukla, Ph.D.

Senior AI/ML Scientist at NatWest Group (Data and Analytics)

5 个月

Very interesting article ! This article provides the clear steps to perform Topic Modelling via BERT. The detailed approach presented is simply amazing !

1 次回应

Piyush Kapoor

Senior Data Scientist

5 个月

Interesting

1 次回应

Saurabh Choudhary

BNY | Amazon | IIM Calcutta | IIT Delhi

5 个月

Very well articulated

1 次回应

Amit Bisai

Manager, Valuation Specialist at Deloitte, Ireland | CQF, FRM

5 个月

Loved it. Good continuation article on BERT

1 次回应

查看更多评论

要查看或添加评论，请登录

Akash K.的更多文章

FineTuning BERT- Named Entity Recognition - Bidirectional Encoders Representation of Transformers - Part 4

2024年9月27日

FineTuning BERT- Named Entity Recognition - Bidirectional Encoders Representation of Transformers - Part 4

In this article we will delve into the implementation of fine-tuning BERT with PyTorch through code. Before you move…

16 条评论
Building BERT From Scratch - Bidirectional Encoders Representation of Transformers - Part 3

2024年7月19日

Building BERT From Scratch - Bidirectional Encoders Representation of Transformers - Part 3

In this article we will delve into the implementation of BERT with PyTorch through code. Before you move ahead it is…

3 条评论
Fundamentals of RAG - Retrieval Augmented Generation - Part 1

2024年6月16日

Fundamentals of RAG - Retrieval Augmented Generation - Part 1

Retrieval Augmented Generation (RAG) is an innovative approach that combines the power of retrieval-based models and…

4 条评论
Fundamentals of BERT- Bidirectional Encoders Representations from Transformers, Part-2

2024年6月12日

Fundamentals of BERT- Bidirectional Encoders Representations from Transformers, Part-2

In this article we will explore the concepts of Fine-Tuning. Before you move ahead it is advisable to read Fundamentals…

2 条评论
Fundamentals of BERT - Bidirectional Encoders Representations from Transformers, Part-1

2024年6月11日

Fundamentals of BERT - Bidirectional Encoders Representations from Transformers, Part-1

In this article we will explore the fundamental concepts of BERT. Before we deep delve into this further it is highly…

1 条评论
Symmetric Quantization - Quantization of LLMs, Part-4

2024年6月5日

Symmetric Quantization - Quantization of LLMs, Part-4

In part-3, we explored the concept of Affine Quantization. In this part we will focus on Symmetric Quantization.

1 条评论
Fundamentals of Quantization - Quantization of LLMs, Part-3

2024年5月27日

Fundamentals of Quantization - Quantization of LLMs, Part-3

In the first part (part-1), we observed that the majority of central processing units (CPUs) employ the 2’s complement…
BigNum Arithmetic - Quantization of LLMs, Part-2

2024年5月15日

BigNum Arithmetic - Quantization of LLMs, Part-2

We saw in part-1, that most central processing units (CPUs) utilize the 2’s complement to represent integers. In this…

1 条评论
Number System - Quantization of LLMs, Part-1

2024年5月14日

Number System - Quantization of LLMs, Part-1

Large Language Models (LLMs) have significantly advanced in recent years, becoming increasingly user-friendly and…
METEOR - Evaluation of Large Language Models Part-4a

2024年5月7日

METEOR - Evaluation of Large Language Models Part-4a

In 2005, Alon Lavie and Satanjeev Banerjee created METEOR with the goal of surpassing BLEU and ROUGE through the…

2 条评论

See all articles

Topic Modeling for Research Articles

Problem Statement

About Kaggle Dataset

Modeling Strategy

A. Setting up the environment and importing Python libraries

B. Set Device - Check if MPS (Metal Performance Shaders) is available

C. Reading Dataset

D. Preprocessing Data

1. Raw Statistics of the Training Data + Testing Data

2. Analyze the raw training dataset to perform some basic necessary checks to validate the data

3. Set the target Columns

4. Convert Latex encoded text into plain text + replace newline character by a space

5. Set the feature 'text' column

6. Further preprocess the text using the following function to make the text cleaner

7. Drop the unwanted columns

8. Create additional columns to get an idea of sentence length, Bert tokens length

9. Lets see the distributions of number of words, number of tokens and their difference

10. Drop the newly created columns for lengths

11. Compute label Maps

领英推荐

E. Explore distributions of Labels

1. Create a new dataframe to explore distributions of labels

2. Descriptive Statistics of each label

3. Frequency of each label

F. Stratified (or Balanced) Split

1. Helper Functions to Create Balanced Split function

2. Spliting the train data into training + validation data

G. Explore Distribution of Labels - Training, Validation

1. Check distributions of 1st order

2. Check distributions of 2nd order

3. Check distribution of 3rd order

H. Modelling

1. Modelling Parameters

2. Preparing the dataset and dataloader

3. Create Model

4. Set the Optimizer

5. Set the Scheduler

6. Create the training loop

7. Training Phase

8. Draw the graph for training loss to see its progress after every epoch

I. Evaluation

1. Create the Evaluation Function

2. Check Performance of model on Validation Dataset

3. Classification report

4. Compute Classification report at label level

5. Confusion Matrix

J. Helper Functions to make predictions

K. Inference

L. Saving And loading Utilities for the Model and Tokenizer

Akash K.的更多文章

FineTuning BERT- Named Entity Recognition - Bidirectional Encoders Representation of Transformers - Part 4

Building BERT From Scratch - Bidirectional Encoders Representation of Transformers - Part 3

Fundamentals of RAG - Retrieval Augmented Generation - Part 1

Fundamentals of BERT- Bidirectional Encoders Representations from Transformers, Part-2

Fundamentals of BERT - Bidirectional Encoders Representations from Transformers, Part-1

Symmetric Quantization - Quantization of LLMs, Part-4

Fundamentals of Quantization - Quantization of LLMs, Part-3

BigNum Arithmetic - Quantization of LLMs, Part-2

Number System - Quantization of LLMs, Part-1

METEOR - Evaluation of Large Language Models Part-4a

社区洞察

其他会员也浏览了

BASIC MATHEMATICS FOR DEEP LEARNING (AI) PART 2 - FUNCTION, GRAPHS, LINEAR EQUATIONS:

24 Algorithms & Data Structures you need to know for Finance

How to build Image Classifier from scratch using Python and TensorFlow

Reasons Why You Will Need Linear Algebra as a Data Scientist

Performance Improvement in Numpy 2.x

Object Fractal Dimension

Finding Relevance in Object-Oriented Abstraction of the Real World and Mathematical Models in Machine Learning

#3. Math for ML Part 1: Linear Algebra

Implementing LSTM with TensorFlow and Python

Mastering Model Fine-Tuning in PyTorch: A Comprehensive Guide