Fundamentals of BERT - Bidirectional Encoder Representations from Transformers, Part-1
In this article we will explore the fundamental concepts of BERT. Before we delve deeper, it is highly advisable to read Visualization of Mathematical Engineering of Transformers Part-1 and Part-2 to get a better sense of the basics of Transformers.
What is a Language Model?
A language model is a probabilistic model that assigns probabilities to sequences of words. In practice, a language model allows us to compute the following:
A language model allows us to predict the probability of the next word in a sequence: for example, the probability of the word "India" following the sequence "Delhi is a city in".
This is exactly the kind of probability that we model with a language model, which in practice is a Neural Network trained on a large corpus of text.
A Neural Network trained on a very large corpus of text is known as a Large Language Model (LLM).
Why do we use vectors to represent words?
Given the words "apple", "bits" and "information", if we represent their embedding vectors using only 2 dimensions (X, Y) and plot them, we hope to see something like this: the angle between words with similar meaning is small, while the angle between words with different meaning is large. In other words, the embeddings "capture" the meaning of the words they represent by projecting them into a high-dimensional space of size d_model (the embedding size).
For example, the words "bits" and "information" will point in similar directions in space, as they capture the same kind of semantic meaning, and we can measure their similarity via the angle between them. Hence, the angle between "bits" and "information" is very small, while the angle between "apple" and "bits" is very large, because the words belong to different semantic groups and therefore have different meanings.
Imagine there is another word, "mango". It will point in a direction very similar to that of "apple", so the angle between "apple" and "mango" will be very small, as both fall into the same semantic group - fruits.
To measure the angle between two vectors we commonly use the cosine similarity, which is their dot product divided by the product of their norms (i.e. the cosine of the angle between them).
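To make this concrete, here is a minimal sketch (using made-up 2-D vectors, not real embeddings) of how cosine similarity compares word vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 2-D embeddings, invented purely for illustration.
apple       = np.array([0.9, 0.1])
mango       = np.array([0.8, 0.2])
bits        = np.array([0.1, 0.9])
information = np.array([0.2, 0.8])

print(cosine_similarity(apple, mango))        # close to 1: small angle, same semantic group
print(cosine_similarity(bits, information))   # close to 1: small angle, same semantic group
print(cosine_similarity(apple, bits))         # much smaller: large angle, different meaning
```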
Let us now review the self-attention mechanism and the causal mask.
Self-Attention Mechanism
Self-Attention allows the model to relate every word with every other word in the sentence. In the vanilla Transformer, d_model = 512 (the embedding size); in the single-head setting used for illustration here, d_k = d_model = 512 (with multi-head attention, d_k = d_model / h).
The Self-Attention is computed using the following formula:
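For reference, the scaled dot-product attention from the original Transformer paper ("Attention is all you need") is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where Q, K and V are the query, key and value matrices and d_k is the dimension of the keys.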
In order to compute self-attention we need to perform the following three steps:
Let us see how each of the three steps is computed (a code sketch follows the three steps below).
Consider an input sentence - the sun is very bright
During preprocessing, we generally add a special token [SOS] to indicate the start of sentence. Hence,
Tokenized Input Sentence - [SOS], the, sun, is, very, bright
1. Computation of Scaled Dot Product
2. Apply Softmax to Scaled Dot Product
3. Multiply soft scaled dot product with values V
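The following is a minimal NumPy sketch of the three steps above (toy dimensions, a single head, and no learned Q/K/V projections - purely to illustrate the computation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scaled = Q @ K.T / np.sqrt(d_k)        # Step 1: scaled dot product of queries and keys
    weights = softmax(scaled, axis=-1)     # Step 2: softmax -> "soft scaled dot product"
    return weights @ V, weights            # Step 3: weighted sum of the values

# Toy example: 6 tokens ([SOS], the, sun, is, very, bright), embedding size 8.
np.random.seed(0)
X = np.random.randn(6, 8)                  # stand-in for the token embeddings
output, weights = self_attention(X, X, X)  # here Q = K = V since we skip the projections
print(weights.shape)                       # (6, 6): every token attends to every other token
```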
Self-Attention Mechanism - the reason behind the causal mask
As we saw earlier, a language model is a probabilistic model that assigns probabilities to sequences of words, and it allows us to compute the following:
i.e. we want to compute the probability of the word "India" being the next word in the sequence - "Delhi is a city in".
To model the probability distribution above, each word should only depend on words that come before it (left context).
So we want to condition the word "India" only on the words that come before it, i.e. "Delhi is a city in". Our model should only be able to look at the left part of the sentence (also called the left context) to predict the next token. We achieve this by introducing a causal mask.
Self-Attention Mechanism - Causal Mask
For every word or token we want zero interaction with the corresponding future words (the right context), i.e. the next word should depend only on its left context and not on the words in its right context. Hence the values in the upper triangular part of the soft scaled dot product matrix must be zero.
The following figure clearly illustrates the scenario:
The above soft scaled dot product can be achieved if we replace the values of the upper triangular part of the scaled dot product matrix with −∞ before applying the softmax operation.
The following figure clearly illustrates the scenario:
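To complement the figure, here is a small NumPy sketch of how the causal mask is applied to the scaled dot product before the softmax (toy values only):

```python
import numpy as np

seq_len = 6
scaled = np.random.randn(seq_len, seq_len)   # stand-in for Q K^T / sqrt(d_k)

# Causal mask: positions above the diagonal (future tokens) are set to -inf,
# so their attention weights become exactly zero after the softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scaled[mask] = -np.inf

weights = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                  # upper triangle is all zeros: no right context
```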
But this is not what happens in ordinary Self-Attention, where tokens that come in the future are allowed to relate to tokens that come in the past (and vice versa).
We will see later that in BERT we make use of both the left and the right context.
Let us now explore the BERT model.
BERT
BERT stands for Bidirectional Encoder Representations from Transformers.
Developed in 2018 by Google researchers, BERT is one of the first LLMs. The introduction of BERT brought notable progress in the realm of encoder-only Transformer architectures. This particular architecture solely comprises multiple layers of bidirectional self-attention and a feed-forward transformation, each accompanied by a residual connection and layer normalization.
The goal of any given NLP technique is to understand human language as it is spoken naturally. BERT's primary technical advancement involves the utilization of bidirectional training from the Transformer, a widely used attention model, for language modeling.
Unlike directional models that read the text input in a sequential manner, either from left-to-right or right-to-left, the Transformer encoder takes in the entire sequence of words at once. Hence, it is classified as bidirectional, although it would be more precise to describe it as non-directional. This attribute enables the model to comprehend the context of a word by considering all the words surrounding it, both to the left and right.
Conventional language models are typically directional models and hence they analyze text in a linear fashion, moving either from left to right or right to left. This approach restricts the model's understanding to only the nearby context before the specific word being examined. In contrast, BERT employs a bidirectional strategy that takes into account both the preceding and following context of words within a sentence. Rather than processing text in a sequential manner, BERT simultaneously considers all words in a sentence.
By examining the entire sentence as a whole rather than analyzing it sequentially, BERT is able to gain a comprehensive understanding of all the words in the sentence.
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT's architecture is made up of layers of encoders of the Transformer model.
BERT was introduced with two pre-trained models: BERT-Base (12 encoder layers, hidden size 768, 12 attention heads, ~110M parameters) and BERT-Large (24 encoder layers, hidden size 1024, 16 attention heads, ~340M parameters).
Differences from the vanilla Transformer: BERT is an encoder-only model with more layers (12 or 24 instead of 6), a larger hidden size (768 or 1024 instead of 512), more attention heads (12 or 16 instead of 8), learned positional embeddings instead of the fixed sinusoidal ones, and a different tokenizer.
For tokenization, BERT uses the WordPiece tokenizer, which allows sub-word tokens. The vocabulary size is approximately 30,000 tokens.
WordPiece is a subword-based tokenization algorithm. It was first outlined in the paper “Japanese and Korean Voice Search (Schuster et al., 2012)”. The algorithm gained popularity through the famous state-of-the-art model BERT.
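A quick way to see WordPiece in action is shown below (a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available):

```python
# pip install transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("The sun is very bright"))
# Common words stay whole; rarer words are split into sub-word pieces prefixed with "##"
# (for example, a word like "brightness" may be split as ["bright", "##ness"]).
print(tokenizer.vocab_size)   # roughly 30,000 tokens for this checkpoint
```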
BERT's framework involves a two-step process: pre-training and fine-tuning.
Pre-Training
Unlike conventional language models that analyze text sequentially, moving either from left to right or right to left, BERT is pre-trained using the following two unsupervised tasks:
Before we delve deeper into the concept of Masked Language Modeling, let us explore the importance of the left and the right context.
Bidirectional Importance
Traditional language models typically process text in a linear manner, either from left to right or right to left. However, this sequential approach restricts the model's understanding to only the immediate context before the target word. In contrast, BERT employs a bidirectional approach that takes into account both the left and right context of words within a sentence. By examining the entire sentence as a whole rather than analyzing it sequentially, BERT is able to gain a comprehensive understanding of all the words in the sentence.
Importance of the Left Context
Let us analyze a telephone conversation between an operator and a user to understand the importance of the left context.
It is evident that the Operator formulates responses based on the User's input. The User initiates the conversation, prompting a reply from the Operator, which in turn leads to a response from the User based on the Operator's previous statement. This process is known as utilizing the left context.
Importance of the Right Context
Consider a child who accidentally damages his teacher's cherished pen and is reluctant to confess the truth, opting instead to fabricate a lie.
As evident, the child bases the fabrication of his lie on what he intends to express subsequently. Regardless of the falsehood he fabricates, it will always be contingent upon the outcome he desires to reach (the pen being broken). This gives an intuition of how we humans use the right context.
1. Masked Language Modeling (MLM)
It is logical to assume that a deep bidirectional model is inherently more capable than either a left-to-right model or the shallow combination of a left-to-right and a right-to-left model.
Typically, a language model is built by training it on an auxiliary task that helps the model develop a contextual understanding of words. Frequently, these tasks involve predicting the next word, or words in close proximity to one another. However, these training techniques cannot be applied directly to bidirectional models, because they would inadvertently allow each word to indirectly "see itself": when the model conditions on both directions at once, the word to be predicted is already visible through the surrounding context.
In this scenario, the model can trivially predict the target word. Moreover, there is no assurance that such a model has grasped the contextual meaning of the words, rather than simply exploiting this shortcut to optimize its predictions.
So how does BERT manage to pre-train bidirectionally?
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random (i.e. replace them with the [MASK] token), and then predict only these masked tokens - not the entire input sequence. We refer to this procedure as Masked Language Modeling (MLM).
It is also known as the Cloze task. It means that randomly selected words in a sentence are masked, and the model must predict the correct word given the left and right context.
For example, consider an input sentence - "Delhi, being the capital of India, is home to numerous government offices". Assuming that the randomly selected word is "capital", we will proceed to mask it. The following figure clearly illustrates the masking:
BERT’s job is to figure out what these masked words are by looking at the words around them. It’s like a game of guessing where some words are missing, and BERT tries to fill in the blanks.
BERT adds a special layer on top of its learning system to make these guesses. It then checks how close its guesses are to the actual hidden words. It does this by converting its guesses into probabilities.
How does the BERT model compute these predicted probabilities?
In technical terms, the prediction of the output words requires: adding a classification layer on top of the encoder output, multiplying the output vectors by the embedding matrix to transform them into the vocabulary dimension, and calculating the probability of each word in the vocabulary with softmax.
The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked tokens.
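The sketch below (PyTorch, with made-up tensors and a hypothetical token id) shows one common way to express this: the labels at non-masked positions are set to -100 so that the cross-entropy loss ignores them.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30000, 14
logits = torch.randn(seq_len, vocab_size)   # stand-in for BERT's per-token prediction scores

# Labels: -100 everywhere except the masked position, where we put the true token id.
masked_position = 3
labels = torch.full((seq_len,), -100)
labels[masked_position] = 2154              # hypothetical vocabulary id of "capital"

# ignore_index=-100 makes cross entropy skip every non-masked position.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)
```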
Left and Right context in BERT
Because BERT uses bidirectional self-attention (instead of masked self-attention), the model can look at the entire sequence, both before and after the masked token, to make a prediction.
Let's see how BERT uses the left and the right context.
In contrast to what we saw earlier, the attention computation here does not involve a causal mask to eliminate the interaction of words with subsequent words. In masked self-attention we replaced the values in the upper triangular part of the scaled dot product matrix with minus infinity to prevent interactions with future words; in BERT, however, we do not employ any causal mask. This implies that each token in a sentence attends both to the tokens on its left and to the tokens on its right.
MLM - Twisted Masking Procedure of Input Tokens
The key element to achieving bidirectional learning in BERT is the attention mechanism (the core mechanism of every Transformer-based LLM) combined with masked language modeling (MLM). By masking a word in a sentence, this technique forces the model to analyze the remaining words in both directions of the sentence to increase its chances of predicting the masked word correctly.
Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. This is mitigated by a subtle twist in how we mask the input tokens.
Approximately 15% of the tokens are masked at random during pre-training, but not all of the selected tokens are replaced by the [MASK] token.
If the i-th token is chosen, it is replaced with the [MASK] token 80% of the time, with a random token 10% of the time, and left unchanged the remaining 10% of the time.
Let's review an example to get a better sense of the masking procedure described above.
Consider an input sentence - "Delhi, being the capital of India, is home to numerous government offices."
Assuming that the randomly selected token to be masked is "capital", the following figure clearly illustrates the masking process:
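A simplified sketch of this 80/10/10 masking procedure is given below (illustration only; a real implementation works on WordPiece token ids rather than whole words):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """For each selected token: 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                  # this position will be predicted
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)              # kept unchanged, but still predicted
        else:
            labels.append(None)                 # not selected: no prediction target
            masked.append(tok)
    return masked, labels

tokens = "Delhi being the capital of India is home to numerous government offices".split()
print(mask_tokens(tokens, vocab=tokens))
```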
Let us now visualize the whole training process of Masked Language Modeling (MLM) with an example.
MLM - Training Visualization
Input Sentence - "Delhi, being the capital of India, is home to numerous government offices."
Assuming the randomly selected word to be masked is "capital" as before, this leads to a masked input sequence of 14 tokens that is fed into BERT. Since BERT is a Transformer model, an input sequence of 14 tokens produces an output sequence of 14 tokens.
BERT will specifically focus on the masked position, which is the 4th position (i.e. token-3, with index = 3). Subsequently, we will compare token-3 (at the 4th position) with the target token, "capital", and calculate the loss. Finally, we will run the back-propagation process to update the weights.
Essentially, our objective is for the resulting token, 'token-3' to represent 'capital'.
Let us now explore the second task of the pre-training procedure - Next Sentence Prediction.
2. Next Sentence Prediction (NSP)
Numerous crucial downstream tasks, like Question Answering (QA) and Natural Language Inference (NLI), rely on comprehending the connection between two sentences, a factor not explicitly addressed by language modeling.
For example, in Question Answering, the ability to accurately determine how a question relates to a given passage of text is essential for providing a relevant and accurate answer. Similarly, in Natural Language Inference, the ability to discern the logical relationship between two sentences, such as whether one sentence contradicts, entails, or is neutral with respect to the other, is crucial for making accurate inferences and drawing conclusions.
In order to train a model that understands sentence relationships, we pre-train for a Binarized Next Sentence Prediction task that can be trivially generated from any monolingual corpus.
In this Next Sentence prediction pre-training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document.
In this task, two sentences - A and B - are chosen for every pre-training instance as follows: 50% of the time, B is the actual sentence that follows A in the corpus (labeled IsNext); the other 50% of the time, B is a random sentence from the corpus (labeled NotNext).
The assumption is that the random sentence will be disconnected from the first sentence.
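A minimal sketch of how such pre-training instances can be constructed (toy corpus and simplified sampling; the real pipeline draws the random sentence from a different document):

```python
import random

def make_nsp_example(document, index):
    """Build one (sentence A, sentence B, label) pair for Next Sentence Prediction."""
    sentence_a = document[index]
    if random.random() < 0.5:
        sentence_b = document[index + 1]        # the actual next sentence
        label = "IsNext"
    else:
        sentence_b = random.choice(document)    # a random sentence (ideally from another document)
        label = "NotNext"
    return sentence_a, sentence_b, label

document = [
    "Delhi is the capital of India.",
    "It is home to numerous government offices.",
    "The sun is very bright today.",
]
print(make_nsp_example(document, index=0))
```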
We will analyze a very simple example in order to get a feel for this task. The example is simple in the sense that, purely for illustration, we will not use the special tokens.
Let's say we have a corpus consisting of the famous nursery rhyme "Jack and Jill", as follows:
Now let us construct two pre-training instances (for NSP task) as follows:
Let us feed the two pre-training instances into the BERT model and check the predictions.
Now, a natural question to ask is: how does the BERT model predict these label probabilities? To answer it, we first need to understand the clever input representation used by BERT.
BERT- Input Representation
In order to enable BERT to effectively handle a range of downstream tasks, its input representation can unambiguously represent either a single sentence or a pair of sentences within a single input token sequence.
A "sentence" may consist of any contiguous span of text, regardless of whether it forms a proper linguistic sentence.
A "sequence" denotes the input token sequence given to BERT, which could consist of either a single sentence or two sentences combined.
There are two problems in the BERT's input:
Problem-1: In contrast to RNNs, where inputs are fed sequentially, all the inputs are simultaneously fed in one step in this model. However, the model is unable to retain the ordering of the input tokens. It is important to note that the order of words holds significance in every language, both in terms of meaning and syntax.
Problem-2: To effectively carry out the Next Sentence Prediction task, it is essential to differentiate between sentences A and B.
The solution to both of these problems involves incorporating embeddings that include the necessary information into our original tokens, then utilizing the outcome as the input for our BERT model.
The following embeddings are added to the token embeddings: segment (sentence) embeddings, which indicate whether a token belongs to sentence A or sentence B, and position embeddings, which encode the position of each token in the sequence.
To help the model distinguish between the two sentences during training, the input is processed in the following way before entering the model: a [CLS] token is inserted at the beginning of the first sentence and a [SEP] token at the end of each sentence; a segment embedding indicating Sentence A or Sentence B is added to each token; and a position embedding is added to each token.
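The sketch below (PyTorch, with made-up token ids and hypothetical sizes) shows how the three embeddings are combined by simple addition:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768

token_emb    = nn.Embedding(vocab_size, d_model)   # one vector per WordPiece token
segment_emb  = nn.Embedding(2, d_model)            # sentence A -> 0, sentence B -> 1
position_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings

# Example: "[CLS] delhi is a city [SEP] it is in india [SEP]"  (the ids are made up)
token_ids   = torch.tensor([[101, 5, 9, 4, 88, 102, 17, 9, 12, 903, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

# The three embeddings are summed to form the input to the first encoder layer.
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)   # torch.Size([1, 11, 768])
```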
Let us now see a few examples, including the use of the special tokens [CLS] and [SEP].
NSP - Examples
The next sentence prediction task can be illustrated in the following examples.
How does the BERT model predict whether the second sentence (B) follows the first sentence (A), along with the corresponding probabilities?
To predict whether the second sentence is indeed connected to the first, the following steps are performed: the entire input sequence goes through the Transformer encoder; the final state corresponding to the [CLS] token is passed through a classification layer with two outputs (IsNext and NotNext); and softmax is applied to obtain the label probabilities.
Let us now visualize the training procedure of Next Sentence Prediction with the help of an example.
NSP - Training Visualization
Input Sentence - "The man went to the store. He bought gallon of milk."
Assuming the randomly selected word to be masked is "the" in the first sentence - "The man went to the store." and word "of" in the second sentence - "He bought gallon of milk."
After pre-processing, the input sequence including special tokens will be the following:
Input Sequence - [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
The input sequence is now fed into BERT. Since BERT is a Transformer model, an input sequence of 15 tokens produces an output sequence of 15 tokens.
We will only consider the first token of the output, which is token-0, corresponding to the first input [CLS] token. Token-0 will then be fed into a linear layer with two output features, IsNext and NotNext, followed by the application of softmax. We will compare token-0 to the target IsNext, expecting BERT to predict IsNext as the pair of sentences inputted are connected.
Next, we compute the loss using cross entropy and proceed with the back-propagation process to update the weights. This is how BERT is trained on the Next Sentence Prediction task.
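A minimal PyTorch sketch of this step (random tensors standing in for BERT's final states):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
nsp_head = nn.Linear(d_model, 2)             # two output features: IsNext / NotNext

final_states = torch.randn(1, 15, d_model)   # stand-in for BERT's output sequence
cls_state = final_states[:, 0, :]            # token-0, corresponding to [CLS]

logits = nsp_head(cls_state)
probs = F.softmax(logits, dim=-1)            # probabilities for IsNext / NotNext

target = torch.tensor([0])                   # 0 = IsNext: the two sentences are connected
loss = F.cross_entropy(logits, target)       # cross entropy takes the raw logits
loss.backward()                              # back-propagation updates the head's weights
```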
Importance of [CLS] token in BERT
Let's review how the [CLS] token works.
The [CLS] token always interacts with all the other tokens because we do not apply any causal mask. We can therefore consider the [CLS] token as one that captures information from all the other tokens, since the attention matrix has no causal masking applied before the softmax. All of the attention values are therefore actually learned by the model, and this is the idea behind the [CLS] token.
If we perform matrix multiplication between the Soft Scaled Dot product and the Values V matrix, we will obtain the Attention Output matrix. However, as there are no zero values in the Soft Scaled Dot Product Matrix, the CLS token in the first row of the Attention Output matrix will have access to the attention scores of all the tokens. In essence, the [CLS] token in the attention output matrix (corresponding to the first row) will consolidate all the attention scores and relationships with all the tokens.
CEO - Investor Analogy
The [CLS] token can be thought of as a chief executive officer (CEO) in a company, with you playing the role of an investor. As an investor, you don't seek information from individual employees; instead, you approach the CEO. Similarly, the [CLS] token is responsible for gathering the required information from every word of the sentence in order to achieve the desired outcome. It serves as the aggregator of all the information within the sentence, enabling us to classify it. Hence the name [CLS] (classification token).
BERT Pre-Training
When training the BERT model, both Masked LM and Next Sentence Prediction tasks are trained simultaneously with the goal of minimizing the combined loss function of the two tasks.
This results in a language model that possesses improved capabilities in comprehending the context within sentences and the connections between them. This approach ultimately leads to the development of a robust language model.
But how does one predict output for two different tasks simultaneously?
BERT Output
How can the output for two distinct tasks be predicted simultaneously?
To obtain the solution, one can utilize a distinct FFNN + Softmax layer constructed on the basis of the outputs derived from the last encoder, which correspond to the desired input tokens. The outputs from the final encoder will be referred to as the final states.
The first input token is always a special classification [CLS] token. The final state associated with this token is utilized as the comprehensive (or aggregate) sequence representation for classification assignments and is employed for the Next Sentence Prediction. In this prediction, it is inputted into a FFNN + Softmax layer that calculates probabilities for the labels "IsNext" or "NotNext".
The final states corresponding to the [MASK] tokens are fed into a FFNN + Softmax layer to predict the masked words.
Note: Here FFNN is an acronym for Feed Forward Neural Network.
To be continued in Part-2, where we will explore the concept of Fine-Tuning.
References:
- Vaswani et al. (2017), Attention is all you need