Fundamentals of BERT - Bidirectional Encoder Representations from Transformers, Part-1
In this article we will explore the fundamental concepts of BERT. Before we delve deeper, it is highly advisable to read Visualization of Mathematical Engineering of Transformers Part-1 and Part-2 to get a better sense of the basics of Transformers.
What is a Language Model?
A language model is a probabilistic model that assigns probabilities to sequences of words. In practice, a language model allows us to compute the following:
A language model allows us to predict the probability of the next word in a sequence: for example, the probability of the word "India" following the sequence "Delhi is a city in".
This is exactly the kind of probability that we model with a language model, which in practice is a Neural Network trained on a large corpus of text.
A Neural Network trained on a very large corpus of text is known as a Large Language Model (LLM).
Why do we use vectors to represent words?
Given the words "apple", "bits" and "information", if we represent their embedding vectors using only 2 dimensions (X, Y) and plot them, we hope to see something like this: the angle between words with similar meaning is small, while the angle between words with different meaning is large. In other words, the embeddings "capture" the meaning of the words they represent by projecting them into a high-dimensional space of size d_model (the embedding size).
For example, the words "bits" and "information" will point in similar directions in space, as they capture the same kind of semantic meaning, and we can measure their similarity via the angle between them. Hence, the angle between "bits" and "information" is very small, while the angle between "apple" and "bits" is very large, because the words belong to different semantic groups and therefore have different meanings.
Imagine there is another word, "mango". It will point in a direction very similar to that of "apple", so the angle between "apple" and "mango" will be very small, as both fall into the same semantic group - fruits.
To measure the angle between two vectors we commonly use the cosine similarity, which is their dot product divided by the product of their norms (i.e. the cosine of the angle between them).
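To make this concrete, here is a minimal sketch (using made-up 2-D vectors, not real embeddings) of how cosine similarity compares word vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 2-D embeddings, invented purely for illustration.
apple       = np.array([0.9, 0.1])
mango       = np.array([0.8, 0.2])
bits        = np.array([0.1, 0.9])
information = np.array([0.2, 0.8])

print(cosine_similarity(apple, mango))        # close to 1: small angle, same semantic group
print(cosine_similarity(bits, information))   # close to 1: small angle, same semantic group
print(cosine_similarity(apple, bits))         # much smaller: large angle, different meaning
```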
Let us now review the self-attention mechanism and the causal mask.
Self-Attention Mechanism
Self-Attention allows the model to relate every word with every other word in the sentence. In the vanilla Transformer, d_model = 512 (the embedding size); in the single-head setting used for illustration here, d_k = d_model = 512 (with multi-head attention, d_k = d_model / h).
The Self-Attention is computed using the following formula:
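For reference, the scaled dot-product attention from the original Transformer paper ("Attention is all you need") is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where Q, K and V are the query, key and value matrices and d_k is the dimension of the keys.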
In order to compute self-attention we need to perform the following three steps:
Let us see how each of the three steps is computed (a code sketch follows the three steps below).
Consider an input sentence - the sun is very bright
During preprocessing, we generally add a special token [SOS] to indicate the start of sentence. Hence,
Tokenized Input Sentence - [SOS], the, sun, is, very, bright
1. Computation of Scaled Dot Product
2. Apply Softmax to Scaled Dot Product
3. Multiply soft scaled dot product with values V
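The following is a minimal NumPy sketch of the three steps above (toy dimensions, a single head, and no learned Q/K/V projections - purely to illustrate the computation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scaled = Q @ K.T / np.sqrt(d_k)        # Step 1: scaled dot product of queries and keys
    weights = softmax(scaled, axis=-1)     # Step 2: softmax -> "soft scaled dot product"
    return weights @ V, weights            # Step 3: weighted sum of the values

# Toy example: 6 tokens ([SOS], the, sun, is, very, bright), embedding size 8.
np.random.seed(0)
X = np.random.randn(6, 8)                  # stand-in for the token embeddings
output, weights = self_attention(X, X, X)  # here Q = K = V since we skip the projections
print(weights.shape)                       # (6, 6): every token attends to every other token
```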
Self-Attention Mechanism - the reason behind the causal mask
As we saw earlier, a language model is a probabilistic model that assigns probabilities to sequences of words, and it allows us to compute the following:
i.e. we want to compute the probability of the word "India" being the next word in the sequence - "Delhi is a city in".
To model the probability distribution above, each word should only depend on words that come before it (left context).
So we want to condition the word "India" only on the words that come before it, i.e. "Delhi is a city in". Our model should only be able to look at the left part of the sentence (also called the left context) to predict the next token. We achieve this by introducing a causal mask.
Self-Attention Mechanism - Causal Mask
For every word or token we want zero interaction with the corresponding future words (the right context), i.e. the next word should depend only on its left context and not on the words in its right context. Hence the values in the upper triangular part of the soft scaled dot product matrix must be zero.
The following figure clearly illustrates the scenario:
The above soft scaled dot product can be achieved if we replace the values of the upper triangular part of the scaled dot product matrix with −∞ before applying the softmax operation.
The following figure clearly illustrates the scenario:
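To complement the figure, here is a small NumPy sketch of how the causal mask is applied to the scaled dot product before the softmax (toy values only):

```python
import numpy as np

seq_len = 6
scaled = np.random.randn(seq_len, seq_len)   # stand-in for Q K^T / sqrt(d_k)

# Causal mask: positions above the diagonal (future tokens) are set to -inf,
# so their attention weights become exactly zero after the softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scaled[mask] = -np.inf

weights = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                  # upper triangle is all zeros: no right context
```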
But this is not what happens in ordinary Self-Attention, where tokens that come in the future are allowed to relate to tokens that come in the past (and vice versa).
We will see later that in BERT we make use of both the left and the right context.
Let us now explore the BERT model.
BERT
BERT stands for Bidirectional Encoder Representations from Transformers.
Developed in 2018 by Google researchers, BERT is one of the first LLMs. The introduction of BERT brought notable progress in the realm of encoder-only Transformer architectures. This particular architecture solely comprises multiple layers of bidirectional self-attention and a feed-forward transformation, each accompanied by a residual connection and layer normalization.
The goal of any given NLP technique is to understand human language as it is spoken naturally. BERT's primary technical advancement involves the utilization of bidirectional training from the Transformer, a widely used attention model, for language modeling.
Unlike directional models that read the text input in a sequential manner, either from left-to-right or right-to-left, the Transformer encoder takes in the entire sequence of words at once. Hence, it is classified as bidirectional, although it would be more precise to describe it as non-directional. This attribute enables the model to comprehend the context of a word by considering all the words surrounding it, both to the left and right.
Conventional language models are typically directional models and hence they analyze text in a linear fashion, moving either from left to right or right to left. This approach restricts the model's understanding to only the nearby context before the specific word being examined. In contrast, BERT employs a bidirectional strategy that takes into account both the preceding and following context of words within a sentence. Rather than processing text in a sequential manner, BERT simultaneously considers all words in a sentence.
By examining the entire sentence as a whole rather than analyzing it sequentially, BERT is able to gain a comprehensive understanding of all the words in the sentence.
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT's architecture is made up of layers of encoders of the Transformer model.
BERT was introduced with two pre-trained models: BERT-Base (12 encoder layers, hidden size 768, 12 attention heads, ~110M parameters) and BERT-Large (24 encoder layers, hidden size 1024, 16 attention heads, ~340M parameters).
Differences from the vanilla Transformer: BERT is an encoder-only model with more layers (12 or 24 instead of 6), a larger hidden size (768 or 1024 instead of 512), more attention heads (12 or 16 instead of 8), learned positional embeddings instead of the fixed sinusoidal ones, and a different tokenizer.
For tokenization, BERT uses the WordPiece tokenizer, which allows sub-word tokens. The vocabulary size is approximately 30,000 tokens.
WordPiece is a subword-based tokenization algorithm. It was first outlined in the paper “Japanese and Korean Voice Search (Schuster et al., 2012)”. The algorithm gained popularity through the famous state-of-the-art model BERT.
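A quick way to see WordPiece in action is shown below (a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available):

```python
# pip install transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("The sun is very bright"))
# Common words stay whole; rarer words are split into sub-word pieces prefixed with "##"
# (for example, a word like "brightness" may be split as ["bright", "##ness"]).
print(tokenizer.vocab_size)   # roughly 30,000 tokens for this checkpoint
```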
BERT's framework involves a two-step process: pre-training and fine-tuning.
Pre-Training
Unlike conventional language models that analyze text sequentially, moving either from left to right or right to left, BERT is pre-trained using the following two unsupervised tasks:
Before we delve deeper into the concept of Masked Language Modeling, let us explore the importance of the left and the right context.
Bidirectional Importance
Traditional language models typically process text in a linear manner, either from left to right or right to left. However, this sequential approach restricts the model's understanding to only the immediate context before the target word. In contrast, BERT employs a bidirectional approach that takes into account both the left and right context of words within a sentence. By examining the entire sentence as a whole rather than analyzing it sequentially, BERT is able to gain a comprehensive understanding of all the words in the sentence.
Importance of the Left Context
Let us analyze a telephone conversation between an operator and a user to understand the importance of the left context.
It is evident that the Operator formulates responses based on the User's input. The User initiates the conversation, prompting a reply from the Operator, which in turn leads to a response from the User based on the Operator's previous statement. This process is known as utilizing the left context.
Importance of the Right Context
Consider a child who accidentally damages his teacher's cherished pen and is reluctant to confess the truth, opting instead to fabricate a lie.
As evident, the child bases the fabrication of his lie on what he intends to express subsequently. Regardless of the falsehood he fabricates, it will always be contingent upon the outcome he desires to reach (the pen being broken). This gives an intuition of how we humans use the right context.
1. Masked Language Modeling (MLM)
It is logical to assume that a deep bidirectional model is inherently more capable than either a left-to-right model or the shallow combination of a left-to-right and a right-to-left model.
Typically, a language model is built by training it on an auxiliary task that helps the model develop a contextual understanding of words. Frequently, these tasks involve predicting the next word, or words in close proximity to one another. However, these training techniques cannot be applied directly to bidirectional models, because they would inadvertently allow each word to indirectly "see itself": when the model conditions on both directions at once, the word to be predicted is already visible through the surrounding context.
In this scenario, the model can trivially predict the target word. Moreover, there is no assurance that such a model has grasped the contextual meaning of the words, rather than simply exploiting this shortcut to optimize its predictions.
So how does BERT manage to pre-train bidirectionally?
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random (i.e. replace them with the [MASK] token), and then predict only these masked tokens - not the entire input sequence. We refer to this procedure as Masked Language Modeling (MLM).
It is also known as the Cloze task. It means that randomly selected words in a sentence are masked, and the model must predict the correct word given the left and right context.
For example, consider an input sentence - "Delhi, being the capital of India, is home to numerous government offices". Assuming that the randomly selected word is "capital", we will proceed to mask it. The following figure clearly illustrates the masking:
BERT’s job is to figure out what these masked words are by looking at the words around them. It’s like a game of guessing where some words are missing, and BERT tries to fill in the blanks.
BERT adds a special layer on top of its learning system to make these guesses. It then checks how close its guesses are to the actual hidden words. It does this by converting its guesses into probabilities.
How does the BERT model compute these predicted probabilities?
In technical terms, the prediction of the output words requires: adding a classification layer on top of the encoder output, multiplying the output vectors by the embedding matrix to transform them into the vocabulary dimension, and calculating the probability of each word in the vocabulary with softmax.
The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked tokens.
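The sketch below (PyTorch, with made-up tensors and a hypothetical token id) shows one common way to express this: the labels at non-masked positions are set to -100 so that the cross-entropy loss ignores them.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30000, 14
logits = torch.randn(seq_len, vocab_size)   # stand-in for BERT's per-token prediction scores

# Labels: -100 everywhere except the masked position, where we put the true token id.
masked_position = 3
labels = torch.full((seq_len,), -100)
labels[masked_position] = 2154              # hypothetical vocabulary id of "capital"

# ignore_index=-100 makes cross entropy skip every non-masked position.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)
```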
Left and Right context in BERT
Because BERT uses bidirectional self-attention (instead of masked self-attention), the model can look at the entire sequence, both before and after the masked token, to make a prediction.
Let's see how BERT uses the left and the right context.
In contrast to what we saw earlier, the attention computation here does not involve a causal mask to eliminate the interaction of words with subsequent words. In masked self-attention we replaced the values in the upper triangular part of the scaled dot product matrix with minus infinity to prevent interactions with future words; in BERT, however, we do not employ any causal mask. This implies that each token in a sentence attends both to the tokens on its left and to the tokens on its right.
MLM - Twisted Masking Procedure of Input Tokens
The key element to achieving bidirectional learning in BERT is the attention mechanism (the core mechanism of every Transformer-based LLM) combined with masked language modeling (MLM). By masking a word in a sentence, this technique forces the model to analyze the remaining words in both directions of the sentence to increase its chances of predicting the masked word correctly.
Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. This is mitigated by a subtle twist in how we mask the input tokens.
Approximately 15% of the tokens are masked at random during pre-training, but not all of the selected tokens are replaced by the [MASK] token.
If the i-th token is chosen, it is replaced with the [MASK] token 80% of the time, with a random token 10% of the time, and left unchanged the remaining 10% of the time.
Let's review an example to get a better sense of the masking procedure described above.
Consider an input sentence - "Delhi, being the capital of India, is home to numerous government offices."
Assuming that the randomly selected token to be masked is "capital", the following figure clearly illustrates the masking process:
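A simplified sketch of this 80/10/10 masking procedure is given below (illustration only; a real implementation works on WordPiece token ids rather than whole words):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """For each selected token: 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                  # this position will be predicted
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)              # kept unchanged, but still predicted
        else:
            labels.append(None)                 # not selected: no prediction target
            masked.append(tok)
    return masked, labels

tokens = "Delhi being the capital of India is home to numerous government offices".split()
print(mask_tokens(tokens, vocab=tokens))
```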
Let us now visualize the whole training process of Masked Language Modeling (MLM) with an example.
MLM - Training Visualization
Input Sentence - "Delhi, being the capital of India, is home to numerous government offices."
Assuming the randomly selected word to be masked is "capital" as before, this leads to a masked input sequence of 14 tokens that is fed into BERT. Since BERT is a Transformer model, an input sequence of 14 tokens produces an output sequence of 14 tokens.
BERT will specifically focus on the masked position, which is the 4th position (i.e. token-3, with index = 3). Subsequently, we will compare token-3 (at the 4th position) with the target token, "capital", and calculate the loss. Finally, we will run the back-propagation process to update the weights.
Essentially, our objective is for the resulting token, 'token-3' to represent 'capital'.
Let us now explore the second task of the pre-training procedure - Next Sentence Prediction.
2. Next Sentence Prediction (NSP)
Numerous crucial downstream tasks, like Question Answering (QA) and Natural Language Inference (NLI), rely on comprehending the connection between two sentences, a factor not explicitly addressed by language modeling.
For example, in Question Answering, the ability to accurately determine how a question relates to a given passage of text is essential for providing a relevant and accurate answer. Similarly, in Natural Language Inference, the ability to discern the logical relationship between two sentences, such as whether one sentence contradicts, entails, or is neutral with respect to the other, is crucial for making accurate inferences and drawing conclusions.
In order to train a model that understands sentence relationships, we pre-train for a Binarized Next Sentence Prediction task that can be trivially generated from any monolingual corpus.
In this Next Sentence prediction pre-training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document.
In this task, two sentences - A and B - are chosen for every pre-training instance as follows: 50% of the time, B is the actual sentence that follows A in the corpus (labeled IsNext); the other 50% of the time, B is a random sentence from the corpus (labeled NotNext).
The assumption is that the random sentence will be disconnected from the first sentence.
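A minimal sketch of how such pre-training instances can be constructed (toy corpus and simplified sampling; the real pipeline draws the random sentence from a different document):

```python
import random

def make_nsp_example(document, index):
    """Build one (sentence A, sentence B, label) pair for Next Sentence Prediction."""
    sentence_a = document[index]
    if random.random() < 0.5:
        sentence_b = document[index + 1]        # the actual next sentence
        label = "IsNext"
    else:
        sentence_b = random.choice(document)    # a random sentence (ideally from another document)
        label = "NotNext"
    return sentence_a, sentence_b, label

document = [
    "Delhi is the capital of India.",
    "It is home to numerous government offices.",
    "The sun is very bright today.",
]
print(make_nsp_example(document, index=0))
```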
We will analyze a very simple example in order to get a feel for this task. The example is simple in the sense that, purely for illustration, we will not use the special tokens.
Let's say we have a corpus consisting of the famous nursery rhyme "Jack and Jill", as follows:
Now let us construct two pre-training instances (for NSP task) as follows:
Let us feed the two pre-training instances into the BERT model and check the predictions.
Now, a natural question to ask is: how does the BERT model predict these label probabilities? To answer it, we first need to understand the clever input representation used by BERT.
BERT- Input Representation
In order to enable BERT to effectively handle a range of downstream tasks, its input representation can unambiguously represent either a single sentence or a pair of sentences within a single input token sequence.
A "sentence" may consist of any contiguous span of text, regardless of whether it forms a proper linguistic sentence.
A "sequence" denotes the input token sequence given to BERT, which could consist of either a single sentence or two sentences combined.
There are two problems in the BERT's input:
Problem-1: In contrast to RNNs, where inputs are fed sequentially, all the inputs are simultaneously fed in one step in this model. However, the model is unable to retain the ordering of the input tokens. It is important to note that the order of words holds significance in every language, both in terms of meaning and syntax.
Problem-2: To effectively carry out the Next Sentence Prediction task, it is essential to differentiate between sentences A and B.
The solution to both of these problems involves incorporating embeddings that include the necessary information into our original tokens, then utilizing the outcome as the input for our BERT model.
The following embeddings are added to the token embeddings: segment (sentence) embeddings, which indicate whether a token belongs to sentence A or sentence B, and position embeddings, which encode the position of each token in the sequence.
To help the model distinguish between the two sentences during training, the input is processed in the following way before entering the model: a [CLS] token is inserted at the beginning of the first sentence and a [SEP] token at the end of each sentence; a segment embedding indicating Sentence A or Sentence B is added to each token; and a position embedding is added to each token.
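The sketch below (PyTorch, with made-up token ids and hypothetical sizes) shows how the three embeddings are combined by simple addition:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768

token_emb    = nn.Embedding(vocab_size, d_model)   # one vector per WordPiece token
segment_emb  = nn.Embedding(2, d_model)            # sentence A -> 0, sentence B -> 1
position_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings

# Example: "[CLS] delhi is a city [SEP] it is in india [SEP]"  (the ids are made up)
token_ids   = torch.tensor([[101, 5, 9, 4, 88, 102, 17, 9, 12, 903, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

# The three embeddings are summed to form the input to the first encoder layer.
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)   # torch.Size([1, 11, 768])
```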
Let us now see a few examples, including the use of the special tokens [CLS] and [SEP].
NSP - Examples
The next sentence prediction task can be illustrated in the following examples.
How does the BERT model predict whether the second sentence (B) follows the first sentence (A), along with the corresponding probabilities?
To predict whether the second sentence is indeed connected to the first, the following steps are performed: the entire input sequence goes through the Transformer encoder; the final state corresponding to the [CLS] token is passed through a classification layer with two outputs (IsNext and NotNext); and softmax is applied to obtain the label probabilities.
Let us now visualize the training procedure of Next Sentence Prediction with the help of an example.
NSP - Training Visualization
Input Sentence - "The man went to the store. He bought gallon of milk."
Assuming the randomly selected word to be masked is "the" in the first sentence - "The man went to the store." and word "of" in the second sentence - "He bought gallon of milk."
After pre-processing, the input sequence including special tokens will be the following:
Input Sequence - [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
The input sequence is now fed into BERT. Since BERT is a Transformer model, an input sequence of 15 tokens produces an output sequence of 15 tokens.
We will only consider the first token of the output, which is token-0, corresponding to the first input [CLS] token. Token-0 will then be fed into a linear layer with two output features, IsNext and NotNext, followed by the application of softmax. We will compare token-0 to the target IsNext, expecting BERT to predict IsNext as the pair of sentences inputted are connected.
Next, we compute the loss using cross entropy and proceed with the back-propagation process to update the weights. This is how BERT is trained on the Next Sentence Prediction task.
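A minimal PyTorch sketch of this step (random tensors standing in for BERT's final states):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
nsp_head = nn.Linear(d_model, 2)             # two output features: IsNext / NotNext

final_states = torch.randn(1, 15, d_model)   # stand-in for BERT's output sequence
cls_state = final_states[:, 0, :]            # token-0, corresponding to [CLS]

logits = nsp_head(cls_state)
probs = F.softmax(logits, dim=-1)            # probabilities for IsNext / NotNext

target = torch.tensor([0])                   # 0 = IsNext: the two sentences are connected
loss = F.cross_entropy(logits, target)       # cross entropy takes the raw logits
loss.backward()                              # back-propagation updates the head's weights
```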
Importance of [CLS] token in BERT
Let's review how the [CLS] token works.
The [CLS] token always interacts with all the other tokens because we do not apply any causal mask. We can therefore consider the [CLS] token as one that captures information from all the other tokens, since the attention matrix has no causal masking applied before the softmax. All of the attention values are therefore actually learned by the model, and this is the idea behind the [CLS] token.
If we perform matrix multiplication between the Soft Scaled Dot product and the Values V matrix, we will obtain the Attention Output matrix. However, as there are no zero values in the Soft Scaled Dot Product Matrix, the CLS token in the first row of the Attention Output matrix will have access to the attention scores of all the tokens. In essence, the [CLS] token in the attention output matrix (corresponding to the first row) will consolidate all the attention scores and relationships with all the tokens.
CEO - Investor Analogy
The [CLS] token can be thought of as a chief executive officer (CEO) in a company, with you playing the role of an investor. As an investor, you don't seek information from individual employees; instead, you approach the CEO. Similarly, the [CLS] token is responsible for gathering the required information from every word of the sentence in order to achieve the desired outcome. It serves as the aggregator of all the information within the sentence, enabling us to classify it. Hence the name [CLS] (classification token).
BERT Pre-Training
When training the BERT model, both Masked LM and Next Sentence Prediction tasks are trained simultaneously with the goal of minimizing the combined loss function of the two tasks.
This results in a language model that possesses improved capabilities in comprehending the context within sentences and the connections between them. This approach ultimately leads to the development of a robust language model.
But how does one predict output for two different tasks simultaneously?
BERT Output
How can the output for two distinct tasks be predicted simultaneously?
To obtain the solution, one can utilize a distinct FFNN + Softmax layer constructed on the basis of the outputs derived from the last encoder, which correspond to the desired input tokens. The outputs from the final encoder will be referred to as the final states.
The first input token is always a special classification [CLS] token. The final state associated with this token is utilized as the comprehensive (or aggregate) sequence representation for classification assignments and is employed for the Next Sentence Prediction. In this prediction, it is inputted into a FFNN + Softmax layer that calculates probabilities for the labels "IsNext" or "NotNext".
The final states corresponding to the [MASK] tokens are fed into a FFNN + Softmax layer to predict the masked words.
Note: Here FFNN is an acronym for Feed Forward Neural Network.
To be continued in Part-2, where we will explore the concept of Fine-Tuning.
References:
- Vaswani et al. (2017), Attention is all you need