The Evolution of Language Models: my notes
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction
It has been a while since I announced here on LinkedIn my next series of blogs, starting from:
And then each subsequent blog going into every unit/component of the Transformer architecture, including:
The first blog in the above series, on the foundational principles of Deep Learning, can be found here.
This article is about the evolution of language models – the topic itself is a good fit for a detailed textbook – but the content here is a consolidation of my notes from various sources, including the Natural Language Processing Specialization on Coursera [https://www.deeplearning.ai/courses/natural-language-processing-specialization/], the Deep Learning Specialization by Andrew Ng [https://www.deeplearning.ai/courses/deep-learning-specialization/], several YouTube videos, open-source university lectures, blog posts by other learners, research papers, and my own interpretation of this fascinating subject.
Through this article, I have attempted to trace the hierarchical evolution of language models: n-gram models and their limitations; the Deep Learning era with Recurrent Neural Networks and Long Short-Term Memory units (LSTMs) and the problems with these networks; and then Transformers, with a detailed emphasis on the attention mechanism and a high-level view of the Transformer architecture.
At the time I started writing/consolidating my notes on this topic – 2-3 weeks ago – I had not envisaged that this article would become as lengthy as it has, but I think it has been worth the effort. I’d reiterate that these notes are primarily for my own future reference as I apply these concepts in my work and personal (fun) projects, but I’m happier still if the content is useful to other learners in my LinkedIn community.
2. Introduction to Language Modelling
What is Language Modelling?
Most simply, Language Modelling is the task of predicting what word comes next, given a sequence of words. For example, given a piece of text:
The students opened their ---------------
The possible answers could be words such as “books” or “exams”.
Thus, a formal definition of a language model is:
Given a sequence of words x(1), x(2), …, x(t), a language model will compute the probability distribution of the next word x(t+1), i.e. P(x(t+1) | x(t), …, x(1)).
The above probability distribution is a conditional probability.
The word x(t+1) can be any word in the vocabulary:
The word x(t+1) will come from a vocabulary – that is, there is a pre-defined list of words that we are considering.
Thus, language modelling can be viewed as a classification task because there is a pre-defined number of possibilities.
Other ways of thinking about Language Modelling
One can think of a Language Model as a system that assigns a probability to a piece of text. For example, if we have some text x(1), …, x(T), then the probability of this text – according to the Language Model and the chain rule of probability – is: P(x(1), …, x(T)) = P(x(1)) × P(x(2) | x(1)) × … × P(x(T) | x(T-1), …, x(1)).
3. Daily use of Language Models
Some examples of language models in daily use are:
1. Auto-completion – next-word prediction on smartphones
2. Word suggestions by Google:
3. How to learn a Language Model?
In the pre-Deep Learning era, the answer to “how to learn a language model” would have been the “n-gram” model. So let us discuss n-grams.
Definition of n-gram
By definition, an n-gram is a sequence of “n” consecutive words.
Thus,
A unigram is just an individual word.
A bigram is a pair of consecutive words.
And so on for trigrams and 4-grams, as shown below:
Thus, the core idea of an n-gram model is that, in order to predict what word comes next, we collect statistics about how frequent different n-grams are in some training data and then use those counts to predict the next word.
Let us add some detail:
To make an n-gram language model, we first make a simplifying (Markov) assumption: the next word x(t+1) depends only on the preceding (n-1) words.
So, by the definition of conditional probability, this probability is the ratio of two different probabilities: the probability of the n-gram divided by the probability of the (n-1)-gram.
The question then remains:
How do we get the n-gram and (n-1)-gram probabilities? These are obtained by counting them in some large corpus of text – that is, your training data.
That is, mathematically, P(x(t+1) | x(t), …, x(t-n+2)) ≈ count(x(t-n+2), …, x(t), x(t+1)) / count(x(t-n+2), …, x(t)).
4. Example to learn a 4-gram Language Model
Let us say we are going to learn a 4-gram language model and we have a piece of text that says:
As the proctor started the clock, the students opened their ---------
We want to predict the word in the blank above.
Since we are using a 4-gram language model, our assumption is that the word in the blank depends only on the last 3 words. So, we lose all of the context except the last 3 words, as shown below:
So, the probability of a word “w” given the (n-1) = 3 preceding words is:
the count of “students opened their w” in the corpus divided by the count of the (n-1) words, i.e. “students opened their”.
Thus, suppose, for example, that in the corpus/training data “students opened their” occurred, say, 1000 times and “students opened their books” occurred 400 times. Then,
P(books | students opened their) = 400 / 1000 = 0.4
And suppose,
a) “students opened their exams” occurred 100 times; then,
P(exams | students opened their) = 100 / 1000 = 0.1
This is a classic example underscoring that it was not a good idea to discard the “proctor” context here – i.e., the word “proctor”, which was not part of the 4-gram window.
Considering the complete sentence:
As the proctor started the clock, the students opened their ---------
It is clear, considering the complete sentence, that “exams” was more likely to be the last word than “books” – although “books” had a higher probability than “exams” under the 4-gram model. Thus, it is clear we were throwing away too much context – that is one problem with an n-gram language model.
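To make this concrete, below is a minimal Python sketch of such a count-based 4-gram model. The toy corpus, the variable names and the printed example are my own illustrative choices, not taken from any particular library:

from collections import Counter

# Toy training corpus (illustrative only).
corpus = ("as the proctor started the clock the students opened their exams "
          "the students opened their books and the students opened their books again").split()

n = 4  # 4-gram model: condition on the previous 3 words
context_counts = Counter()   # counts of (n-1)-word contexts
ngram_counts = Counter()     # counts of full n-grams

for i in range(len(corpus) - n + 1):
    context = tuple(corpus[i:i + n - 1])
    word = corpus[i + n - 1]
    context_counts[context] += 1
    ngram_counts[(context, word)] += 1

def prob(word, context):
    # P(word | context) = count(context + word) / count(context)
    if context_counts[context] == 0:
        return 0.0  # in practice we would back off to a shorter context here
    return ngram_counts[(context, word)] / context_counts[context]

context = ("students", "opened", "their")
print(prob("books", context))   # higher, because "books" follows this context more often
print(prob("exams", context))   # lower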
There are some other problems too:
Sparsity problems
a) What happens if the numerator is 0? Sometimes an uncommon word such as “petri dishes” might be the right word “w” – it may never occur in the training data, yet the student could be a Biology student and the last word really could be “petri dishes”. If a word never follows the context in the training data, the n-gram model will assign zero probability to that event.
One solution to this problem is to add a small probability to every word in the vocabulary, so that every word has a small, non-zero probability of occurring. This is called “smoothing”.
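For instance, with add-one (Laplace) smoothing, where |V| is the vocabulary size:
P(w | students opened their) = (count(students opened their w) + 1) / (count(students opened their) + |V|)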
b) Back-off to a trigram language model: the next problem is, what if the denominator is 0 – that is, what if we never saw the trigram “students opened their” in the training data? If this happened, we would not be able to calculate the probability distribution at all! In that case, we back off and condition on the last 2 words rather than the last 3 words.
These kinds of problems become worse as you increase “n”.
Below is a diagram describing the problems a bit succinctly:
Storage problems
Considering the example:
Here we need to think about what we have to store in order to use the n-gram language model: we must store the counts for all the n-grams and (n-1)-grams (such as “students opened their”) seen in the corpus. As “n” increases, the number of distinct n-grams increases and hence so does the storage required – the n-gram model gets bigger.
It is also worth noting how incoherent the text generated from a trigram model can be!
5. Neural Networks for language modelling: Recurrent Neural Networks
Recurrent Neural Networks, as discussed below, are much more effective than the n-gram models above because of their notion of memory, elaborated in detail in the subsequent paragraphs. They process the input one time step at a time and can thus capture the semantic meaning of the sentence.
In the paragraphs below, I have attempted to build up the concept of recurrent neural networks, starting from the perceptron, moving to a single-layer neural network, and then stacking such units and connecting them together, modelling memory through information passed from one unit to the next.
5.1 Understanding Recurrent Neural Networks
Firstly, let us start from the very fundamentals by revisiting the concept of the perceptron, and develop a solid understanding of the changes needed in the neural network architecture to handle sequential data. In a perceptron, we have a set of inputs x1 through xn; each of these is multiplied by a weight, the weighted inputs are all added together to form the internal state of the perceptron, which we will call z, and this value z is passed through a non-linear activation function to produce a predicted output y_hat, as shown in the figure below:
It may be recalled that, with the perceptron, one can have multiple inputs coming in and, since we are talking about sequence modelling, these inputs can be considered as coming from a single time step in a sequence. We can extend from the single perceptron to a layer of perceptrons to yield multi-dimensional outputs, as shown in the figure below:
It should be underscored that the above mechanism has no notion of time or sequence. All the inputs and outputs above can be thought of as coming from a fixed time step of a sequence.
Now, let us simplify the diagram and collapse the hidden layer, as shown:
Here, the input and the output vector are depicted as being of length m and length n respectively. Now, if we apply the same model repeatedly for each time step in the sequence, we get a sense of how we could handle the individual inputs across different time steps, as shown in the figure below:
All the models depicted above are replicas of each other at different time steps. The output vector y_hat_t is a function of the input at that time step, as shown below:
Now, taking a step back: if we are considering sequence data, it is very likely that the output at a particular time step depends on inputs at prior time steps – so we cannot treat the individual time steps as isolated. Thus, we have to consider the relationships between inputs at different time steps that are inherent to sequence data. So how do we address this?
We need to link the information in the computation of the network at different time steps to each other. Specifically, we are going to introduce internal memory or cell state denoted h_t;
The h_t is the memory that is maintained by the neurons and the network itself, and this state can be passed from time step to time step. The key idea here is that by having this recurrence relation we capture the notion of memory. What this means is that the network’s output predictions and computations are not only a function of the input at a time step but also of the memory of the cell state h_t. Thus, the output depends both on the current input and on the past computations and past learning. One can define this relationship by means of functions that map the inputs to the outputs as below:
As we can see, we can describe the neurons via a recurrence relation, which means the cell state depends on the current input and on prior cell states. It is exactly this idea of a recurrence relation (see figure below) that provides the intuition behind the key operation of recurrent neural networks, or RNNs, and in the sections below we build up an understanding of the mathematics of the recurrence relation and the operations that define RNN behaviour.
5.2 Recurrent Neural Networks: Mathematics
Now, let us formalize the discussion a bit. The key idea is that RNNs, as described above, maintain an internal state h_t which is updated at each time step as the sequence is processed, and this is done by the recurrence relation, which defines how the state is updated at each time step. Specifically, we define the internal cell state h_t;
This internal cell state is defined by a function parametrized by a set of weights “w”. These weights w are learnt while training such a network. The function f_w takes as input both the input at the current time step, x_t, and the prior cell state, h_(t-1): h_t = f_w(x_t, h_(t-1)).
The key feature of RNNs is that they use the same function and the same parameters whilst processing the sequence. The weights of course change during the course of training, but within each forward pass the same set of weights is applied at every individual time step.
5.3 RNN Computation: State Update and Output
RNN computations include both the internal cell state update h_t and the output prediction itself. Let us now walk through how these RNN computations are defined:
Firstly, we consider the input vector x_t and then we update the hidden state; the updated hidden state is h_t = tanh(W_hh · h_(t-1) + W_xh · x_t);
As seen in the above equation, the function used to calculate the hidden state is a standard neural network operation, as seen at the beginning of section 5.1. Again, the internal cell state h_t depends on both the input x_t and the previous cell state h_(t-1): we multiply each of the terms by its respective weight matrix, add the results, and apply a non-linear activation function to the sum of the two terms. The non-linear activation function here is the hyperbolic tangent.
Then, to generate the output at a given time step, we take the internal hidden state at that time step ‘t’ and multiply it by a separate weight matrix, which produces a modified version of the hidden state that forms the output prediction: y_hat_t = W_hy · h_t;
This gives the mathematics of how an RNN updates its hidden state and produces the predicted output. The RNNs described above can be represented as below;
It should be emphasized that in an RNN we re-use the same weight matrices at every time step.
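As a rough Python (NumPy) sketch of these update equations – the dimensions are made up and the weights are randomly initialised rather than trained:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 10   # illustrative sizes

# The same three weight matrices are re-used at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_forward(x_sequence):
    h = np.zeros(hidden_dim)                 # initial cell state h_0
    outputs = []
    for x_t in x_sequence:                   # process the sequence one time step at a time
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # h_t = tanh(W_hh h_(t-1) + W_xh x_t)
        outputs.append(W_hy @ h)             # y_hat_t = W_hy h_t
    return outputs, h

sequence = [rng.normal(size=input_dim) for _ in range(5)]   # 5 time steps
outputs, final_state = rnn_forward(sequence)
print(len(outputs), final_state.shape)   # 5 outputs, hidden state of size 16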
5.4 Training the RNNs
For RNNs, the forward pass through the network consists of going forward across time and updating the cell state based on the input as well as the previous state and then generating the output. The loss is computed at each time step and then finally the individual losses are summed to get the total loss.
The Vanishing and Exploding Gradients problem in RNNs:
RNNs suffered from a serious problem relating to vanishing and exploding gradients and were thus superseded by LSTMs and then Transformers! Let's understand what the vanishing and exploding gradients problem is.
Let's say you're using an RNN for a machine translation task. Each unit of the RNN can be thought of as a separate neural network unit connected to the successive unit. RNNs possess an input state, an output state and an internal memory state which retains the information of the preceding units.
We proceed forward as we would in a feed-forward network, predict the output of each unit of the RNN, and carry the information (memory) forward.
Next, to train the RNN, we calculate the loss of each unit, and the total loss is the sum of the losses of all units. We then do back-propagation, and this is where the problems arise. If we are dealing with a long sequence, back-propagation involves multiplying many gradients together to update the weights, and by the time we reach the first unit a large number of gradients will have been multiplied together, especially for a long-sequence problem.
Now, if the gradients are < 1, the multiplication of several small numbers creates what is termed a “vanishing” gradient, and if the gradients are > 1, the multiplication of several large numbers results in what is termed an “exploding” gradient. For example, 0.9^50 ≈ 0.005, whereas 1.1^50 ≈ 117.
The problem of exploding gradients may be mitigated by “gradient clipping”, which constrains large gradients so that the multiplication does not blow them up. The vanishing gradient problem may be mitigated through the choice of activation function, weight initialisation and network architecture.
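A minimal sketch of gradient clipping by global norm; the threshold of 5.0 and the toy gradients are arbitrary illustrative choices:

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale a list of gradient arrays so that their combined L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: an artificially "exploding" gradient gets scaled down.
grads = [np.full((3, 3), 100.0), np.full((3,), 100.0)]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # ~5.0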
6. Neural Networks for language modelling: Long Short-Term Memory Units (LSTMs)
LSTMs may be regarded as the best-known solution to the vanishing gradient problem of RNNs discussed above. Let's have a high-level view of the architecture of LSTMs.
LSTMs are a special variety of RNNs designed to handle longer sequences of data – fundamentally, LSTMs learn “what to remember” and “what to forget”.
The basic anatomy of an LSTM comprises:
a) Cell state – which can be thought of as its memory.
b) Hidden state – where computations are performed during the training to decide on what changes to make.
The hidden state has 3 gates to pass through before the entire operation is performed again. Each gate plays a role in deciding how much information to pass along and how much to leave behind. This series of gates allows gradients to flow relatively unchanged, so that the risk of vanishing and exploding gradients is mitigated.
6.1 Architecture of LSTMs
Let us now understand the architecture of LSTMs in an intuitive sense without involving any mathematics:
LSTMs incorporate a cell state and a hidden state. The cell state gets updated at each step of the training process. The update to the cell state occurs through 3 gates (part of the hidden state), as described below.
The cell state can be thought of as retaining the memory of the necessary context for long sequences.
The update to the cell state occurs through 3 gates, which are termed:
a) The "Forget" Gate
b) The "Input" Gate
c) The "Output" Gate
The "forget" gate looks at the information from the previous cell state and decides what information is required and what is not for the cell state to be updated. As might be expected, this is implemented with a sigmoid layer: 0 to forget and 1 to remember!
The input gate decides what values are to be updated compared to the previous cell state. This is done through a sigmoid layer, and the values are then passed through a tanh layer to get a vector of new candidate values.
The output gate then decides what part of the updated cell state is exposed as the output/hidden state.
The loop above is repeated during each iteration of the training process, and during each step the weights are updated.
Mathematically, this gated update of the cell state at each step helps keep the gradients close to 1, avoiding the vanishing and exploding gradient problems of RNNs.
6.2 Key Ideas in Long Short-Term Memory Units
The key idea behind LSTMs is that they can selectively add or remove information to and from the internal cell state using structures called “gates”. The gates contain standard neural network layers, such as a “sigmoid”, together with pointwise multiplication.
Now, let us see what these gates are doing. For example, we have a sigmoidal activation function – this forces anything that passes through the gate to be between 0 and 1. One can think of this as modulating how much of the input should be passed through, between nothing [0] and everything [1], which effectively “gates” the flow of information. LSTMs use this type of operation by first forgetting the irrelevant history, secondly storing the relevant new information, thirdly updating the internal cell state, and finally (fourth) producing the output.
6.3 How do LSTMs process information?
The steps by which LSTMs process information are as follows:
1. The first step is to forget the irrelevant parts of the previous state. This is achieved by passing the previous state through one of the sigmoid gates, which can be thought of as modulating how much information should be passed on or kept out.
2. The next step is to determine which parts of the new information and which parts of the old information are relevant, and to store this in the cell state.
The highlight of the LSTMs is that they maintain a separate cell state c_t in addition to the hidden state h_t introduced previously. The c_t is selectively updated by the gated operations.
3. Finally, we can return an output from the LSTM – this is an interacting layer, the output gate, which controls what information encoded in the cell state is ultimately output and sent to the network as input in the following time step. This operation controls the value of the output y_t as well as the state that is passed from time step to time step in the form of h_t.
The above steps are illustrated in the figures below.
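The following is a rough NumPy sketch of a single LSTM step with the three gates described above. The dimensions and initialisation are illustrative, and bias terms are omitted for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
# One weight matrix per gate plus one for the candidate cell update;
# each acts on the concatenation [h_(t-1), x_t].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)            # forget gate: what to drop from the old cell state
    i = sigmoid(W_i @ z)            # input gate: what new information to store
    c_tilde = np.tanh(W_c @ z)      # candidate values for the cell state
    c = f * c_prev + i * c_tilde    # updated cell state (the "memory")
    o = sigmoid(W_o @ z)            # output gate: what to expose as the hidden state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in [rng.normal(size=input_dim) for _ in range(5)]:
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)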
6.4 Key takeaway points from LSTMs:
7. Problems in RNNs and LSTMs
Recurrent Neural Networks – as discussed above – were the state of the art for sequence-to-sequence modelling, but they suffer from the following problems:
1. RNNs are slow to train – the input data needs to be passed sequentially, one element after the other. We need the previous state before we can perform any operations on the current state. Such sequential flow does not make use of GPUs, which are essentially designed for parallel computation.
2. Further, RNNs cannot deal with long sequences very well, which leads to the problem of vanishing and exploding gradients as explained above. LSTMs solve the problem of long sequences to an extent, but they are even slower than RNNs!
8. Transformer Neural Network Architecture
Now, from the above discussion it is clear that we need to bring all time steps together – to eliminate sequential processing – as well as to extract information related to the context/semantic meaning of the sentence from the input data. The key idea is to be able to identify and “attend” to what is important in a sequential stream of data. This is the notion of “attention”, or “self-attention” – a powerful concept at the heart of the Transformer architecture introduced in the paper titled “Attention is All You Need”, published by Google in 2017.
Attention – the key operation in the Transformer architecture – is a very intuitive idea, so it is worth developing some intuition for the concept of attention.
8.1 Intuition Behind Attention
Extracting the most important features of an image
In this article, I will be focusing on the idea of self-attention – i.e., attending to the most important parts of the input example. Consider the image shown in the figure below:
This is an image of Superman, and our goal is to extract information about the important features of this image. Looking at the image, we might scan it pixel by pixel with our eyes while our brain does some kind of processing in which we look at the important parts of the image – here we might focus on the attire and thus understand that it is an image of Superman. Thus, what is perhaps happening is that our brains identify which parts of the image are to be “attended” to and then extract the features which deserve the highest “attention”.
YouTube Video Search
The problem is similar to “search”. For example, one might be searching for some videos related to “language models evolution” – this is our “query”.
We get back some possible outputs. These outputs contain some information related to the “title” – these outputs are termed the “keys”. Now, we want to compute a metric of similarity or relevance between the “query” and the “keys” and retain the videos with a high similarity metric – those are the videos that we need to pay “attention” to and will subsequently watch. The relevant videos with a high similarity metric are called the “values”.
8.2 Demystifying the Attention Mechanism
Processing the text through the Transformers:
Let us go back to the language example. Given a sentence of the form “Hi, how are you”, our goal is to “identify” and “attend” to features in the input which are relevant to the semantic meaning of the sentence.
Thus, we have a sequence with an order of words, and we have to eliminate recurrence – that is, we pass all the words (i.e. word embeddings) into the network simultaneously. We still need a way to encode and capture the order and positional dependence. This is done through positional encoding, which captures the inherent order information present in the sequence. This is explained below.
How is text encoded through the Transformer architecture?
Transformers do not model the order of the input anywhere, so it is important to encode the order of the input explicitly. This happens in positional encoding.
The Transformer has to know which word comes before or after which other word, and must not treat the input as an arbitrary permutation. This is where positional embeddings come in. They are a kind of “hint” for the Transformer about the whereabouts of each word within the sequence.
We add the positional embedding to the input embedding – moving it a certain distance from where it currently is, as shown in the figure below:
Positional embeddings are identifiers that are added to the original word embeddings so that the Transformer knows the order of the sequence.
The positional embedding should fulfil certain requirements:
Every position must have the same identifier irrespective of the sequence length. It should also be noted that, since the positional embeddings push the original input embeddings around, care should be taken that the input embeddings do not drift too far away, otherwise the semantic meaning of the embedding is lost.
We could think of numbering the positional embeddings 1, 2, 3, …, but such values grow without bound and would push the input embeddings too far. We thus need to keep the value of the positional embedding bounded.
To keep the values of the positional embedding bounded, we choose sine and cosine functions: sines and cosines are bounded between -1 and +1 even though their argument ranges from -infinity to +infinity. We take both sines and cosines because, had we taken only sines, we might have got repeated position values, which is not permitted. Hence, we take both sines and cosines, as shown below:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
In the above,
pos = denotes the position/order of the word in the sentence
i = index in the embedding vector
d_model = embedding length
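A small NumPy sketch of these sinusoidal positional embeddings; the sequence length and d_model below are arbitrary illustrative values:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # one term per sin/cos pair
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe   # bounded between -1 and +1, added to the word embeddings

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)             # (10, 16)
print(pe.min(), pe.max())   # stays within [-1, 1]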
Arriving at the Query, Key, Value Matrices
Next, our goal is to take the encoding and figure out what to attend to, exactly like the intuitive idea of the YouTube video search explained in the section above – that is, extracting the query, extracting the key and extracting the value, and relating them to each other. We use neural network layers to do exactly this.
Given the positional encoding, we apply a neural network layer to transform it, first generating a query vector; then we use another neural network layer, with a separate set of weights, to generate a key vector; and then another layer, with yet another separate set of weights, to generate a value vector.
Computing the Attention Weighting
Now, given the Query, Key and Value matrices, we compare them to each other to find out where in the input the network should attend in order to figure out what is important. We compute an Attention Score, which is a similarity metric between the Query and the Key matrices.
It should be underscored that the Query and Key similarity may be computed mathematically by taking the dot product, as shown in the figure below. This is closely related to “cosine similarity” (the dot product of the two vectors after normalisation).
This is the attention weighting, showing what the network should attend to. This operation gives us a score defining how the components of the input data are related to each other. As may be noticed from the figure below, words which are more closely related to each other will have a high attention score. These scores are passed through a softmax layer to transform them into numbers between 0 and 1.
Extracting features of High Attention
Finally, since we have this matrix – the attention scores – that captures the notion of similarity, we can use it to extract the features that deserve high attention – and that is the final step in the self-attention mechanism.
We take the attention matrix and multiply it by the value vectors, transforming the initial input embedding vectors into a series of vectors that reflect the features deserving high attention.
The series of steps discussed above are illustrated below:
The attention scores, which are formulated as a dot product of Q and K, clearly illustrate how much each word depends on the other words, through the matrix shown (again) below:
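Pulling these steps together, here is a bare-bones NumPy sketch of single-head scaled dot-product self-attention. The dimensions, the random input (standing in for the position-encoded word embeddings) and the weight initialisation are illustrative only:

import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v     # query, key and value vectors for every word
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # similarity between every query and every key
    weights = softmax(scores)               # attention weighting: each row sums to 1
    return weights @ V, weights             # attention-weighted values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 16, 8            # e.g. the 4 words of "Hi how are you"
X = rng.normal(size=(seq_len, d_model))     # position-encoded input embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)             # (4, 8): one contextually weighted vector per word
print(weights.sum(axis=1))   # each row of attention weights sums to 1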
Several such attention heads are used in the Transformer architecture which results in the concept of Multi Headed Attention.
Multi-Headed Attention
The operation of taking the positional encoding, passing it through a neural network unit for computation of query, key and value matrices respectively is called an Attention Head.
In Multi-Headed Attention, we split each of these vectors into 8 pieces, each piece now being termed an “attention head”.
Thus, the Query, Key and Value vectors being learnt are split into 8 pieces, each piece being termed an attention head. Once the Query, Key and Value vectors are learnt for each head, the similarity metric Q·K establishes the attention matrix for EACH head.
Carrying out the process 8 times, i.e. breaking down the Q, K and V into 8 heads, results in much more contextually aware vectors.
And the parallel processing allows us to use GPUs for all the computation going on!
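For completeness, here is a rough NumPy sketch of how the multi-head variant splits the computation across 8 heads. It is self-contained with toy dimensions; a full implementation would also apply a final learnt output projection after concatenating the heads:

import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 64, 8
d_head = d_model // n_heads                 # each head works on a 64/8 = 8-dimensional slice

X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
# Split Q, K, V into 8 heads: (seq_len, d_model) -> (n_heads, seq_len, d_head)
split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
Qh, Kh, Vh = split(Q), split(K), split(V)

scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # one attention matrix per head
heads = softmax(scores) @ Vh                            # (n_heads, seq_len, d_head)

# Concatenate the heads back into one contextually aware vector per word.
output = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(output.shape)   # (4, 64)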
Conclusions from the Attention Mechanism:
The following may be concluded from the discussion above:
1. All the above operations encode the input to represent “attention” information.
2. For a language model, the attention weights help us understand what parts of the text to focus on to predict the next word.
3. For a language translation task, they help the decoder to focus on specific parts of the text during the decoding process.
4. The above steps eliminate the need for sequential processing and capture the semantic meaning of the sentence through the attention mechanism.
8.3 Understanding the overall working of the Transformer at a high level
On reading the paper “Attention is All You Need”, one does not get the impression that it was written with the intention of being the foundation of large language models such as GPT and BERT (discussed below). The paper was written with the intention of accomplishing a specific task related to language translation – therefore, even though language translation through deep learning is not a topic of this blog, it is worthwhile discussing it at a very high level. Let us assume we are translating from English to French.
The Transformer architecture consists of 2 parts: the Encoder and the Decoder. During training, the Encoder takes the English words (word embeddings) and generates word vectors simultaneously. These word vectors (as illustrated in the figure below) are contextually aware because of the attention mechanism discussed above in quite some detail.
Next, to the Decoder: at the start, we pass in the contextually aware English word vectors generated by the Encoder, along with a start token signifying the start of the sentence, and start generating the French words one after the other. Each time we want to determine the nth French word, we pass in all the English word vectors and the (n-1) French words generated up to that step.
To determine the model’s loss, we compare the predicted and the true word – the loss (cross-entropy loss) is computed for every predicted word and summed to get the total loss. The loss is then back-propagated through the network to update the weights/parameters of the network. This, at a very high level, is the training that occurs in the Transformer.
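As a toy illustration of that loss computation in Python – the four-word French vocabulary, the predicted distributions and the target words below are entirely made up:

import numpy as np

vocab = {"le": 0, "chat": 1, "est": 2, "noir": 3}

# The model's predicted probability distribution over the French vocabulary at each step.
predicted = np.array([
    [0.70, 0.10, 0.10, 0.10],   # step 1: true word is "le"
    [0.05, 0.80, 0.10, 0.05],   # step 2: true word is "chat"
    [0.20, 0.20, 0.50, 0.10],   # step 3: true word is "est"
])
targets = [vocab["le"], vocab["chat"], vocab["est"]]   # the true next words

# Cross-entropy for each predicted word, then summed to get the total loss.
per_word_loss = [-np.log(predicted[step, word]) for step, word in enumerate(targets)]
total_loss = sum(per_word_loss)
print(per_word_loss, total_loss)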
What is really important in the above architecture is that we have 2 components – an encoder and a decoder – that have a contextual understanding of language. If we stack the encoders, we get BERT (Bidirectional Encoder Representations from Transformers); if we stack the decoders, we get GPT (the Generative Pre-trained Transformer).
The following points may be noted:
a) Encoder-only models are good for tasks that require understanding of the input, such as sentiment classification and named entity recognition (NER)
b) Decoder-only models are good for generative tasks such as text generation – language models.
c) Encoder-decoder models are good for generative tasks such as text summarization or translation (many-to-many)
I will end this blog here – the subsequent blogs will go into the detail of each unit of the Transformer, along with the code.