Understanding Transformers in Natural Language Processing
Sarthak Pattnaik
Senior Software Engineer at HCLTech | MS Applied Data Analytics at Boston University
The T in ChatGPT, GPT-3, and GPT-4 stands for Transformer. A Transformer is an attention-based sequence-to-sequence encoder-decoder architecture that can capture long-range dependencies and contextual information. A Transformer analyses the relationship between every pair of tokens (in the case of text, tokens are words or sub-words); this mechanism is referred to as attention.
HOW DOES A TRANSFORMER WORK?
The working of a Transformer involves a few sequential operations. First, it converts the text into a sequence of numbers so that the model deals with continuous rather than discrete data points. This process is referred to as vector embedding, and there are a multitude of techniques for it, a few prominent ones being character embedding and word embedding. Transformers have no inherent understanding of the position of words in a sentence, so they combine the embeddings with sinusoidal positional encodings before passing the information onward. The embedded and position-encoded input then proceeds through multiple self-attention layers and feed-forward neural networks, which capture dependencies in the input and add expressive power to the model through complex transformations. The output of the encoder then goes to the decoder, where each decoder block has self-attention and encoder-decoder attention sublayers to understand the context of a specific sentence. Finally, because the attention dot products can produce results between negative and positive infinity, the decoder output is projected to the size of the vocabulary and a softmax function is applied to turn it into a probability distribution.
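Below is a minimal NumPy sketch of the sinusoidal positional encoding described above; the function name and shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings:
# embeddings = token_embeddings + positional_encoding(seq_len, d_model)
```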
The Transformer consists of two broad components: the encoder and the decoder (see Figure 1).
Each part of the Transformer can be used independently based on the task. Encoder models use only the encoder part of the Transformer and are best suited for tasks that require a complete understanding of a sentence. The pretraining of such a model involves corrupting a sentence (for example, by masking some of its words) and training the model to repair it. On the other hand, decoder models are best for use cases that deal with text generation; the model is trained to complete a sentence by predicting the next word. Language models that use both encoders and decoders are referred to as sequence-to-sequence models.
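As an illustration, the Hugging Face transformers library exposes both kinds of models through its pipeline API; the checkpoints below are just common public examples:

```python
from transformers import pipeline

# Encoder-only (BERT-style): pretrained by masking words and asking the
# model to reconstruct the corrupted sentence -- suited to understanding tasks.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transformers capture [MASK] dependencies in text."))

# Decoder-only (GPT-style): pretrained to predict the next word -- suited
# to text generation.
generate = pipeline("text-generation", model="gpt2")
print(generate("A Transformer is", max_new_tokens=20))
```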
TRAINING A TRANSFORMER
Transformer models are trained on large amounts of data in a self-supervised way, where they learn without humans ascribing labels to the data. During this training phase the model inculcates a statistical understanding of the text data on which it is trained. The result generated from the model alone is often underwhelming, so the model usually goes through a subsequent phase of learning called transfer learning, where humans annotate the training data with labels. The first phase of self-supervised learning is also called pretraining, where a model is trained from scratch, with randomly initialized weights, on gigantic amounts of data. This training phase takes several weeks and requires tremendous amounts of computing power, since the network is trained over an enormous number of steps and hyperparameters must be tuned to find the configuration that provides the best results. Therefore, rather than training models from scratch, fine-tuning an already pretrained model is encouraged, since one can leverage the knowledge the model gained through initial training and build on top of it.
MODEL FINE-TUNING
Fine-tuning a Large Language Model is a much simpler process than pretraining, which involves self-supervised learning over multiple epochs on a dataset with a humongous number of records. Model fine-tuning, in contrast, is performed on a smaller dataset with labels, which makes it a supervised learning process. There are numerous ways to perform fine-tuning. One of them is to use prompts to instruct the model on what exactly needs to be done; this is called instruction fine-tuning.
Curating a fine-tuning dataset involves transforming a raw dataset using sample prompts with instruction templates for specific tasks such as classification, text generation, and text summarization. Once the dataset is ready, it is split into training, validation, and test sets. The training data is passed to the pretrained model, and the labels the model generates are compared with the labels in the original dataset. Since the labels are encodings, the generated values can be compared with the original ones using the cross-entropy loss function, and the model weights can be updated through backpropagation. This process is repeated over many epochs to find the weights that provide the best accuracy. The validation set is then used with the hold-out validation method to calculate the validation accuracy, and a final accuracy check is performed using the test data.
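A minimal sketch of this loop, assuming PyTorch; here model, train_loader, and val_loader are hypothetical placeholders, and the model is assumed to return logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):                      # repeat over several epochs
    model.train()
    for batch in train_loader:
        logits = model(batch["input_ids"])  # (batch, seq, vocab) -- assumed
        loss = F.cross_entropy(             # compare generated labels
            logits.view(-1, logits.size(-1)),
            batch["labels"].view(-1),       # against the reference labels
        )
        loss.backward()                     # backpropagation updates weights
        optimizer.step()
        optimizer.zero_grad()

    # hold-out validation: accuracy on data the model never trained on
    model.eval()
    with torch.no_grad():
        correct = total = 0
        for batch in val_loader:
            preds = model(batch["input_ids"]).argmax(dim=-1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].numel()
        print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```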
CATASTROPHIC FORGETTING
It is often recommended to avoid fine-tuning a large language model on a single task, as this leads to catastrophic forgetting. Catastrophic forgetting occurs when a model is fine-tuned to perform exceptionally well on a single task but its performance on other tasks consequently plummets. For example, a model that performs sentiment analysis exceptionally well but underperforms on sentence completion is still undesirable. One solution is to fine-tune the model on multiple tasks at the same time; another is to consider Parameter-Efficient Fine-Tuning (PEFT).
MULTI-TASK FINE TUNING
Multi-task fine-tuning uses multiple prompts spanning an eclectic set of tasks that encapsulate all the scenarios the model must handle. It requires a considerably large dataset with multiple examples from each of the tasks the model needs to perform. A good example is FLAN-T5, the instruction-fine-tuned version of the T5 LLM. A sketch of how such a dataset might be assembled appears below.
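A hypothetical sketch of assembling a multi-task instruction dataset; the template names and record fields are made up for illustration:

```python
import random

# One instruction template per task, filled from raw records.
TEMPLATES = {
    "classification": "Classify the sentiment of the following review:\n{text}",
    "summarization":  "Summarize the following article:\n{text}",
    "generation":     "Continue the following story:\n{text}",
}

def build_examples(records):
    """records: list of dicts with 'task', 'text', and 'label' keys (assumed)."""
    examples = [
        {"prompt": TEMPLATES[r["task"]].format(text=r["text"]),
         "target": r["label"]}
        for r in records
    ]
    random.shuffle(examples)  # interleave tasks so no single task dominates
    return examples
```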
MODEL EVALUATION (LLM)
Different evaluation metrics are used for different tasks when it comes to LLMs: the ROUGE score for text summarization and the BLEU score for text translation. Some important evaluation metrics include:
ROUGE-1 Recall = (unigram matches/unigrams in references)
ROUGE-1 Precision = (unigram matches/unigrams in output)
ROUGE-1 F1 = 2*((precision*recall)/(precision+recall))
ROUGE-2 Recall = (bigram matches/bigrams in references)
ROUGE-2 Precision = (bigram matches/bigrams in output)
ROUGE-2 F1 = 2*((precision*recall)/(precision+recall))
The ROUGE-2 scores provide a more accurate description of the model's performance, and the recall and precision scores are oftentimes lower than the corresponding ROUGE-1 scores.
Another approach, ROUGE-L, identifies the longest common subsequence between the original text and the generated output and calculates the performance metrics accordingly.
The drawback of using the ROUGE score is that at times the output from the model is entirely disparate from the original text, yet the scores calculated using the ROUGE formulas will not reflect the disparity. Let us illustrate the scenario using the example below:
Original text: It is cold outside.
Model Response: Cold cold cold cold.
Even though it is conspicuous that the model is performing poorly, the precision score as per ROUGE-1 will be 1, i.e. (4/4), because the matching of the output against the reference is not positional.
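The pathology is easy to reproduce in a few lines of Python; this is a simplified, unclipped ROUGE-1, not a reference implementation:

```python
from collections import Counter

def rouge1(reference: str, output: str):
    ref = reference.lower().replace(".", "").split()
    out = output.lower().replace(".", "").split()
    # Unclipped, non-positional matching: each output word counts as a
    # match if it appears anywhere in the reference.
    precision = sum(1 for w in out if w in ref) / len(out)
    recall = sum(1 for w in ref if w in out) / len(ref)
    return precision, recall

print(rouge1("It is cold outside.", "Cold cold cold cold."))
# -> (1.0, 0.25): precision is a perfect 4/4 despite the useless output
```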
The BLEU score, on the other hand, is calculated by combining (as a geometric mean) the precision scores across a range of n-gram sizes, together with a brevity penalty for outputs that are too short.
PARAMETER EFFICIENT FINE-TUNING
The benefit of parameter-efficient fine-tuning is that, unlike full fine-tuning where all of the model's parameters are updated, PEFT modifies only a small number of parameters, which saves a great deal of memory. Some techniques update a subset of the existing model parameters (selective techniques), while others introduce new parameters and only train those (additive techniques). Still others reparametrize the model weights using a low-rank representation (LoRA). Full fine-tuning, by contrast, creates a completely new copy of the model for each task, which can create a storage problem.
LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS (LoRA)
Let us revisit Transformers and summarize the overall process and architecture. Tokens are generated from the input text, and these tokens are converted into embeddings. The embeddings are passed to the encoder and decoder parts of the Transformer, both of which contain self-attention and feed-forward neural network layers whose weights are learnt during pretraining. LoRA provides an efficient way to train model parameters: first, all the pretrained parameter weights are frozen and two low-rank decomposition matrices are injected alongside them; then only the weights of these smaller matrices are trained.
Applying this approach to the Transformer described in the paper “Attention Is All You Need”, we get the following results:
Dimensions of Transformer Weights: d x k = 512 x 64 = 32,768 parameters
In LoRA with rank r = 8:
A has dimension r x k = 8 x 64 = 512 parameters
B has dimension d x r = 512 x 8 = 4,096 parameters
In total, only 512 + 4,096 = 4,608 parameters are trained, roughly 86% fewer than the original 32,768.
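A minimal PyTorch sketch of such a layer, using the dimensions from the example above; the class is illustrative, not the LoRA authors' implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int = 512, k: int = 64, r: int = 8):
        super().__init__()
        # Pretrained weight W is frozen; only A and B are trained.
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k = 512 params
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r = 4,096 params

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (W + BA) x: the frozen path plus the trainable low-rank update
        return x @ (self.W + self.B @ self.A).T

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4,608 trainable vs. 32,768 frozen parameters
```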
SOFT PROMPTS
Soft prompts are additional trainable embedding vectors that are prepended to the input embeddings. These soft prompts are optimized over time while all other parameters remain frozen, in stark contrast to the full fine-tuning approach, where all model parameters are trained and updated during backpropagation. For smaller LLMs, full fine-tuning performs significantly better than soft-prompt tuning; however, as the size of the model increases, we observe a drastic improvement in the soft-prompt approach.
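A minimal PyTorch sketch of the idea; the class and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int = 20, d_model: int = 512):
        super().__init__()
        # The only trainable parameters: n_tokens x d_model virtual tokens.
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.01)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model), produced by the frozen model
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # prepend soft prompt
```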