Understanding Transformers in Natural Language Processing
Sarthak Pattnaik
Senior Software Engineer at HCLTech | MS Applied Data Analytics at Boston University
The T in ChatGPT, GPT-3, and GPT-4 stands for Transformer. A Transformer is an attention-based sequence-to-sequence encoder-decoder architecture that can capture long-range dependencies and contextual information. A Transformer analyses the relationship between every pair of tokens (in the case of text, tokens are words or sub-words); this mechanism is referred to as attention.
HOW DOES A TRANSFORMER WORK?
The working of a Transformer involves a few sequential operations. First, it converts the text into a sequence of numbers so that the model deals with continuous rather than discrete data points. This process is referred to as vector embedding, and there are a multitude of techniques for it, a few prominent ones being character embedding and word embedding. Transformers have no inherent understanding of the position of words in a sentence, so they combine the embeddings with sinusoidal positional encodings before passing the information onward. The embedded and position-encoded input then proceeds through multiple self-attention layers and feed-forward neural networks, which capture dependencies in the input and add expressive power to the model through complex transformations. The output of the encoder then goes to the decoder, where each decoder block has self-attention and encoder-decoder attention sublayers to understand the context of a specific sentence. Finally, because the attention dot products can produce results between negative and positive infinity, the decoder output is projected to the size of the vocabulary and a softmax function is applied to turn it into a probability distribution.
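Below is a minimal NumPy sketch of the sinusoidal positional encoding described above; the function name and shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings:
# embeddings = token_embeddings + positional_encoding(seq_len, d_model)
```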
The Transformer consists of two broad components: the encoder and the decoder (see Figure 1).
Each part of the Transformer can be used independently based on the task. Encoder models use only the encoder part of the Transformer and are best suited for tasks that require a complete understanding of a sentence. The pretraining of such a model involves corrupting a sentence (for example, by masking some of its words) and training the model to repair it. On the other hand, decoder models are best for use cases that deal with text generation; the model is trained to complete a sentence by predicting the next word. Language models that use both encoders and decoders are referred to as sequence-to-sequence models.
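As an illustration, the Hugging Face transformers library exposes both kinds of models through its pipeline API; the checkpoints below are just common public examples:

```python
from transformers import pipeline

# Encoder-only (BERT-style): pretrained by masking words and asking the
# model to reconstruct the corrupted sentence -- suited to understanding tasks.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transformers capture [MASK] dependencies in text."))

# Decoder-only (GPT-style): pretrained to predict the next word -- suited
# to text generation.
generate = pipeline("text-generation", model="gpt2")
print(generate("A Transformer is", max_new_tokens=20))
```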
TRAINING A TRANSFORMER
Transformer models are trained on large amounts of data in a self-supervised way, where they learn without humans ascribing labels to the data. During this training phase the model inculcates a statistical understanding of the text data on which it is trained. The result generated from the model alone is often underwhelming, so the model usually goes through a subsequent phase of learning called transfer learning, where humans annotate the training data with labels. The first phase of self-supervised learning is also called pretraining, where a model is trained from scratch, with randomly initialized weights, on gigantic amounts of data. This training phase takes several weeks and requires tremendous amounts of computing power, since the network is trained over an enormous number of steps and hyperparameters must be tuned to find the configuration that provides the best results. Therefore, rather than training models from scratch, fine-tuning an already pretrained model is encouraged, since one can leverage the knowledge the model gained through initial training and build on top of it.
MODEL FINE-TUNING
Fine-tuning a Large Language Model is a much simpler process than pretraining, which involves self-supervised learning over multiple epochs on a dataset with a humongous number of records. Model fine-tuning, in contrast, is performed on a smaller dataset with labels, which makes it a supervised learning process. There are numerous ways to perform fine-tuning. One of them is to use prompts to instruct the model on what exactly needs to be done; this is called instruction fine-tuning.
Curating a fine-tuning dataset involves transforming a raw dataset using sample prompts with instruction templates for specific tasks such as classification, text generation, and text summarization. Once the dataset is ready, it is split into training, validation, and test sets. The training data is passed to the pretrained model, and the labels the model generates are compared with the labels in the original dataset. Since the labels are encodings, the generated values can be compared with the original ones using the cross-entropy loss function, and the model weights can be updated through backpropagation. This process is repeated over many epochs to find the weights that provide the best accuracy. The validation set is then used with the hold-out validation method to calculate the validation accuracy, and a final accuracy check is performed using the test data.
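A minimal sketch of this loop, assuming PyTorch; here model, train_loader, and val_loader are hypothetical placeholders, and the model is assumed to return logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):                      # repeat over several epochs
    model.train()
    for batch in train_loader:
        logits = model(batch["input_ids"])  # (batch, seq, vocab) -- assumed
        loss = F.cross_entropy(             # compare generated labels
            logits.view(-1, logits.size(-1)),
            batch["labels"].view(-1),       # against the reference labels
        )
        loss.backward()                     # backpropagation updates weights
        optimizer.step()
        optimizer.zero_grad()

    # hold-out validation: accuracy on data the model never trained on
    model.eval()
    with torch.no_grad():
        correct = total = 0
        for batch in val_loader:
            preds = model(batch["input_ids"]).argmax(dim=-1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].numel()
        print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```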
CATASTROPHIC FORGETTING
It is often recommended to avoid fine-tuning a large language model on a single task, as this leads to catastrophic forgetting. Catastrophic forgetting occurs when a model is fine-tuned to perform exceptionally well on a single task but its performance on other tasks consequently plummets. For example, a model that performs sentiment analysis exceptionally well but underperforms on sentence completion is still undesirable. One solution is to fine-tune the model on multiple tasks at the same time; another is to consider Parameter-Efficient Fine-Tuning (PEFT).
MULTI-TASK FINE TUNING
Multi-task fine-tuning uses multiple prompts spanning an eclectic set of tasks that encapsulate all the scenarios the model must handle. It requires a considerably large dataset with multiple examples from each of the tasks the model needs to perform. A good example is FLAN-T5, the instruction-fine-tuned version of the T5 LLM. A sketch of how such a dataset might be assembled appears below.
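A hypothetical sketch of assembling a multi-task instruction dataset; the template names and record fields are made up for illustration:

```python
import random

# One instruction template per task, filled from raw records.
TEMPLATES = {
    "classification": "Classify the sentiment of the following review:\n{text}",
    "summarization":  "Summarize the following article:\n{text}",
    "generation":     "Continue the following story:\n{text}",
}

def build_examples(records):
    """records: list of dicts with 'task', 'text', and 'label' keys (assumed)."""
    examples = [
        {"prompt": TEMPLATES[r["task"]].format(text=r["text"]),
         "target": r["label"]}
        for r in records
    ]
    random.shuffle(examples)  # interleave tasks so no single task dominates
    return examples
```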
MODEL EVALUATION (LLM)
Different evaluation metrics are used for different tasks when it comes to LLMs: the ROUGE score for text summarization and the BLEU score for text translation. Some important evaluation metrics include:
ROUGE-1 Recall = (unigram matches/unigrams in references)
ROUGE-1 Precision = (unigram matches/unigrams in output)
ROUGE-1 F1 = 2*((precision*recall)/(precision+recall))
ROUGE-2 Recall = (bigram matches/bigrams in references)
ROUGE-2 Precision = (bigram matches/bigrams in output)
ROUGE-2 F1 = 2*((precision*recall)/(precision+recall))
The ROUGE-2 scores provide a more accurate description of the model's performance, and the recall and precision scores are oftentimes lower than the corresponding ROUGE-1 scores.
Another approach, ROUGE-L, identifies the longest common subsequence between the original text and the generated output and calculates the performance metrics accordingly.
The drawback of using the ROUGE score is that at times the output from the model is entirely disparate from the original text, yet the scores calculated using the ROUGE formulas will not reflect the disparity. Let us illustrate the scenario using the example below:
Original text: It is cold outside.
Model Response: Cold cold cold cold.
Even though it is conspicuous that the model is performing poorly, the precision score as per ROUGE-1 will be 1, i.e. (4/4), because the matching of the output against the reference is not positional.
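The pathology is easy to reproduce in a few lines of Python; this is a simplified, unclipped ROUGE-1, not a reference implementation:

```python
from collections import Counter

def rouge1(reference: str, output: str):
    ref = reference.lower().replace(".", "").split()
    out = output.lower().replace(".", "").split()
    # Unclipped, non-positional matching: each output word counts as a
    # match if it appears anywhere in the reference.
    precision = sum(1 for w in out if w in ref) / len(out)
    recall = sum(1 for w in ref if w in out) / len(ref)
    return precision, recall

print(rouge1("It is cold outside.", "Cold cold cold cold."))
# -> (1.0, 0.25): precision is a perfect 4/4 despite the useless output
```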
The BLEU score, on the other hand, is calculated by combining (as a geometric mean) the precision scores across a range of n-gram sizes, together with a brevity penalty for outputs that are too short.
PARAMETER EFFICIENT FINE-TUNING
The benefit of parameter-efficient fine-tuning is that, unlike full fine-tuning where all of the model's parameters are updated, PEFT modifies only a small number of parameters, which saves a great deal of memory. Some techniques update a subset of the existing model parameters (selective techniques), while others introduce new parameters and only train those (additive techniques). Still others reparametrize the model weights using a low-rank representation (LoRA). Full fine-tuning, by contrast, creates a completely new copy of the model for each task, which can create a storage problem.
LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS (LoRA)
Let us revisit Transformers and summarize the overall process and architecture. Tokens are generated from the input text, and these tokens are converted into embeddings. The embeddings are passed to the encoder and decoder parts of the Transformer, both of which contain self-attention and feed-forward neural network layers whose weights are learnt during pretraining. LoRA provides an efficient way to train model parameters: first, all the pretrained parameter weights are frozen and two low-rank decomposition matrices are injected alongside them; then only the weights of these smaller matrices are trained.
Applying this approach to the Transformer described in the paper “Attention Is All You Need”, we get the following results:
Dimensions of Transformer Weights: d x k = 512 x 64 = 32,768 parameters
In LoRA with rank r = 8:
A has dimension r x k = 8 x 64 = 512 parameters
B has dimension d x r = 512 x 8 = 4,096 parameters
In total, only 512 + 4,096 = 4,608 parameters are trained, roughly 86% fewer than the original 32,768.
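A minimal PyTorch sketch of such a layer, using the dimensions from the example above; the class is illustrative, not the LoRA authors' implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int = 512, k: int = 64, r: int = 8):
        super().__init__()
        # Pretrained weight W is frozen; only A and B are trained.
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k = 512 params
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r = 4,096 params

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (W + BA) x: the frozen path plus the trainable low-rank update
        return x @ (self.W + self.B @ self.A).T

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4,608 trainable vs. 32,768 frozen parameters
```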
SOFT PROMPTS
Soft prompts are additional trainable embedding vectors that are prepended to the input embeddings. These soft prompts are optimized over time while all other parameters remain frozen, in stark contrast to the full fine-tuning approach, where all model parameters are trained and updated during backpropagation. For smaller LLMs, full fine-tuning performs significantly better than soft-prompt tuning; however, as the size of the model increases, we observe a drastic improvement in the soft-prompt approach.
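A minimal PyTorch sketch of the idea; the class and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int = 20, d_model: int = 512):
        super().__init__()
        # The only trainable parameters: n_tokens x d_model virtual tokens.
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.01)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model), produced by the frozen model
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # prepend soft prompt
```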