Large Language Models II: Attention, Transformers and LLMs
Evolutionary tree of LLMs (from “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond” by Yang et al.)

This is the second part of the Language Model post series. It covers attention, the Transformer, and the architecture of several modern LLMs such as Mistral and Llama-2. For background, see the first part of the series, Language Model in NLP.

The Transformer architecture was first proposed in the seminal paper Attention Is All You Need by Vaswani et al. Built on self-attention, it ushered language processing into a new era, with rapid progress on many language tasks. Self-attention remains the basic building block of many state-of-the-art models, such as GPT-4, Mixtral, and Llama-2.

My personal journey with attention mechanisms began at Passage AI in early 2018, where we harnessed this technology for Question-Answering from a knowledge base. We developed a bi-directional attention flow model for machine reading comprehension, enabling a virtual agent (or a customer service chatbot) to answer queries from knowledge base articles. Here is the blog post with more details.

So what is a transformer network?

Transformer Architecture (Encoder block is on the left and Decoder block is on the right)

A Transformer network relies on the self-attention mechanism, which lets the network attend to different parts of the input sequence of tokens. Unlike a recurrent network, a Transformer processes the entire input sequence in parallel. It consists of two parts: an encoder and a decoder. The encoder block (the layers on the left in the diagram above) takes the input sequence and generates a representation of it using multiple layers of attention blocks. The decoder block takes these representations and generates the output sequence.

What is self-attention?

Self-attention allows a network to attend to different parts of the input sequence. The input X (one embedding vector per token) is linearly projected into three matrices, the queries Q, the keys K, and the values V, using weight matrices W_Q, W_K, and W_V that are learned during training. The self-attention output Z is then computed from Q, K, and V. This way the self-attention network learns to give different weights to different parts of the input.

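Z = softmax(Q * K^T / sqrt(d_k)) * V

Here K^T is the transpose of K, and the normalization term sqrt(d_k) is the square root of the key dimension d_k.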

Z is also referred to as an attention head.

Code snippet for attention head (without mask and dropout)
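
As an illustration of the formula above, here is a minimal sketch of a single attention head without mask and dropout. It is written in plain Python so that it runs without dependencies; a real model would use a tensor library. The helper names (matmul, softmax, attention_head) and the toy weights are assumptions made for this example, not code from any particular library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head (no mask, no dropout).

    x is (seq_len, d_model); w_q, w_k, w_v are (d_model, d_k).
    Returns (z, weights): z is (seq_len, d_k), weights is (seq_len, seq_len).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]      # transpose of K
    scores = matmul(q, k_t)                   # one score per (query, key) pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 tokens, d_model = 4, d_k = 2. All numbers are made up.
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
z, weights = attention_head(x, w_q, w_k, w_v)
```

Each row of weights sums to 1 and says how much the corresponding token attends to every token in the sequence. Z is the weighted average of the value vectors.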

What is multi-head attention?

The multi-head attention mechanism runs several attention heads in parallel. This allows the model to learn different representations of the input sequence (of text) and to attend to different positions in each representation. Multi-head attention is computed by concatenating the output of each attention head and applying a linear transformation with an output weight matrix W_O:

MultiHead(Q, K, V) = Concat(Z_1, Z_2, …, Z_h) * W_O
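
The concatenation step can be sketched as follows, again in plain Python and with randomly initialized weights standing in for learned ones. The function names, the sizes (5 tokens, d_model = 8, h = 2 heads), and the choice d_k = d_model / h are assumptions made for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """One head: softmax(Q * K^T / sqrt(d_k)) * V, without mask or dropout."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples; w_o is (h * d_k, d_model)."""
    outputs = [attention_head(x, *weights) for weights in heads]   # Z_1 ... Z_h
    # Concatenate the head outputs token by token: (seq_len, h * d_k).
    concat = [[value for z in outputs for value in z[i]] for i in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, h = 5, 8, 2
d_k = d_model // h
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)
out = multi_head_attention(x, heads, w_o)
```

The output has the same shape as the input, (seq_len, d_model), which is what lets attention blocks be stacked layer after layer.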

Multi-Head Attention explanation by Jay Alammar in The Illustrated Transformer

Using the Transformer architecture, Google released the language representation model BERT in 2018.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained language model that learns bidirectional representations from unlabeled text by jointly conditioning on both the left and the right context of each token. With BERT, one needs only an additional output layer to create models for many language tasks such as text classification and question-answering.

Pre-training: BERT was pre-trained using two tasks, masked language modeling and next sentence prediction. Masked language modeling (MLM) randomly masks some of the input tokens, and the objective is to predict the original tokens from the surrounding context. Next sentence prediction uses sentence pairs, and the objective is to predict whether the second sentence follows the first.
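
To make the masking step concrete, here is a small sketch of how MLM training examples can be built. The 15% selection rate and the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follow the BERT paper and are not stated in this post. The function name, the toy sentence, and the per-token coin flip are simplifying assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # token is not selected
        labels[i] = token                     # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK                  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

rng = random.Random(0)
tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))
# A higher mask_prob than 0.15 so that a short toy sentence gets some masks.
masked, labels = mask_tokens(tokens, vocab, rng, mask_prob=0.3)
```

The loss is computed only at positions where labels is not None, so the model cannot simply copy its input.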

Fine-tuning for a task: provide training data for the task and train all the BERT layers on it. Because the model is already pre-trained, fine-tuning requires comparatively little data and works well. Alternatively, one can keep the pre-trained BERT weights fixed, use them to produce embeddings, and train only an output neural layer for the task (e.g. text classification, QnA, etc.).

Architecture: BERT uses a multi-layer bidirectional Transformer encoder. The Transformer architecture allows it to capture dependencies between words, which helps create contextual embeddings of words/tokens. Here is the BERT model architecture.

BERT Model Architecture

In our quest for innovation at ServiceNow, we embraced the power of BERT architecture to create a sophisticated enterprise language model that has been integral in enhancing the Natural Language Understanding (NLU) capabilities of our Virtual Agent and Question-Answering capability in Search since its inception in early 2021. Here is the blog post describing the use case and our journey in detail for enterprise service automation.

GPT*

The GPT series of models are auto-regressive (each position only looks at the tokens to its left in the sequence) and use a decoder-only Transformer architecture. GPT* models use next-word prediction as the training task in the pre-training step. Here is the GPT-2 model architecture:
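
The "only look to the left" rule is usually enforced with a causal mask on the attention scores. Below is a minimal sketch, with function names chosen for this example, showing both the masked softmax and the (context, next token) pairs that next-word prediction trains on.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of a (seq_len, seq_len) score matrix under a causal mask.

    Position i may attend only to positions j <= i; later positions get weight 0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                       # tokens at or before position i
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)          # masked-out future positions
        weights.append([e / total for e in exps] + hidden)
    return weights

def next_token_pairs(tokens):
    """Training pairs for next-word prediction: (context so far, token to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Uniform scores make the effect of the mask easy to read off.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
pairs = next_token_pairs(["The", "cat", "sat", "down"])
```

With uniform scores, the first token attends only to itself, the second splits its attention evenly over two tokens, and so on. No position ever receives weight from a token to its right.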

GPT-2 Model Architecture

Llama-2

Llama-2 is also an autoregressive, decoder-only LLM, released by Meta. Llama2-7B has 32 layers of attention blocks with 32 attention heads in each layer. Here is the Llama2-7B model architecture:

Llama-2-7B Model Architecture


Llama-3

Llama-3 models come in four distinct variants, each supporting an 8k-token context length (up from 4k) and a 128k-token vocabulary (up from 32k). Architecture: Llama-3-8B mirrors its predecessor Llama2-7B, with 32 layers of attention blocks and 32 attention heads in each layer. Here is the architecture:

Llama-3-8B Model Architecture


Mistral

Mistral is another good open-source LLM with 7B parameters. The Mistral architecture is similar to Llama-2 and also has 32 layers with 32 attention heads in each layer. Here is the Mistral architecture:

Mistral-7B Model Architecture

Mixtral

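Mixtral is a sparse mixture-of-experts model with 32 layers of attention blocks. The feed-forward block in each layer is made up of 8 experts, and a routing (gate) layer picks 2 of them for each token at inference time. Because only a fraction of the parameters is active for any given token, this helps reduce latency. Here is the model architecture of Mixtral: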
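
The routing step can be sketched as below. This is a simplified illustration and not Mixtral's actual code: the gate logits are passed in directly (in the real model a learned linear layer produces them from the token), and the toy experts just scale their input. Taking the softmax over only the two selected logits follows the description in the Mixtral paper.

```python
import math

def top2_moe(x, gate_logits, experts):
    """Route one token vector x through the 2 experts with the highest gate logits.

    The result is the sum of the two expert outputs, weighted by a softmax
    taken over only those two logits. Returns (output, chosen expert indices).
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    top2 = ranked[:2]
    peak = max(gate_logits[i] for i in top2)
    exps = [math.exp(gate_logits[i] - peak) for i in top2]
    total = sum(exps)
    gate_weights = [e / total for e in exps]

    out = [0.0] * len(x)
    for weight, index in zip(gate_weights, top2):
        y = experts[index](x)                 # only the 2 chosen experts are run
        out = [o + weight * v for o, v in zip(out, y)]
    return out, top2

# Eight toy "experts": expert i simply scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * a for a in v] for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
out, chosen = top2_moe([1.0, -2.0], gate_logits, experts)
```

The other six experts are never evaluated for this token, which is where the compute saving comes from even though all eight sets of weights stay in memory.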

Mixtral Model Architecture

Phi-3

Microsoft launched the Phi-3 model in three distinct variants, including the compact Phi-3 mini. This smaller model has 3.8 billion parameters, and with 4-bit quantization it consumes less than 2GB of memory. This makes it feasible to run on a wide range of devices, even smartphones! In terms of architecture, the Phi-3 mini shares similarities with the Llama-3-8B, featuring 32 attention heads and 32 hidden layers. It is available in two versions, with context lengths of 4k and 128k tokens, and has a vocabulary size of 32k.
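
The memory figure follows from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate that assumes 1 GB = 10^9 bytes and counts only the weights, ignoring activations, the KV cache, and quantization metadata. The function name is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

phi3_mini_params = 3.8e9   # parameter count quoted in the text

for bits in (16, 8, 4):
    size = weight_memory_gb(phi3_mini_params, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

At 4 bits per parameter, 3.8 billion parameters take about 1.9 GB, which is consistent with the "less than 2GB" figure above. The same weights at 16 bits would need about 7.6 GB.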

Phi-3 Model Architecture

In summary, this post has provided an exploration of attention mechanisms and the Transformer architecture, essential components of modern Large Language Models (LLMs). We have examined the architectures of models such as BERT, GPT-2, Llama-2, Llama-3, Mistral-7B, Mixtral, and Phi-3, showing how the attention mechanism is a pivotal element in each of them.

This exploration underscores the profound impact that attention mechanisms and transformers have in shaping the future of NLP and AI. As we continue to witness and contribute to the evolution of these technologies, it's clear that they remain at the forefront of AI research and development, driving forward the capabilities of language understanding and generation.


Updated on April 22, 2024.
