Large Language Models II: Attention, Transformers and LLMs
This is the second part of the Language Model post series, covering the Transformer, attention, and the architecture of many modern LLMs such as Mistral and Llama-2. Check out the first part, Language Model in NLP, for background.
The Transformer architecture was first proposed in the seminal paper Attention Is All You Need by Vaswani et al. Built on self-attention, it ushered language processing into a new era, with leaps-and-bounds progress on many language tasks. Self-attention remains the basic building block of state-of-the-art models such as GPT-4, Mixtral, and Llama-2.
My personal journey with attention mechanisms began at Passage AI in early 2018, where we harnessed this technology for Question-Answering from a knowledge base. We developed a bi-directional attention flow model for machine reading comprehension, enabling a virtual agent (or a customer service chatbot) to answer queries from knowledge base articles. Here is the blog post with more details.
So what is a transformer network?
The Transformer network relies on the self-attention mechanism. Self-attention allows the network to attend to different parts of the input token sequence, and the Transformer processes the entire input sequence in parallel. It consists of two parts: an encoder and a decoder. The encoder block (the layers on the left in the diagram above) takes the input sequence and generates a representation of it using multiple stacked attention layers. The decoder block takes these representations and generates the output sequence.
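To make the encoder-decoder structure concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module. The dimensions (d_model=512, 8 heads, 6 encoder/decoder layers) follow the base configuration from the original paper; the random inputs and batch size are purely illustrative.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder Transformer sketch (illustrative, not a full training setup).
model = nn.Transformer(
    d_model=512,            # embedding size for each token
    nhead=8,                # number of attention heads per layer
    num_encoder_layers=6,   # encoder stack
    num_decoder_layers=6,   # decoder stack
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model): already-embedded input tokens
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model): already-embedded output tokens
out = model(src, tgt)         # (2, 7, 512): one representation per output position
print(out.shape)
```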
What is self-attention?
Self-attention allows a network to attend to different parts of the input sequence. The input vector X is linearly projected into three vectors, Q (query), K (key), and V (value), using weight matrices W_Q, W_K, and W_V that are learned during training. The self-attention output Z is computed as a function of Q, K, and V. This is how the self-attention network learns to give different weights to different parts of the input.
Z = softmax(Q * K^T / sqrt(d_k)) * V

where d_k is the dimension of the key vectors.
Z is also referred to as an attention head.
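Here is a minimal PyTorch sketch of the formula above. The projection matrices are random purely for illustration; in a real model they are the learned W_Q, W_K, W_V.

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: Z = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # project input into queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5        # (seq_len, seq_len): similarity of every pair of tokens
    weights = F.softmax(scores, dim=-1)  # each row sums to 1: how much token i attends to token j
    return weights @ V                   # Z: one attention head, shape (seq_len, d_v)

# Illustrative sizes: 5 tokens, model dimension 16, head dimension 8.
X = torch.rand(5, 16)
W_Q, W_K, W_V = [torch.rand(16, 8) for _ in range(3)]
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)  # torch.Size([5, 8])
```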
What is multi-head attention?
The multi-head attention mechanism allows the model to learn different representations of the input sequence and to attend to different positions in each representation. Multi-head attention is computed by concatenating the outputs of the individual attention heads and applying a linear transformation with an output weight matrix W_O:
MultiHead(Q, K, V) = Concat(Z_1, Z_2, …, Z_h) * W_O
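Here is a minimal sketch of the multi-head computation above, with illustrative sizes (4 heads over a 16-dimensional model) and random weights standing in for the learned per-head projections and W_O:

```python
import torch
import torch.nn.functional as F

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

X = torch.rand(seq_len, d_model)
W_Q = torch.rand(n_heads, d_model, d_head)   # one projection triple per head
W_K = torch.rand(n_heads, d_model, d_head)
W_V = torch.rand(n_heads, d_model, d_head)
W_O = torch.rand(n_heads * d_head, d_model)  # output projection shared across heads

heads = []
for h in range(n_heads):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
    weights = F.softmax(Q @ K.T / d_head ** 0.5, dim=-1)  # scaled dot-product attention
    heads.append(weights @ V)                             # Z_h, shape (seq_len, d_head)

multi_head = torch.cat(heads, dim=-1) @ W_O  # Concat(Z_1..Z_h) * W_O -> (seq_len, d_model)
print(multi_head.shape)
```

In practice this whole computation is packaged as a single module, e.g. PyTorch's nn.MultiheadAttention.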
Using the Transformer architecture, Google released the language representation model BERT in 2018.
What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained language model that learns bidirectional representations from unlabeled text by jointly conditioning on both the left and right context of the input. With BERT, one needs only an additional output layer to create models for many language tasks such as text classification and question-answering.
Pre-training: BERT was pre-trained on two tasks, masked language modeling and next sentence prediction. Masked language modeling (MLM) randomly masks tokens in the input, and the objective is to predict the original tokens from the surrounding context. Next sentence prediction (NSP) takes sentence pairs, and the objective is to predict whether the second sentence follows the first.
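To see the MLM objective in action, here is a quick sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example sentence is mine.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Find the masked position and take the highest-scoring token for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))  # expected to print something like "paris"
```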
Fine-tuning for a task: provide training data for the task and train all the BERT layers on that input. Fine-tuning requires relatively little data and works well. Alternatively, one can use the pre-trained BERT weights as embeddings and train only an output layer for the task (e.g. text classification, question-answering).
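As a sketch of the fine-tuning path (again assuming the Hugging Face transformers library), BertForSequenceClassification adds exactly that single output layer on top of the pre-trained encoder, and one backward pass updates all BERT layers; the two-example batch and labels below are toy data.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great product!", "terrible support"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])            # toy sentiment labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # returns loss and logits when labels are provided
outputs.loss.backward()                  # gradients flow through the new head and all BERT layers
optimizer.step()
```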
Architecture: BERT uses a multi-layer bidirectional Transformer encoder. The Transformer architecture captures dependencies between words, which helps create contextual embeddings of words/tokens. Here is the BERT model architecture.
In our quest for innovation at ServiceNow, we embraced the power of the BERT architecture to create a sophisticated enterprise language model that has been integral in enhancing the Natural Language Understanding (NLU) capabilities of our Virtual Agent and the Question-Answering capability in Search since its inception in early 2021. Here is the blog post describing the use case and our journey in enterprise service automation in detail.
GPT*
The GPT series of models are auto-regressive (they only look at tokens to the left in the sequence) and use a decoder-only Transformer architecture. GPT* models use next-word prediction as the training task in the pre-training step. Here is the GPT-2 model architecture:
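The "only look left" behavior comes from a causal attention mask applied before the softmax. Here is a minimal sketch with an illustrative sequence length of 5 and random stand-in scores:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # position i sees positions <= i

scores = torch.rand(seq_len, seq_len)                     # stand-in attention scores (Q K^T / sqrt(d_k))
scores = scores.masked_fill(~causal_mask, float("-inf"))  # future positions get -inf ...
weights = torch.softmax(scores, dim=-1)                   # ... so their attention weights become 0
print(weights)  # lower-triangular: each token attends only to its left context
```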
Llama-2
Llama-2 is also an autoregressive, decoder-only LLM, from Meta. Llama2-7B uses 32 attention heads in each layer, with 32 layers of attention blocks. Here is the Llama2-7B model architecture:
Llama-3
Llama-3 models come in four distinct variants, each supporting an 8k-token context length (up from 4k) and a 128k-token vocabulary (up from 32k). Architecture: Llama-3-8B mirrors its predecessor Llama2-7B but with enhanced capabilities. It features 32 attention heads across 32 layers of attention blocks (the same as Llama2-7B). Here is the architecture:
Mistral
Mistral is another strong open-source LLM, with 7B parameters. The Mistral architecture is similar to Llama-2's and also uses 32 layers with 32 attention heads in each layer. Here is the Mistral architecture:
Mixtral
Mixtral uses a mixture of 8 experts on top of 32 layers of attention blocks. A routing gate picks 2 experts per token at inference time, which helps keep latency low. Here is the model architecture of Mixtral:
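Here is a much-simplified sketch of top-2 expert routing for a single token; the tiny linear "experts" and gate below are stand-ins for Mixtral's learned feed-forward experts and router, and all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

n_experts, d_model, top_k = 8, 16, 2
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]  # stand-in expert FFNs
gate = torch.nn.Linear(d_model, n_experts)                               # routing layer

x = torch.rand(d_model)                  # one token's hidden state
top_vals, top_idx = gate(x).topk(top_k)  # router picks the 2 best experts for this token
weights = F.softmax(top_vals, dim=-1)    # normalize only over the chosen experts

# Only 2 of the 8 experts actually run, so per-token compute (and latency) stays close
# to a dense model of one expert's size, even though total parameters are much larger.
y = sum(w * experts[i](x) for w, i in zip(weights, top_idx.tolist()))
print(y.shape)  # torch.Size([16])
```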
Phi-3
Microsoft launched the Phi-3 model in three distinct variants, including the compact Phi-3 mini. This smallest version has 3.8 billion parameters, and with 4-bit quantization it consumes less than 2GB of memory. This makes it feasible for use on a wide range of devices, even smartphones! In terms of architecture, the Phi-3 mini shares similarities with Llama-3-8B, featuring 32 attention heads and 32 hidden layers. It comes in variants supporting context lengths of 4k and 128k tokens, with a vocabulary size of 32k.
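The memory claim is easy to sanity-check with back-of-the-envelope arithmetic (ignoring quantization overhead such as scales, plus activations and the KV cache):

```python
params = 3.8e9                          # Phi-3 mini parameter count
bits_per_param = 4                      # 4-bit quantized weights
weight_bytes = params * bits_per_param / 8
print(f"{weight_bytes / 1e9:.2f} GB")   # ~1.90 GB, consistent with "less than 2GB"
```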
In summary, this post has provided an exploration of attention mechanisms and transformer architecture, essential components in the world of modern Large Language Models (LLMs). We have examined the architectures of innovative models like Mixtral, Mistral-7B, Llama-2, GPT-2, and BERT, showcasing how the attention mechanism is a pivotal element in these advanced systems.
This exploration underscores the profound impact that attention mechanisms and transformers have in shaping the future of NLP and AI. As we continue to witness and contribute to the evolution of these technologies, it's clear that they remain at the forefront of AI research and development, driving forward the capabilities of language understanding and generation.
Updated on April 22, 2024.