Engineers Guide to AI - Decoding the Transformer Model

I've been working on writing my own transformer, and this series of blog posts is me trying to learn in public. I've found that writing (code or text) is a good way to test my understanding of a topic. Subscribe if you'd like to follow along; I think I have a pretty interesting series in mind.


Large language models are what everyone is talking about, but what really are they? The high-level technical answer is that they are the result of training a deep learning architecture called the "Transformer" on large amounts of text data. This training data can include everything from public websites to open source code. These models are trained to predict the next token (one or more characters) given the text that comes before it.

Human: When in Rome
AI: do as the cats do.        

To gain a more intuitive feel for transformers and language models like GPT, it can be helpful to draw analogies to concepts familiar to a software engineer. One way to do this is to think of a transformer model as a kind of "soft" key-value store. A traditional key-value store is designed for storing, retrieving, and managing associative arrays, also known as dictionaries or hash maps. You provide a 'key', and the system efficiently returns the corresponding 'value'.

database = {"dog": "barks", "cat": "meows", "cow": "moos"}        

Then, when you query this database with the key "dog", you get back "barks".

Now, imagine a "soft" key-value store, where you can ask not only for exact keys, but also keys that are "similar". Let's say you query this soft key-value store with "puppy". A traditional key-value store might return nothing because it does not have that exact key. But a soft key-value store might understand that "puppy" is similar to "dog" and return "barks". This kind of "soft" key-value association is the basis for more advanced language models.
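To make that concrete, here is a minimal sketch of what a "soft" lookup could look like. It gets slightly ahead of ourselves by using word vectors (introduced below), and the vectors themselves are made up purely for illustration; the point is only that the query is matched against every key by similarity rather than by exact equality.

import numpy as np

# Toy 2-D vectors for each key (made up purely for illustration)
key_vectors = {
    "dog": np.array([0.9, 0.1]),
    "cat": np.array([0.1, 0.9]),
    "cow": np.array([0.5, 0.5]),
}
values = {"dog": "barks", "cat": "meows", "cow": "moos"}

def cosine(a, b):
    # Cosine similarity: closer to 1.0 means "pointing the same way"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def soft_lookup(query_vector):
    # Score every key by how similar it is to the query...
    scores = {key: cosine(query_vector, vec) for key, vec in key_vectors.items()}
    # ...and return the value of the most similar key
    best_key = max(scores, key=scores.get)
    return values[best_key]

# "puppy" is not an exact key, but its (made-up) vector sits close to "dog"
puppy_vector = np.array([0.85, 0.15])
print(soft_lookup(puppy_vector))  # -> barks

Real attention goes one step further: instead of returning the single closest key's value, it blends the values of all keys, weighted by their similarity scores.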

Now, how does this relate to transformer models and GPT?

GPT is an example of a transformer model. Transformers handle data (like text) in a way that's similar to this soft key-value concept. Each word in a sentence is converted into a dense vector (a list of numbers), often referred to as an embedding. These vectors capture the semantics of the words. For example, the vectors for "dog" and "puppy" would be closer together than the vectors for "dog" and "grapefruit", reflecting that "dog" and "puppy" have similar meanings.

Transformers take this a step further and learn context-specific embeddings. Instead of having one vector per word, no matter its context, transformer models generate a different vector for each word depending on the other words in the sentence. In other words, the "value" of a word (its vector representation) depends not only on its "key" (the word itself) but also on the surrounding words.

In the field of natural language processing (NLP), an "embedding" is a learned (created during training) representation for text where words that have the same meaning have a similar representation.

In the case of transformers, the soft key-value store is extended into a self-attention mechanism. The self-attention mechanism in the transformer model allows it to pay different amounts of "attention" to different words in the input when generating the output. The attention mechanism is like a soft key-value store where the keys and values are the words in the sentence, but the amount of attention paid to each word can vary.

For example, given the sentence "The cat sat on the mat", when generating a vector for the word "sat", the transformer might pay a lot of attention to "cat" and "mat", and less attention to "the" and "on".

I've introduced some new terms here, "attention" and "self-attention"; these are specific terms created to identify the core mechanism of the Transformer model. The famous Google paper, "Attention Is All You Need", helped popularize them.

At a high level, the attention mechanism is a way for neural networks to "pay attention" to certain pieces of information over others.

Self-attention is a specific type of attention mechanism and it's a key part of Transformer models.

In the context of processing a sentence, self-attention allows the model to look at other words in the sentence as it processes each individual word. In other words, it allows the model to "pay attention" to different words within the same sentence to gain a better understanding of the context.

For example, in the sentence "The cat, which already ate a fish, was not hungry", if the model is processing the word "was", self-attention allows it to associate "was" with "cat" (indicating who "was"), and with "already ate a fish" (indicating why it "was not hungry").

How Self-Attention Works

When processing a word, the model generates three vectors for it: a query vector, a key vector, and a value vector. Each is produced by multiplying the word's embedding with a separate weight matrix; these matrices start out random and are updated during training. This is what's meant by learning three different sets of weights.

  1. The query vector (Q) for a word represents the word in its current processing context.
  2. The key vectors (K) represent the other words in their roles as potential words to pay attention to.
  3. The value vectors (V) also represent the other words, but they're used in a different context. Once the model decides how much attention to pay to each word (using the query and key vectors), it uses the value vectors to compute its final representation of the current word.

The model computes the dot product of the query vector with every key vector, scales the results down by the square root of the vector dimension, and applies a softmax operation (to make the scores sum to 1) to get a probability distribution. This distribution shows how much each word should be "attended to" while encoding the current word.

This probability distribution is then used to create a weighted sum of the value vectors. The result is a new vector that's a combination of the other vectors in the sentence, weighted based on how much the model should "pay attention" to each word.
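Putting those two steps together, the whole computation can be written as a single expression; this is the scaled dot-product attention formula from the "Attention Is All You Need" paper:

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V

Here d_k is the dimensionality of the key vectors; dividing by its square root keeps the dot products from growing too large before the softmax.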

A Simple Example

Let's walk through a simplified example of how self-attention works, using Python.

First, let's initialize some simple word embeddings for a sentence:

import numpy as np

# Word embeddings (for simplicity, random and very low dimensional)
embeddings = {
    "I": np.array([0.1, 0.3]),
    "love": np.array([0.4, 0.2]),
    "cats": np.array([0.2, 0.1])
}

# Our sentence
sentence = ["I", "love", "cats"]

# Represent our sentence as a matrix, where each row is a word embedding
embedding_matrix = np.array([embeddings[word] for word in sentence])        

The embedding_matrix is a 3x2 matrix (because we have 3 words and each word is a 2D vector), which might look like:

[ 
  [0.1, 0.3],
  [0.4, 0.2], 
  [0.2, 0.1] 
]        

Next, we need to compute the Query, Key, and Value matrices. In a real transformer model, these are computed by multiplying the embedding matrix by learned weight matrices. But in this simple example, we'll just use the embedding matrix for each.

# For simplicity, our Query, Key, and Value matrices 
# are just the embedding matrix 

Q = embedding_matrix 
K = embedding_matrix 
V = embedding_matrix         
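For reference, here is roughly what the non-simplified version would look like: each of Q, K, and V comes from multiplying the embedding matrix with its own weight matrix. In a real model these matrices are learned during training; the random matrices below are just placeholders to show the shapes. The rest of this walkthrough sticks with the simplified version above.

# In a real model W_q, W_k, and W_v are learned weight matrices.
# Random placeholders are used here just to show the shapes involved.
d_model, d_k = 2, 2
W_q = np.random.rand(d_model, d_k)
W_k = np.random.rand(d_model, d_k)
W_v = np.random.rand(d_model, d_k)

# Each projection maps the (3 x 2) embedding matrix to a (3 x 2) matrix
Q_real = np.dot(embedding_matrix, W_q)
K_real = np.dot(embedding_matrix, W_k)
V_real = np.dot(embedding_matrix, W_v)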

Now we'll compute the attention scores. In self-attention, the attention score between two words is computed as the dot product of their embeddings, normalized by the square root of the dimensionality (2, in this case).

# Compute unnormalized attention scores (dot product of Q and K,
# along the last dimension), scaled by the square root of the
# embedding dimension (2, in this case)

# K.T is the transpose of K
d_k = K.shape[-1]
attention_unnormalized = np.dot(Q, K.T) / np.sqrt(d_k)

# Normalize the attention scores with a softmax operation
attention = np.exp(attention_unnormalized) / \
    np.sum(np.exp(attention_unnormalized), axis=-1, keepdims=True)

The attention matrix gives us the attention scores between every pair of words in the sentence. Each row corresponds to a word, and the jth column of the ith row gives the attention score between word i and word j. Higher scores mean word i should pay more attention to word j.
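With these particular toy embeddings the scores come out nearly uniform, because the vectors are small and point in broadly similar directions; with real, learned, high-dimensional embeddings the differences between rows are much sharper.

print(attention)
# Every row sums to 1. Here each word attends to the others almost
# equally (all weights land around 0.33), which is expected for such
# small, similar toy vectors.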

Finally, we compute the output of the self-attention operation, which is a weighted sum of the Value vectors, weighted by the attention scores:

# Compute weighted sum of value vectors 
output = np.dot(attention, V)         

The output matrix is the result of the self-attention operation. Each row corresponds to a word, and gives a new embedding for that word which is a combination of the other words in the sentence, weighted by their attention scores.

This is a simplified example, but it covers the key steps in self-attention: computing attention scores from the query and key vectors, and computing a weighted sum of the value vectors based on those scores. It's important to note that in a real transformer model, the query, key, and value vectors would be computed with learned weight matrices rather than being the input embeddings themselves, and there would be multiple "heads" of self-attention (the original Transformer used 8) operating in parallel.

In these examples we used the Python NumPy library for the vector operations (and math); others like JAX and PyTorch could also be used. The key thing to remember is that there are no loops in Transformers: the operations run in parallel as large matrix multiplications on the GPU. This is why large models require fast GPUs with lots of VRAM.
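To see what "no loops" means in practice, here is the loop-based equivalent of the single np.dot(Q, K.T) call from the example; it produces the same numbers, but a GPU cannot parallelize it the way it can a single matrix multiplication.

# Loop-based equivalent of np.dot(Q, K.T): one dot product per word pair
n_words = Q.shape[0]
scores_loop = np.zeros((n_words, n_words))
for i in range(n_words):
    for j in range(n_words):
        scores_loop[i, j] = np.dot(Q[i], K[j])

# Same result, computed all at once as one matrix multiplication
assert np.allclose(scores_loop, np.dot(Q, K.T))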

Once More

Suppose you're developing a tool to auto-complete commands in a Linux terminal. Two of the most frequently entered commands in your dataset are:

  1. "Check the system log and find out whether it rebooted please."
  2. "Check the battery status and find out whether it drained please."

Now, the AI model has to predict the missing words in the phrase "Check the ____ and find out whether it ____ please". Just like in code optimization, where we avoid unnecessary loops and computations, we want to keep our AI model efficient and not brute-force every possible 8-word sequence (which would give us N^8 combinations, a computational nightmare for a large vocabulary).

Instead, we opt for an approach that considers the current word and each of the previous words, forming key-value pairs. This is somewhat similar to creating a hashmap or dictionary in many programming languages where the key is a combination of the current word and one of the preceding words, and the value is the potential next word. This technique drastically reduces complexity while retaining essential context.

Imagine our model as a voting system for autocompletion. Each pair of words (a 'key') votes for what the next word (the 'value') could be. Most pairs ('keys') are not useful because they're common to both commands and don't affect the outcome. The only decisive pairs are the ones involving 'system' (which votes for 'rebooted') and 'battery' (which votes for 'drained').
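Here is a minimal sketch of that voting idea, using the key-value layout described above. The pairs are hand-picked from the two example commands; a real model learns soft weights rather than a hard-coded table.

# Key = (current word, one earlier word), value = a vote for the next word.
# Pairs that occur in both commands (e.g. ("whether", "the")) carry no
# information, so they are left out, i.e. "masked".
votes = {
    ("whether", "system"):  "rebooted",   # decisive pair
    ("whether", "battery"): "drained",    # decisive pair
}

def predict_next(current_word, earlier_words):
    # Return the first decisive vote found among the earlier words
    for earlier in earlier_words:
        if (current_word, earlier) in votes:
            return votes[(current_word, earlier)]
    return None

print(predict_next("whether", ["check", "the", "system", "log"]))      # rebooted
print(predict_next("whether", ["check", "the", "battery", "status"]))  # drained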

To enhance our prediction accuracy, we implement a technique similar to 'masking' used in many areas of programming. This process only pays attention to the decisive key-value pairs and disregards the rest, improving our confidence in the next word prediction. This selective attention to relevant keys while ignoring others is the essence of the 'attention' mechanism used in Transformer models.

Therefore, in a nutshell, the Transformer model is like an intelligent auto-complete tool, able to efficiently predict the next command based on preceding context, regardless of how many commands exist in the vocabulary.

After Self-Attention

After the self-attention step in a Transformer model, the output is passed through a feed-forward neural network and the process is repeated for several layers. The final output of this stack of self-attention and feed-forward layers is used to predict the next token.

Let's look at how a simplified Transformer model might predict the next token. Continuing with the same example, let's suppose we're trying to predict the next word after "I love cats". In this example, the next word is unknown, so we'll denote it as <unknown>.

First, the sentence is processed through several layers of self-attention and feed-forward neural networks, which, for simplicity, we'll denote as the function transformer_layers:

# Our sentence
sentence = ["I", "love", "cats", "<unknown>"]


# Represent our sentence as a matrix, where each row is a word embedding
embedding_matrix = np.array([embeddings.get(word, np.random.rand(2)) for word in sentence])


# The output of the Transformer layers is a new set of embeddings
transformer_output = transformer_layers(embedding_matrix)        
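Note that transformer_layers is only a placeholder above. If you want the snippet to actually run, one possible stand-in (define it before the call) is a single round of the self-attention we built earlier followed by a tiny feed-forward step; real models stack many such layers, each with learned weights, residual connections, and layer normalization.

def transformer_layers(embedding_matrix):
    # One simplified layer: self-attention exactly as in the earlier example
    Q = K = V = embedding_matrix
    d_k = K.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    attention = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    attended = np.dot(attention, V)

    # A toy feed-forward step: a random linear map plus a ReLU.
    # In a real model these weights are learned and the layer is repeated.
    W_ff = np.random.rand(attended.shape[-1], attended.shape[-1])
    return np.maximum(0, np.dot(attended, W_ff))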

The transformer_output is a new set of embeddings that capture the context of each word in the sentence, as provided by the other words.

Next, these context-rich embeddings are fed into a final linear layer followed by a softmax function to produce a probability distribution over the possible next words. The linear layer transforms the embeddings into scores for each word in the vocabulary, and the softmax function turns these scores into probabilities.

# For simplicity, let's assume our vocabulary only contains the words we've seen
vocabulary = ["I", "love", "cats"]


# Our linear layer is a dot product with a learned weight matrix that maps
# the 2-D embedding to one score per word in the vocabulary
# For simplicity, we'll use a random (3 x 2) matrix instead of learned weights
weight_matrix = np.random.rand(len(vocabulary), 2)


# We compute the scores for each word in the vocabulary
scores = np.dot(weight_matrix, transformer_output[-1]) # we only care about the last word


# We turn these scores into probabilities with the softmax function
probabilities = np.exp(scores) / np.sum(np.exp(scores))


# This gives us a probability distribution over the next word
next_word_distribution = dict(zip(vocabulary, probabilities))        

The next_word_distribution gives us a probability for each word in the vocabulary being the next word in the sentence. The model would then select the word with the highest probability as its prediction.
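Picking that word is then a one-liner (this is greedy decoding; real systems often sample from the distribution instead):

# Greedy decoding: take the word with the highest probability
predicted_word = max(next_word_distribution, key=next_word_distribution.get)
print(predicted_word)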

This is a very simplified example, and actual Transformer models are much more complex. For example, they include positional encodings to capture the order of the words, use multiple parallel attention heads, and have a much larger vocabulary with a more sophisticated method for handling unknown words. But I hope this example gives you an idea of the basic principles behind how Transformer models predict the next word in a sentence. And it's important to remember that this is all they do: predict the next token, nothing more.

Learn More

Refresh your linear algebra: dot products and matrix multiplication.

Attention Is All You Need - Paper Explained

Let's build GPT: from scratch, in code, spelled out. w/ Andrej Karpathy

