Visualization of Mathematical Engineering of Transformers - Part 1

Let's review some basic concepts before we deep dive into the mathematical engineering of transformers.

Word Embeddings

In the field of Natural Language Processing, an embedding of a given word is simply a representation of that word. More formally, it is a technique that represents words as real-valued vectors in a lower-dimensional real vector space, capturing inter-word semantic and syntactic information.

The number of vectors in the basis of a real-valued vector space V is the dimension of V.

  • dim(V) = # of basis vectors
  • For a 2-dimensional space, we have two basis vectors (0, 1) and (1, 0)
  • For a 3-dimensional space, we have three basis vectors (0,0,1), (0,1,0) and (1,0,0)

So if a word w belongs to a 2-dimensional space, it simply means that w can be represented by a vector (a1, a2) where a1 and a2 are real numbers.

The vector (a1, a2) is known as the word embedding, embedding vector, or simply the embedding of the word w, and the size of its embedding is 2.

Basically, if some collection of words belongs to a k-dimensional real vector space V, then the size of the embeddings = dim(V) = k.

Visual representation of the Dot Product (or Scalar product)

Fig 1. Projection of vector A onto vector B is A cos(θ)

For two vectors A and B, the dot product A.B is the amount of A in the direction of B times the magnitude of B. This interpretation is extremely useful if you are interested in finding out how much of one vector is projected onto another or how similar the two vectors are in direction.

Assume A is (a1, a2, a3) and B is (b1, b2, b3), then

A.B = a1*b1 + a2*b2 + a3*b3

i.e. the dot product of two vectors is the sum of the products of their corresponding components, giving a scalar result.
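
As a quick sketch in Python (the vectors below are arbitrary example values):

```python
# Dot product of two 3-dimensional vectors (example values).
A = [1.0, 2.0, 3.0]
B = [4.0, -1.0, 0.5]

dot = sum(a * b for a, b in zip(A, B))  # 1*4 + 2*(-1) + 3*0.5 = 3.5
print(dot)  # 3.5
```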

Corpus

A corpus can be defined as a collection of machine-readable authentic text that is sampled to be representative of a particular language or language variety.

One of the commonly accepted defining features of a corpus, which distinguishes a corpus from an archive (i.e. a random collection of texts), is representativeness. A corpus is designed to represent a particular language or language variety whereas an archive is not.

Put simply, a corpus is a large collection of written or spoken texts that is used for language research.

In general, “Representativeness refers to the extent to which a sample includes the full range of variability in a population.”

According to Leech (1991) a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety.

The textual structure of a Corpus can be thought of as a collection of sentences or paragraphs.

Let us assume that we have a corpus of motivational texts, and one of the sentences in the corpus that we wish to analyze is the following:

"You become what you think about"

If we split the sentence into individual words or tokens, we can see that this sentence is made up of the following six individual words:

'you' , 'become' , 'what' , 'you' , 'think', and 'about'

which can be thought of as discrete elements, representing chunks of information. These discrete elements are called tokens, and the process of splitting is called tokenization. More specifically, it is known as word-level tokenization.

For the rest of the article we will consider the same sentence, and let's denote it as S.

i.e. S = you become what you think about

Now let's analyze the process of splitting a sentence in a more formal way.

Tokenization

Tokenization in NLP is a way of splitting unstructured text data into discrete chunks of information, known as tokens.

This is usually the first process in any NLP task.
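
As a rough sketch, word-level tokenization of our sentence can be approximated by a simple whitespace split (real tokenizers also handle punctuation, casing and subwords):

```python
sentence = "You become what you think about"

# Naive word-level tokenization: lowercase and split on whitespace.
tokens = sentence.lower().split()
print(tokens)  # ['you', 'become', 'what', 'you', 'think', 'about']
```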

Transformer Architecture

The transformer architecture consists of two main blocks:

  1. Encoder block - the first (left) block corresponds to the encoder
  2. Decoder block - the second (right) block corresponds to the decoder

Fig2. The encoder-decoder structure of the Transformer architecture taken from “Attention is all you need"

Encoder Block

Let's analyze the information flow step by step in order to get a better sense of the working of the encoder block.

Fig 3. Encoder Block

The first step is to take an input sentence and prepare its input embeddings.

Let's take the following sentence as an input:

you become what you think about

The next step is to create input embeddings from the input sentence.

Before that, let's fix some notation that we'll use throughout the article.

Notation:

Table 1. Notations used in our example case and their values

Input Sentence - you become what you think about

In general, any sentence S can be represented by a sequence of tokens or words, depending on the type of tokenization technique used for splitting the sentence into individual tokens.

So if we tokenize the input sentence using word-level tokenization we'll get the following sequence.

Input sequence = {you, become, what, you, think, about}

| Input Sequence | = 6

The length of the input sequence is six, as the sentence has six tokens.

More generally, any sentence with sequence length of N can be expressed as follows:

S = {w1, w2, ..., wN}

Now consider a six-word sentence in a generic way:

S = {w1, w2, w3, w4, w5, w6}

Also let's assume that we represent these embeddings in a 512-dimensional vector space, i.e. each word w in the sequence can be represented by a vector having 512 dimensions. In other words, the size of the embedding is 512.

If we consider the i-th word in the sequence, it will have the following vector representation:

i-th word embedding vector

Further, the embeddings of every word or token in the sequence can be nicely depicted in the form of a matrix as follows:

Fig 4. Matrix representation of input tokens

Now let us denote our input matrix by X, which is of size 6 by 512 and is as follows:

Fig 5. Representation of Input Matrix

Input Embeddings

To get the input embeddings for a sentence, we tokenize it and then, for each token, look up its index in the vocabulary; that index selects a row of the embedding table. These embeddings are randomly initialized at the beginning, and the model then tunes them during the training process.

Summary of steps

Fig 6. Input Embeddings
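
A small sketch of this lookup, assuming a toy vocabulary and a randomly initialized embedding table (the names vocab and d_model and the chosen indices are illustrative):

```python
import numpy as np

# Toy vocabulary built from our corpus; the indices are illustrative.
vocab = {"you": 0, "become": 1, "what": 2, "think": 3, "about": 4}
d_model = 512

# Randomly initialized embedding table; in a real model these weights are learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["you", "become", "what", "you", "think", "about"]
indices = [vocab[t] for t in tokens]   # token -> vocabulary index
X = embedding_table[indices]           # input matrix of shape (6, 512)
print(X.shape)                         # (6, 512)
```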

Positional Embeddings

  • We want each word to carry some information about its relative position in the sentence
  • We want the model to treat words that appear close to each other as “close” and words that are distant as “distant”
  • We want the positional encoding to represent a pattern that the model can learn
  • It converts a scalar (position index) into a vector (of embedding size)

The positional encoding of a token is computed by the following expressions:

Positional Encodings Formula
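
For reference, the sinusoidal formulation from Vaswani et al. (2017) is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal numpy sketch that precomputes these encodings once (the helper name is illustrative):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]            # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension (depth) index
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

PE = positional_encoding(max_len=6, d_model=512)  # computed once, then cached/reused
print(PE.shape)                                   # (6, 512)
```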

Note:

  • The position of the first word in the sequence starts at 0, denoted by pos, and the same is also true for the dimension number (or depth), denoted by i.
  • The position is relative to the first token of the sentence, which has position 0.

Let us consider a three-word sentence just for illustration purposes.

Sentence 1: Physics is great

It can be tokenized into three separate tokens as "physics", "is" and "great".

  • The position index for the token "physics" is 0
  • The position index for the token "is" is 1, and
  • The position index for the token "great" is 2

Then, as each token is a vector of size 512, we need to make the positional encodings of the same size, so that the input embedding vector of each word can be added element-wise to its positional embedding vector, giving an output vector of the same size, 512.

Fig 7. Steps to compute Positional Encoding

Now let us consider another sentence, again just for illustration:

Sentence 2: Maths is fun

It can be tokenized into "Maths", "is" and "fun".

Now, since the positional encoding depends only on the position of the token and the depth (or component) of the vector, this sentence will also have the same positional embeddings.

Fig 8. Positional embeddings for sentence 2


So the conclusion is that once you compute the positional embeddings for a sequence of length M, they will be the same for all other sentences in the corpus.

A sequence of length M can be thought of as {w1, w2, w3, ..., wM}, where w1, w2, ..., wM are simply the tokens of a sentence having M words.

Hence, these positional encodings are computed only once and are generally stored in a memory buffer so that they can be reused.

Now let us visualize the steps to compute the positional encodings of our input sentence:

Input Sentence: you become what you think about

Fig 9. Steps to compute Positional Embedding

And after computing the values we'll have the following positional embedding vectors for each token:

Fig 10. Positional Embeddings for Input Sentence


Encoder's Input

The final embeddings are the element-wise sum, per token, of the input embeddings and the positional embeddings. These resultant embeddings serve as the input to the encoder, and hence are known as the encoder input embeddings.

A summary of the steps is as follows:

Fig 11. Encoder's Input (output of positional encoding layer)
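
As a minimal sketch (with placeholder values standing in for the actual embeddings), this step is just an element-wise addition:

```python
import numpy as np

M, d_model = 6, 512                      # sequence length and embedding size
rng = np.random.default_rng(0)
X  = rng.normal(size=(M, d_model))       # input embeddings (placeholder values)
PE = rng.normal(size=(M, d_model))       # positional encodings (placeholder values)

encoder_input = X + PE                   # element-wise sum per token -> (6, 512)
print(encoder_input.shape)               # (6, 512)
```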


Now, let's first consider only self-attention, i.e. attention with a single head, and then we'll see how to compute multi-head attention. This will give us a better sense of the math involved.

Self Attention

  • Self-attention allows the model to relate words with each other
  • In our simple case we consider the sequence length M = 6 and d_model = 512. The matrices Q, K and V are just copies of the encoder input.

In order to compute self attention we need to do the following three steps:

  1. Compute the scaled dot product matrix
  2. Apply softmax on the scaled dot product along its rows
  3. Multiply the soft scaled dot product with the values matrix V


1. How to compute the scaled dot product matrix?

Compute the scaled dot product by multiplying the query matrix Q with the transpose of the key matrix K.

Fig 12. Scaled dot product (scaled by a factor of 1/√(embedding size))

Note: In the case of self-attention with a single head, the size of the query equals the size of the key, which is equal to the size of the embeddings, i.e. 512.

Generally, in multi-head attention we scale by 1/√d_k, where d_k is the dimension of the key.

Why do we need to scale by 1/√(key size)?

In self-attention with a single head, d_k = d_model. Let's consider the scenario where we do not use the scaling factor. Take a single row of Q, denoted q, and a single row of K, denoted k. Both q and k have the same dimension, d_k.

Further, assume that the components of both the q vector and the k vector are independently distributed with mean 0 and variance 1.

Let us compute the mean and variance of the dot product q·k in the following way:

Mean and variance of non-scaled dot product

Now let's consider the scenario of scaled dot product:

Mean and Variance of scaled dot product
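
A sketch of both computations, assuming the components of q and k are independent with mean 0 and variance 1 (sums run over i = 1, ..., d_k):

E[q·k] = Σ E[q_i]·E[k_i] = 0, and Var(q·k) = Σ Var(q_i·k_i) = Σ E[q_i²]·E[k_i²] = d_k.

For the scaled version, E[q·k / √d_k] = 0 and Var(q·k / √d_k) = d_k / d_k = 1.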

So the idea is this: if the keys and queries have large dimensions and we do not scale the dot product matrix, then each cell of this matrix (which is nothing but the dot product of some query q and key k) has a large variance, equal to the size of the key. This means that when the key size is large, there is a higher probability that some rows contain values that are much larger than the other values in the same row. When we apply the softmax operation (row-wise) to such rows, it pushes them into regions that are highly unevenly distributed: a few of the output values become very large while the majority become very small, which leads to the problem of vanishing gradients.

But if we scale the dot product, then the value in each cell has a mean of zero and a variance of 1, ensuring that the values in a row do not become arbitrarily large.

Note: For smaller key and query sizes, the unscaled dot product has a smaller variance and hence smaller values, so the scaled and non-scaled dot products behave more or less the same. To get a feel for this statement, work through the mean and variance computation for the dot product above.

2. How to compute the soft scaled dot product matrix?

Apply the softmax operator to the scaled dot product matrix along its rows (row-wise) to get the soft scaled dot product matrix.

i.e. Softmax(Scaled Dot Product) = Soft Scaled Dot Product

Fig 13. Soft Scaled Dot Product

3. How to compute self-attention?

Finally, let's compute self-attention by multiplying the soft scaled dot product matrix with the values matrix V.

Fig 14. Self Attention
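
Putting the three steps together, here is a minimal numpy sketch of single-head self-attention (the shapes follow our example; the input values are placeholders):

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Single-head self-attention where Q = K = V = X."""
    d_model = X.shape[-1]
    Q, K, V = X, X, X                                   # three copies of the encoder input

    # Step 1: scaled dot product matrix, shape (M, M)
    scores = Q @ K.T / np.sqrt(d_model)

    # Step 2: softmax along each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Step 3: multiply by the values matrix V, shape (M, d_model)
    return weights @ V

M, d_model = 6, 512
X = np.random.default_rng(0).normal(size=(M, d_model))
print(self_attention(X).shape)                          # (6, 512)
```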

Properties of Self-Attention

  • Self-attention is permutation invariant (if we do not consider positional encodings)
  • Self-attention requires no additional parameters. So far, the interaction between words has been driven only by their embeddings and positional encodings. (Later, in multi-head attention, we'll add parameter matrices, so this will change)
  • Values along the principal diagonal are expected to be the highest
  • If we don't want some positions or tokens to interact, we can always set their values to -inf (minus infinity) before applying the softmax, i.e. in the scaled dot product matrix, and the model will not learn those interactions. Note: This is an important technique that we will use in the masked multi-head attention of the decoder. The same technique is used for padded tokens, since we don't want these special tokens to interact with the regular tokens


What is permutation invariant property?

Let's say we have an input matrix, and after applying self-attention we get another matrix of the same size. Then any permutation of the rows of the input matrix results in the same permutation of the rows of the self-attention output.

Let us visualize the scenario to get a better sense of this property.

Fig 15. Permutation Invariance
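
A quick numerical check of this property, reusing a compact version of the self-attention sketch above (placeholder values):

```python
import numpy as np

def self_attention(X):
    # Single-head self-attention with Q = K = V = X (as in the earlier sketch).
    s = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))
perm = rng.permutation(6)                  # a random permutation of the rows

# Permuting the input rows permutes the output rows in exactly the same way.
print(np.allclose(self_attention(X)[perm], self_attention(X[perm])))   # True
```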


Why are the diagonal values expected to be the highest?

The scaled dot product is essentially a similarity score of each word with every other word. We expect each word to have the highest similarity with itself, and these self-similarity scores appear in the diagonal cells; hence the softmax operation results in the highest values along the diagonal.

Let us check this in the scaled dot product matrix.

Fig 16. Scaled Similarity Score is highest in principal diagonal


How to stop the interaction between any two words, and why does it work?

To stop the interaction between any two words, or in other words, if you don't want a word to attend to a particular word, you need to replace the corresponding value with a large negative number (generally -infinity) before applying softmax.

If a word has zero interaction with some other word, it simply means that the corresponding value in the attention matrix should be zero.

Let's say we don't want the word 'become' to interact with 'what'. Also, we don't want the word 'what' to interact with 'about'.

It means that we should have the following self-attention matrix:

Fig 17. Self-Attention matrix with few zero interactions

How to get this matrix?

Simply replace the corresponding values with -inf before applying softmax, i.e. replace these values in the scaled dot product matrix.

Let's visualize it through the following figure.

Fig 18. Procedure to enforce non-interaction
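
A sketch of this masking step in numpy (the token order follows our sentence, the values are placeholders, and only one direction of each interaction is masked here for illustration):

```python
import numpy as np

M, d_model = 6, 512
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d_model))

scores = X @ X.T / np.sqrt(d_model)      # scaled dot product, shape (6, 6)

# Token order: 0 you, 1 become, 2 what, 3 you, 4 think, 5 about.
# Block 'become' (row 1) from attending to 'what' (column 2),
# and 'what' (row 2) from attending to 'about' (column 5).
scores[1, 2] = -np.inf
scores[2, 5] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights[1, 2], weights[2, 5])      # both 0.0: the masked interactions vanish
```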

Now let's see what multi-head attention is.

Multi-Head Attention

As the name suggests, multi-head attention is an attention mechanism with multiple heads. It is represented by the following expressions:

Multi-Head Attention Equations

Multi-head attention will essentially take every word and combine it with some of the other words through the attention mechanism, to produce better embeddings that merge together information from pairs of words, pairs of pairs of words, and so on.

Let's try to visualize the mechanism of Multi-head attention to get a better sense of it.

In the diagram of the encoder block (Fig 19), we can see four arrows carrying the encoder's input X.

So this essentially means that we make four copies of X, and each copy has its own purpose, as follows:

  1. Query Q,
  2. Key K,
  3. Value V
  4. and residual connection

Fig 19. Encoder Block

Three copies of X (i.e. Q, K and V) are sent to the multi-head attention block, and the fourth copy is sent as a residual connection to the normalization layer.

So what does multi-head attention do?

First of all, it multiplies the three matrices Q, K and V (which are essentially the same as the matrix X) with corresponding parameter matrices, also called weight matrices, to get three new matrices, known as projected matrices. These projected matrices (one each for Q, K and V) have the same dimensions as the input matrix X.

Further, these matrices are split into smaller matrices (depending on the number of heads) along the d_model (embedding size) dimension.

So every head will see the full sentence but only a smaller part of the embedding of each word, i.e. if the embedding size is 512 (say) and the number of heads is h = 4, each head works with smaller embeddings of size 512/4 = 128 dimensions, denoted d_k.

Fig 20. Equation to compute dimension of smaller matrices

We can compute attention for these smaller matrices q_i, k_i and v_i using the following expressions:

Attention Equations

This will result in a smaller matrix called head_i, where i is the index representing the i-th head.

MultiHead Attention Mechanism

All the heads are then concatenated and multiplied by a weight matrix W_O to get MH-A (the multi-head attention matrix), which captures information from all the attention heads. This result is finally sent to a feed-forward neural network (FFNN) to get the final matrix, the output of the encoder layer. That output then serves as the input for the next encoder layer, and the same process is repeated N times, as we have N encoders.
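
A compact numpy sketch of the mechanism described above (the weight matrices are randomly initialized placeholders; in a real model they are learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    M, d_model = X.shape
    d_k = d_model // h                                       # per-head size, e.g. 512/4 = 128

    # Project the three copies of X, then split them into h heads.
    Q = (X @ W_q).reshape(M, h, d_k).transpose(1, 0, 2)      # (h, M, d_k)
    K = (X @ W_k).reshape(M, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(M, h, d_k).transpose(1, 0, 2)

    # Scaled dot-product attention inside each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)         # (h, M, M)
    heads = softmax(scores) @ V                              # (h, M, d_k)

    # Concatenate the heads and apply the output projection W_o.
    concat = heads.transpose(1, 0, 2).reshape(M, d_model)    # (M, d_model)
    return concat @ W_o                                      # (M, d_model)

M, d_model, h = 6, 512, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (6, 512)
```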

To be continued ... in Part-2


References:

- Vaswani, A., et al. (2017). Attention Is All You Need. https://arxiv.org/pdf/1706.03762
