Understanding Transformers: A Mathematical Dive into Each Layer



Understanding transformers is not just about knowing what each component does; you truly start to grasp their power when you dive into the mathematical concepts underpinning each layer. Let's explore transformers layer by layer:




1. Input Embedding

The first layer in a transformer is the input embedding layer, which transforms textual data into a format that the model can process.

Key Concepts in Input Embedding:

Tokenization:

  • Breaking down text into smaller units, such as words, subwords, or characters.
  • Common preprocessing techniques: Porter stemmer, Snowball stemmer, lemmatization, and regex-based splitting.
  • These tokens form the foundation for further processing.

Vector Representation of Text:

How is tokenized text converted into vectors (TF, IDF, TF-IDF, ...)?

Use of Sparse Matrices:

  • Often employed for representing large amounts of tokenized text where most entries are zeros.
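As a concrete illustration of the three ideas above, here is a minimal sketch using scikit-learn's TfidfVectorizer. The library choice and the toy corpus are only for demonstration; in a trained transformer, learned embedding lookups take the place of the TF-IDF vectors.

    # A minimal sketch: tokenize a tiny corpus and build a sparse TF-IDF matrix.
    # scikit-learn is just one common choice; the article does not prescribe a library.
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "transformers process tokens in parallel",
        "tokens are mapped to dense vectors",
        "sparse matrices store mostly zeros efficiently",
    ]

    vectorizer = TfidfVectorizer()          # default tokenization: lowercase + regex word split
    X = vectorizer.fit_transform(corpus)    # X is a SciPy sparse matrix (CSR format)

    print(vectorizer.get_feature_names_out())  # the learned vocabulary (tokens)
    print(X.shape)                             # (3 documents, vocabulary size)
    print(X.nnz, "non-zero entries out of", X.shape[0] * X.shape[1])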



2. Positional Encoding

Transformers lack recurrence or convolution mechanisms. To capture the order of tokens, positional encoding is introduced.

Key Concepts in Positional Encoding:

Trigonometric Functions Used:

  • Sine (sin) for even positions.

  • Cosine (cos) for odd positions.

Why Sine and Cosine?

  • These functions capture continuous patterns.

  • They are closely related (each is the derivative of the other, up to sign), so the encoding of a shifted position is a simple linear transformation of the current one, which lets the model learn relative positional differences between tokens effectively.
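The sinusoidal encodings from the original transformer paper are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch of these formulas:

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
        positions = np.arange(seq_len)[:, np.newaxis]          # shape (seq_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]                # shape (1, d_model)
        angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                        # shape (seq_len, d_model)

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even dimensions
        pe[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd dimensions
        return pe

    print(sinusoidal_positional_encoding(seq_len=4, d_model=6).round(3))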


3. Multi-Head Attention

The attention mechanism is the heart of transformers, allowing the model to focus on different parts of the input simultaneously.

Key Concepts in Multi-Head Attention:

Self-Attention:

  • Computes a token’s relationship with all other tokens in the sequence.

Matrix Multiplication:

  • Essential for deriving the query (Q), key (K), and value (V) matrices.

Transpose of Matrices:

  • The key matrix is transposed (Kᵀ) so that query-key similarities can be computed as the product QKᵀ.

Matrix Split for Multiple Heads:

The input matrix is split into multiple heads to enable parallel computation of attention. Let’s illustrate this with an example:

         Example input matrix (4×3):

                 | 1   2   3  |
             A = | 4   5   6  |
                 | 7   8   9  |
                 | 10  11  12 |

  • Matrix size: 4×3 (4 rows, 3 columns).

  • Assume h = 3 (the number of heads).

  • Column split per head: since the 3 columns are divided across h = 3 heads, each head gets 3/3 = 1 column.

Each head gets a slice of the matrix. The slices are:

                      | 1  |
             Head 1 = | 4  |
                      | 7  |
                      | 10 |

                      | 2  |
             Head 2 = | 5  |
                      | 8  |
                      | 11 |

                      | 3  |
             Head 3 = | 6  |
                      | 9  |
                      | 12 |

The original matrix has been split into 3 heads, each of size 4×1.

This matches the idea of multi-head attention, where each head operates on a subset of the feature dimensions (columns in this case).
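The head-splitting example above can be reproduced in a few lines of NumPy. The scaled dot-product step, softmax(QKᵀ/√d_k)·V, from the original transformer paper is included only to show where each head's slice is used; reusing the raw slice as Q, K, and V is purely illustrative, since in a real model they come from learned projections.

    import numpy as np

    A = np.arange(1, 13).reshape(4, 3)       # the 4x3 input matrix from the example
    num_heads = 3
    head_dim = A.shape[1] // num_heads       # 3 columns / 3 heads = 1 column per head

    # Split the columns: shape (4, 3) -> (num_heads, 4, head_dim)
    heads = A.reshape(4, num_heads, head_dim).transpose(1, 0, 2)
    print(heads[0])   # Head 1: column [1, 4, 7, 10], shape (4, 1)

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V -- the per-head attention computation."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # token-to-token similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
        return weights @ V

    # In practice Q, K, V come from learned projections of each head's slice;
    # here the raw slice stands in for all three, purely for illustration.
    out = scaled_dot_product_attention(heads[0], heads[0], heads[0])
    print(out.shape)   # (4, 1) -- one output vector per token, per head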


4. Layer Normalization vs Batch Normalization

Layer normalization standardizes each token's feature vector across its own dimensions, whereas batch normalization standardizes each feature across the examples in a batch. Transformers rely on layer normalization because it behaves the same for any batch size and any sequence length.
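Assuming this section refers to the layer normalization applied inside every transformer block, here is a minimal NumPy sketch (the learnable gain and bias parameters are omitted for brevity):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        """Normalize each token's feature vector (last axis), as in transformer blocks."""
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)   # learnable gain/bias omitted for brevity

    x = np.array([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 30.0]])           # 2 tokens, 3 features each
    print(layer_norm(x).round(3))                # each row now has mean ~0, variance ~1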


5. SoftMax Function

SoftMax is used to convert raw attention scores into probabilities.

Mathematical Insight:

  • Emphasizes the largest values while suppressing smaller ones.

  • Ensures all probabilities sum to 1.
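A minimal, numerically stable sketch of the softmax (subtracting the row maximum before exponentiating is a standard trick that leaves the result unchanged):

    import numpy as np

    def softmax(scores):
        """Convert raw attention scores into probabilities along the last axis."""
        shifted = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
        exp = np.exp(shifted)                                   # emphasizes larger values
        return exp / exp.sum(axis=-1, keepdims=True)            # rows sum to 1

    print(softmax(np.array([2.0, 1.0, 0.1])).round(3))   # e.g. [0.659, 0.242, 0.099]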


6. Causal Masking

In causal masking (used in autoregressive models like GPT), the mask prevents each token from attending to future tokens, ensuring that predictions depend only on previous tokens.

Consider a sequence of tokens [t1, t2, t3]. The mask ensures:

t1 attends only to t1.

t2 attends to t1 and t2.

t3 attends to t1, t2, and t3.


The mask matrix for this example (sequence length n = 3) is:

             | 0   −∞   −∞ |
         M = | 0    0   −∞ |
             | 0    0    0 |
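The same mask can be built programmatically and added to the raw attention scores before the softmax, which drives the masked entries to zero probability. A small NumPy sketch:

    import numpy as np

    def causal_mask(n):
        """Upper-triangular mask: 0 where attention is allowed, -inf above the diagonal."""
        mask = np.zeros((n, n))
        mask[np.triu_indices(n, k=1)] = -np.inf
        return mask

    print(causal_mask(3))
    # [[  0. -inf -inf]
    #  [  0.   0. -inf]
    #  [  0.   0.   0.]]
    # Adding this to the score matrix before softmax drives masked entries to probability 0.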

7. Loss Function Estimation


8. Beam Search Algorithm

A decoding strategy for generating sequences in tasks like text generation or translation.

Key Concepts:

• Keeps track of the k best sequences at each step, where k is the beam width.

• Balances exploring diverse candidates with keeping high-probability sequences.

Advantages:

• Produces more coherent and contextually relevant outputs compared to greedy decoding.
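A compact sketch of the idea. The next_token_log_probs function below is a hypothetical stand-in for a real model's output distribution, and the vocabulary, beam width k, and sequence length are toy values chosen only for illustration:

    import numpy as np

    def beam_search(next_token_log_probs, vocab, start, k=2, max_len=5):
        """Keep the k highest-scoring partial sequences at every step."""
        beams = [([start], 0.0)]                      # (sequence, cumulative log-prob)
        for _ in range(max_len):                      # no end-token handling in this sketch
            candidates = []
            for seq, score in beams:
                log_probs = next_token_log_probs(seq)       # one log-prob per vocab entry
                for token, lp in zip(vocab, log_probs):
                    candidates.append((seq + [token], score + lp))
            # Keep only the k best extensions (highest cumulative log-probability).
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beams

    # Toy stand-in for a model: a fixed distribution over a 3-word vocabulary.
    vocab = ["a", "b", "c"]
    def toy_log_probs(seq):
        return np.log([0.6, 0.3, 0.1])

    for seq, score in beam_search(toy_log_probs, vocab, start="<s>", k=2, max_len=3):
        print(seq, round(float(score), 3))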


By understanding these mathematical underpinnings, you gain a deeper appreciation for how transformers function and why they excel across so many NLP tasks.
