Understanding Transformers: A Mathematical Dive into Each Layer
RAHUL PATIL
Sr. AI/ML Engineer @Accenture || Nvidia NeMo Developer || Azure AI Engineer || GenAI Engineer || Python Developer || Data Analyst || Tech Blogger
Understanding transformers doesn't mean just knowing how they functionally work. You truly start to grasp their power when you dive deep into the mathematical concepts underpinning each layer. Let’s explore transformers layer by layer:
1. Input Embedding
The first layer in a transformer is the input embedding layer, which transforms textual data into a format that the model can process.
Key Concepts in Input Embedding:
Tokenization:
Vector Representation of Text:
A common question: why does the embedding dimension vary from model to model (e.g., 512 or 768)?
How to decide the value: the dimension is a design hyperparameter; larger values increase representational capacity at the cost of more compute and memory.
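A minimal sketch of tokenization followed by embedding lookup (the vocabulary, tokens, and dimension of 4 below are made up purely for illustration; real embedding weights are learned during training):

import numpy as np

# Toy vocabulary mapping each token to an integer id (illustrative only)
vocab = {"the": 0, "cat": 1, "sat": 2}

# Embedding table: one vector (row) per vocabulary entry.
# In a real model these weights are learned; here they are random.
d_model = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "cat", "sat"]            # tokenized text
token_ids = [vocab[t] for t in tokens]    # tokens -> integer ids
embeddings = embedding_table[token_ids]   # id lookup -> dense vectors

print(embeddings.shape)                   # (3, 4): one d_model vector per token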
2. Positional Encoding
Transformers lack recurrence or convolution mechanisms. To capture the order of tokens, positional encoding is introduced.
Key Concepts in Positional Encoding:
Trigonometric Functions Used:
- Sine (sin) for even indices of the embedding dimension.
- Cosine (cos) for odd indices.
Why sine and cosine?
- These smooth, periodic functions give every position a distinct, continuous pattern.
- The encoding at position pos + k can be expressed as a linear function of the encoding at position pos, which lets the model learn relative positional differences between tokens effectively (a sketch of the standard formulation follows below).
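A minimal sketch of the standard sinusoidal formulation (assuming the form from the original "Attention Is All You Need" paper, with d_model as the embedding size):

import numpy as np

def positional_encoding(seq_len, d_model):
    # pos: token position; i: even dimension index used for each sin/cos pair
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angle = pos / np.power(10000.0, i / d_model)      # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # sine on even dimension indices
    pe[:, 1::2] = np.cos(angle)   # cosine on odd dimension indices
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8): added element-wise to the input embeddings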
3. Multi-Head Attention
The attention mechanism is the heart of transformers, allowing the model to focus on different parts of the input simultaneously.
Key Concepts in Multi-Head Attention:
Self-Attention:
- Computes each token's relationship with every other token in the sequence.
Matrix Multiplication:
- Essential for deriving the query (Q), key (K), and value (V) matrices.
Transpose of Matrices:
- Used when computing the attention scores QK^T, which compare every query with every key (these steps are combined in the sketch after the worked example below).
Matrix Split for Multiple Heads:
The input matrix is split into multiple heads to enable parallel computation of attention. Let’s illustrate this with an example:
Ex. Input Matrix:

A = |  1   2   3 |
    |  4   5   6 |
    |  7   8   9 |
    | 10  11  12 |
- Matrix size: 4×3 (4 rows, 3 columns).
- Assume h = 3 (the number of heads).
- Column Split per Head: since the 3 columns are divided across h = 3 heads, each head gets 3/3 = 1 column.
Each head gets a slice of the matrix. The slices are:
Head 1 = |  1 |
         |  4 |
         |  7 |
         | 10 |

Head 2 = |  2 |
         |  5 |
         |  8 |
         | 11 |

Head 3 = |  3 |
         |  6 |
         |  9 |
         | 12 |
The original matrix has been split into 3 heads, each of size 4×1.
This matches the idea of multi-head attention, where each head operates on a subset of the feature dimensions (columns in this case).
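Tying the steps together, here is a small sketch that splits the example matrix and runs scaled dot-product attention inside each head (the per-head Q, K, and V simply reuse the raw slice in place of learned projections, purely for illustration):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerical stability
    return e / e.sum(axis=axis, keepdims=True)

# The 4 x 3 input matrix A from the example above
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]], dtype=float)

h = 3                                    # number of heads
heads = np.split(A, h, axis=1)           # three 4 x 1 slices, exactly as above

outputs = []
for head in heads:
    d_k = head.shape[1]                  # per-head dimension (1 in this example)
    # In a real model, Q, K and V come from learned linear projections of the
    # head; here the raw slice stands in for all three, purely for illustration.
    Q = K = V = head
    scores = Q @ K.T / np.sqrt(d_k)      # matrix multiplication + transpose
    weights = softmax(scores, axis=-1)   # each row sums to 1
    outputs.append(weights @ V)          # 4 x 1 output per head

result = np.concatenate(outputs, axis=1)   # heads rejoined into a 4 x 3 matrix
print(result.shape)                        # (4, 3)

In a real transformer each head has its own learned W_q, W_k, W_v projection matrices, and the concatenated result passes through a final output projection.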
4. Batch Normalization vs Layer Normalization
5. SoftMax Function
SoftMax is used to convert raw attention scores into probabilities.
Mathematical Insight:
- Emphasizes the largest values while suppressing smaller ones.
- Ensures all probabilities sum to 1.
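For example, applying softmax(x_i) = exp(x_i) / Σ_j exp(x_j) to a small vector of arbitrary raw scores:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])              # raw attention scores
exp_scores = np.exp(scores - scores.max())      # subtract max for stability
probs = exp_scores / exp_scores.sum()

print(probs)        # approx [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0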
6. Causal Masking
In causal masking (used in autoregressive models such as GPT), the mask prevents each token from attending to future tokens, so predictions depend only on previous tokens.
Consider a sequence of tokens [t1, t2, t3]. The mask ensures:
- t1 attends only to t1.
- t2 attends to t1 and t2.
- t3 attends to t1, t2, and t3.
The mask matrix for this example (sequence length n = 3) is:

M = | 0  -∞  -∞ |
    | 0   0  -∞ |
    | 0   0   0 |
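A small sketch of how such an additive mask can be built and applied just before the softmax (the scores here are dummy values):

import numpy as np

n = 3
scores = np.ones((n, n))                           # toy attention scores

# Upper-triangular -inf mask: position i may not attend to any j > i
mask = np.triu(np.full((n, n), -np.inf), k=1)
masked = scores + mask

# Row-wise softmax: -inf entries receive zero probability
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights)
# Row 1: [1, 0, 0], Row 2: [0.5, 0.5, 0], Row 3: [1/3, 1/3, 1/3]

Because exp(-∞) = 0, the masked (future) positions end up with exactly zero attention weight.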
7. Text Sampling Techniques
8. Beam Search Decoding
A decoding strategy for generating sequences in tasks like text generation or translation.
Key Concepts:
- Keeps track of the k best sequences at each step (k = beam width).
- Balances exploring diverse options against keeping high-probability sequences.
Advantages:
- Produces more coherent and contextually relevant outputs compared to greedy decoding.
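A toy sketch of the procedure (the next-token log-probability table below is entirely made up; in practice the model supplies these scores):

import math

# Hypothetical next-token log-probabilities keyed by the previous token.
next_logprobs = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
    "a":   {"cat": math.log(0.7), "dog": math.log(0.3)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}

def beam_search(beam_width=2, max_steps=3):
    beams = [(["<s>"], 0.0)]                 # (sequence, cumulative log-prob)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs.get(seq[-1], {}).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:                   # nothing left to expand
            break
        # Keep only the k best sequences (k = beam width)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), round(score, 3))

With beam_width = 1 this reduces to greedy decoding, which is why wider beams tend to produce more coherent, higher-probability outputs.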
By understanding these mathematical underpinnings, you gain a deeper appreciation for how transformers function and why they excel across so many NLP tasks.