Understanding Self-Attention in Transformers

Ever wondered how Transformer models like BERT and GPT understand relationships between words? The secret lies in self-attention—a mechanism that helps models evaluate how every word in a sentence relates to every other word.

Let’s dive in and explore it step by step.


The Sentence

We will work with the sentence: "The cat chased the mouse."

Our goal is to compute self-attention using the scaled dot-product attention formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

where Q, K, and V are the Query, Key, and Value matrices and d_k is the dimension of the Key vectors.

Let’s calculate the attention for all the words step by step.
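As a reference for the steps that follow, here is a minimal NumPy sketch of this formula (the function and variable names are my own, not from the article):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V
```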


Step 1: Convert Words into Embeddings

We assign the following 3-dimensional embeddings to the words; the example works with four tokens, so both occurrences of "the" share one embedding (see the matrix sketch after this list):

  • "The": [0.1,0.2,0.3]
  • "cat": [0.5,0.6,0.7]
  • "chased": [0.8,0.9,1.0]
  • "mouse": [1.1,1.2,1.3]


Step 2: Compute Q, K, and V Vectors

Each embedding is multiplied by three learned weight matrices, W_Q, W_K, and W_V, to produce its Query, Key, and Value vectors:

q = x·W_Q,  k = x·W_K,  v = x·W_V

Step 2.1 : For "The" [0.1,0.2,0.3] :

Step 2.2 : For "cat" [0.5,0.6,0.7] :

Step 2.3 : For "chased" [0.8,0.9,1.0] :

Step 2.4 : For "mouse" [1.1,1.2,1.3] :


Step 3: Compute Attention Scores

The attention score between two words is the scaled dot product of the first word’s Query vector with the second word’s Key vector:

score(i, j) = Q_i · K_j / √d_k

3.1: Compute Scores for "The":

Taking the Query vector of "The" against the Key vectors of all four words gives [0.150, 0.559, 0.902, 1.261]. This forms the first row of the score matrix.

3.2: Compute Scores for the Remaining Words:

Repeat the process for "cat", "chased", and "mouse", calculating their scores against every other word. Stacking the four rows gives the final 4×4 score matrix.
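In matrix form, all sixteen scores come from a single matrix product. A sketch with hypothetical Q and K values (the article’s actual projected vectors are not shown above):

```python
import numpy as np

# Hypothetical Q and K matrices (4 tokens x 3 dims), for shape only
Q = np.array([[0.2, 0.3, 0.4],
              [0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1],
              [1.2, 1.3, 1.4]])
K = np.array([[0.1, 0.3, 0.5],
              [0.4, 0.6, 0.8],
              [0.7, 0.9, 1.1],
              [1.0, 1.2, 1.4]])

d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)  # 4x4 matrix; scores[i, j] = Q_i . K_j / sqrt(d_k)
```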


Step 4: Normalize the Scores using Softmax

Apply softmax to each row of the score matrix to convert raw scores into probabilities. This step ensures that the scores for each word sum to 1.

The softmax formula is:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

Softmax for "The" :

Score : [0.150, 0.559, 0.902, 1.261]

Exponentiate each value:

Sum of exponentials : 1.162 + 1.749 + 2.465 +3.529 = 8.905

Normalize each value :

Normalized probabilities for "The": [0.130, 0.196, 0.277, 0.396]
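A quick NumPy check of this row reproduces the numbers above:

```python
import numpy as np

row = np.array([0.150, 0.559, 0.902, 1.261])  # scores for "The"
exp = np.exp(row)                             # ~[1.162, 1.749, 2.465, 3.529]
probs = exp / exp.sum()                       # ~[0.130, 0.196, 0.277, 0.396]
print(probs.round(3))
```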

Similarly, apply softmax to the rows for "cat", "chased", and "mouse" to obtain the final softmax-normalized score matrix.


Step 5: Weighted Sum of Values

The final step is to compute the weighted sum of the Value vectors for each word. The weights are the attention probabilities from the normalized score matrix.

For "The" :

Attention probabilities : [0.130, 0.196, 0.277, 0.396]

Value Vectors :

  • "The" : [0.12, 0.22, 0.32]
  • "cat" : [0.55, 0.86, 1.17]
  • "chased" : [0.84, 1.35, 1.86]
  • "mouse" : [1.21, 2.00, 2.79]

Weighted sum:

output("The") = 0.130·V("The") + 0.196·V("cat") + 0.277·V("chased") + 0.396·V("mouse")

Compute each term:

  • 0.130 × [0.12, 0.22, 0.32] = [0.0156, 0.0286, 0.0416]
  • 0.196 × [0.55, 0.86, 1.17] = [0.1078, 0.1686, 0.2293]
  • 0.277 × [0.84, 1.35, 1.86] = [0.2327, 0.3740, 0.5152]
  • 0.396 × [1.21, 2.00, 2.79] = [0.4792, 0.7920, 1.1048]

Sum them up: [0.835, 1.363, 1.891]

This vector represents how "The" is understood, considering its relationship with "cat," "chased," and "mouse."
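The same weighted sum in NumPy, using only numbers given above:

```python
import numpy as np

probs = np.array([0.130, 0.196, 0.277, 0.396])  # attention weights for "The"
V = np.array([[0.12, 0.22, 0.32],   # "The"
              [0.55, 0.86, 1.17],   # "cat"
              [0.84, 1.35, 1.86],   # "chased"
              [1.21, 2.00, 2.79]])  # "mouse"
context_the = probs @ V             # weighted sum of Value vectors
print(context_the.round(3))         # ~[0.835, 1.363, 1.891]
```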

Perform similar weighted sums for the other words (cat, chased, mouse) using their attention probabilities and Value vectors.


Conclusion

The final contextualized embeddings encode how each word interacts with others in the sentence:

  • "The" becomes influenced by "cat," "chased," and "mouse", understanding its role in the context.
  • Similarly, "cat" and others now incorporate relationships from the sentence.

