Understanding Self-Attention in Transformers
Siva Swetha G
Ever wondered how Transformer models like BERT and GPT understand relationships between words? The secret lies in self-attention—a mechanism that helps models evaluate how every word in a sentence relates to every other word.
Let's dive in and explore how it works.
The Sentence
We will work with the sentence: "The cat chased the mouse."
Our goal is to compute self-attention using the scaled dot-product formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where Q, K, and V are the query, key, and value matrices, and d_k is the dimensionality of the key vectors.
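Before working through the arithmetic by hand, here is the whole formula as a short NumPy sketch (the function name and layout are illustrative choices, not part of the worked example):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                               # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)                 # raw attention scores
    weights = np.exp(scores)                        # softmax, applied row by row
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # weighted sum of the value vectors
```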
Let’s calculate the attention for all the words step by step.
Step 1: Convert Words into Embeddings
We assign the following 3-dimensional embeddings to the words:

"The" → [0.1, 0.2, 0.3]
"cat" → [0.5, 0.6, 0.7]
"chased" → [0.8, 0.9, 1.0]
"mouse" → [1.1, 1.2, 1.3]
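Stacked into a matrix with one row per word, the embeddings look like this in NumPy (the variable name X is a choice of convenience):

```python
import numpy as np

# Embedding matrix: one row per word of "The cat chased the mouse."
X = np.array([
    [0.1, 0.2, 0.3],   # "The"
    [0.5, 0.6, 0.7],   # "cat"
    [0.8, 0.9, 1.0],   # "chased"
    [1.1, 1.2, 1.3],   # "mouse"
])
```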
Step 2: Compute Q, K & V Vectors
We multiply each embedding by three weight matrices, W_Q, W_K, and W_V, to obtain its query, key, and value vectors (a code sketch follows Step 2.4):
Step 2.1: For "The" [0.1, 0.2, 0.3]:
Step 2.2: For "cat" [0.5, 0.6, 0.7]:
Step 2.3: For "chased" [0.8, 0.9, 1.0]:
Step 2.4: For "mouse" [1.1, 1.2, 1.3]:
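Steps 2.1 through 2.4 amount to a single matrix multiplication per projection. A minimal sketch, using placeholder weight values (not the matrices used in the worked example above, so the resulting numbers will differ from the hand-worked ones):

```python
import numpy as np

X = np.array([[0.1, 0.2, 0.3],    # "The"
              [0.5, 0.6, 0.7],    # "cat"
              [0.8, 0.9, 1.0],    # "chased"
              [1.1, 1.2, 1.3]])   # "mouse"

# Placeholder 3x3 weight matrices; illustrative values only
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.uniform(0, 1, (3, 3)) for _ in range(3))

Q = X @ W_Q   # one query vector per row
K = X @ W_K   # one key vector per row
V = X @ W_V   # one value vector per row
```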
Step 3: Compute Attention Scores
The attention score between two words is the scaled dot product of the first word's query vector and the second word's key vector:

score(i, j) = (q_i · k_j) / √d_k
3.1: Compute Scores for "The":

Taking the query vector of "The" against the key vectors of all four words gives:

[0.150, 0.559, 0.902, 1.261]

This forms the first row of the score matrix.
3.2: Compute Scores for Remaining Words:
Repeat the process for "cat", "chased", and "mouse", computing each word's scores against all four words.
The final score matrix is:
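In matrix form, the entire score matrix is a single expression: Q times the transpose of K, scaled by √d_k. A minimal sketch, assuming the Q and K matrices from Step 2:

```python
import numpy as np

def attention_scores(Q, K):
    """scores[i, j] = (q_i . k_j) / sqrt(d_k), for all word pairs at once."""
    d_k = K.shape[-1]
    return Q @ K.T / np.sqrt(d_k)   # a 4x4 matrix for our four-word example
```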
Step 4: Normalize the Scores using Softmax
Apply softmax to each row of the score matrix to convert raw scores into probabilities. This step ensures that the scores for each word sum to 1.
The formula for softmax is:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
Softmax for "The":

Scores: [0.150, 0.559, 0.902, 1.261]

Exponentiate each value:
e^0.150 ≈ 1.162, e^0.559 ≈ 1.749, e^0.902 ≈ 2.465, e^1.261 ≈ 3.529

Sum of exponentials: 1.162 + 1.749 + 2.465 + 3.529 = 8.905

Normalize each value:
1.162 / 8.905 ≈ 0.130, 1.749 / 8.905 ≈ 0.196, 2.465 / 8.905 ≈ 0.277, 3.529 / 8.905 ≈ 0.396

Normalized probabilities for "The": [0.130, 0.196, 0.277, 0.396]
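These three operations (exponentiate, sum, divide) map directly onto NumPy. A quick check of the row for "The" reproduces the numbers above:

```python
import numpy as np

row = np.array([0.150, 0.559, 0.902, 1.261])   # raw scores for "The"
exps = np.exp(row)                              # ≈ [1.162, 1.749, 2.465, 3.529]
probs = exps / exps.sum()                       # the sum is ≈ 8.905
print(probs.round(3))                           # ≈ [0.130, 0.196, 0.277, 0.396]
```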
Similarly, calculate the softmax for the rows corresponding to "cat", "chased", and "mouse". The final softmax-normalized score matrix is:
Step 5: Weighted Sum of Values
The final step is to compute the weighted sum of the Value vectors for each word. The weights are the attention probabilities from the normalized score matrix.
For "The":

Attention probabilities: [0.130, 0.196, 0.277, 0.396]

Value vectors: the value vectors of "The", "cat", "chased", and "mouse" computed in Step 2.

Weighted sum:

output_The = 0.130 · v_The + 0.196 · v_cat + 0.277 · v_chased + 0.396 · v_mouse

Compute each term, then sum them to obtain the output vector for "The".
This vector represents how "The" is understood, considering its relationship with "cat," "chased," and "mouse."
Perform similar weighted sums for the other words (cat, chased, mouse) using their attention probabilities and Value vectors.
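In code, the weighted sum for one word is a dot product between its probability row and the value matrix; for all words at once it is a single matrix product. A minimal sketch with placeholder value vectors (the real ones depend on the weight matrices from Step 2):

```python
import numpy as np

# Attention probabilities for "The" (from Step 4)
probs_the = np.array([0.130, 0.196, 0.277, 0.396])

# Value vectors, one row per word; placeholder numbers for illustration
V = np.array([[0.2, 0.3, 0.4],    # v_The
              [0.6, 0.7, 0.8],    # v_cat
              [0.9, 1.0, 1.1],    # v_chased
              [1.2, 1.3, 1.4]])   # v_mouse

# 0.130*v_The + 0.196*v_cat + 0.277*v_chased + 0.396*v_mouse
context_the = probs_the @ V
print(context_the)

# For every word at once: outputs = attention_probs @ V,
# where attention_probs is the full 4x4 softmax matrix from Step 4.
```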
Conclusion
The final contextualized embeddings encode how each word interacts with the others in the sentence.
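To tie the five steps together, here is the full pipeline as one runnable sketch. The embeddings are the ones from Step 1, but the weight matrices are random placeholders rather than the ones used in the worked example, so the printed numbers illustrate the shape of the computation rather than the exact values above:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Steps 2-5: project, score, scale, softmax, and take the weighted sum."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # Step 2: Q, K, V vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # Step 3: scaled dot products
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 4: row-wise softmax
    return weights @ V                              # Step 5: weighted sum of values

# Step 1: word embeddings for "The", "cat", "chased", "mouse"
X = np.array([[0.1, 0.2, 0.3],
              [0.5, 0.6, 0.7],
              [0.8, 0.9, 1.0],
              [1.1, 1.2, 1.3]])

# Placeholder 3x3 weight matrices (illustrative values only)
rng = np.random.default_rng(42)
W_Q, W_K, W_V = (rng.uniform(0, 1, (3, 3)) for _ in range(3))

print(self_attention(X, W_Q, W_K, W_V).round(3))    # one contextual vector per word
```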