Self-Attention: The Superpower Behind Transformers (Part 1)

Have you ever been at a party, trying to follow multiple conversations at once? You naturally tune in to the person whose story catches your attention, while still keeping an ear open for someone mentioning your name across the room. This ability to selectively focus is exactly what makes self-attention in transformers so powerful. It helps AI figure out which words to "listen" to in a sentence and how they all connect.

In this article, we’ll break down self-attention step-by-step, using a simple sentence and visuals to make it crystal clear.

Step 1: Breaking Words into Three Roles

In self-attention, every word in a sentence takes on three roles:

1. Query (Q): Represents what this word is looking for.

2. Key (K): Represents what this word can offer to others.

3. Value (V): Represents the actual meaning or contribution of the word.

Example Sentence:

Let’s take the sentence: "How are you?"

Each word—"How", "are", and "you"—is first encoded into its own representation (denoted as eHow, eAre, and eYou). From this encoding, three unique vectors are created for every word:

  • qHow, kHow, and vHow for "How".
  • qAre, kAre, and vAre for "are".
  • qYou, kYou, and vYou for "you".

(Figure: each word's embedding eHow, eAre, eYou is projected into its own Query, Key, and Value vectors.)
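
Here is a minimal NumPy sketch of this step. The embedding size of 8, the vector size of 4, and the random weight matrices are illustrative assumptions; in a real transformer, the projection matrices are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                      # hypothetical embedding / vector sizes

# Toy embeddings eHow, eAre, eYou stacked as rows (one row per word).
E = rng.normal(size=(3, d_model))

# Projection matrices; random here, learned in a real model.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = E @ W_q    # rows: qHow, qAre, qYou
K = E @ W_k    # rows: kHow, kAre, kYou
V = E @ W_v    # rows: vHow, vAre, vYou
print(Q.shape, K.shape, V.shape)         # (3, 4) each
```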

Step 2: Comparing Queries and Keys

Once every word has its Query, Key, and Value vectors, the algorithm starts comparing them. Each Query (Q) is compared to the Keys (K) of all words in the sentence to decide how much attention to pay to each word.

Analogy:

Think of this like a team brainstorming session:

  • Each team member (word) asks: "Who has the most relevant ideas for me?" (Query).
  • Each team member shares what they know (Key).
  • Based on the comparison, they decide how much to focus on each person.

For example:

  • qHow checks the Keys kHow, kAre, and kYou to see which words are most relevant to "How".
  • Similarly, qAre compares with all Keys, and so on (a small numerical sketch follows this list).
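
Here is a minimal sketch of that comparison, using made-up Query and Key vectors (the numbers and the 4-dimensional size are arbitrary, not taken from any trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up Query and Key vectors for "How", "are", "you" (one row per word).
Q = rng.normal(size=(3, 4))    # rows: qHow, qAre, qYou
K = rng.normal(size=(3, 4))    # rows: kHow, kAre, kYou

# Each Query is compared with every Key using a dot product:
# scores[i, j] measures how relevant word j's Key is to word i's Query.
scores = Q @ K.T               # shape (3, 3)
print(scores[0])               # qHow against kHow, kAre, kYou
```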


Step 3: Calculating Attention Scores

The results of these comparisons are attention scores, which indicate how much focus one word should give to another. These scores are normalized using a softmax function so that all attention values add up to 1, making it easier for the algorithm to weigh them.

This process breaks down into three parts (a small numerical sketch of the softmax step follows the list):

1. Queries are compared with Keys to generate a matrix of raw attention scores.

2. The scores are normalized using softmax to distribute the attention.

3. The softmax-weighted scores are finally multiplied with the Values (V) to combine information.
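
To make the softmax step concrete, here is a small sketch with made-up raw scores for the three words (the numbers are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability, then exponentiate and normalize.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Made-up raw attention scores: row i holds the scores of word i
# ("How", "are", "you") against all three words.
raw_scores = np.array([[2.0, 0.5, 1.0],
                       [0.3, 1.5, 0.2],
                       [1.0, 0.4, 2.2]])

weights = softmax(raw_scores)
print(weights)                # each row is a probability distribution
print(weights.sum(axis=-1))   # every row sums to 1.0
```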

Step 4: Combining Information with Values

After calculating attention scores, the algorithm uses them to gather information from the Values (V) of the relevant words. The Values represent the actual content each word contributes to the sentence. The output is a new representation of each word that takes into account its context.

Example:

For the word "How":

  • It combines information from the Values of "How," "are," and "you" based on their attention scores.
  • This gives a context-rich representation of "How," embedding its relationship with the other words in the sentence (a small numerical sketch follows).
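
For instance, in a hypothetical case where "How" assigns attention weights of 0.6, 0.3, and 0.1 to "How", "are", and "you", its new representation is simply the weighted sum of the three Value vectors (all numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical Value vectors and attention weights for the word "How".
v_How = np.array([1.0, 0.0, 2.0, 1.0])
v_are = np.array([0.5, 1.0, 0.0, 0.0])
v_you = np.array([0.0, 2.0, 1.0, 0.5])
weights = np.array([0.6, 0.3, 0.1])   # attention "How" pays to How / are / you

# New, context-aware representation of "How": a weighted sum of the Values.
out_How = weights[0] * v_How + weights[1] * v_are + weights[2] * v_you
print(out_How)    # [0.75 0.5  1.3  0.65]
```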

This process ensures that every word in the sentence is understood in the context of the others, allowing the model to grasp nuanced meanings.
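
Putting the four steps together, here is a compact single-head self-attention sketch in NumPy. The dimensions and random weights are purely illustrative, and the division by the square root of the vector size is a standard stabilizing detail from the original Transformer paper that the steps above do not dwell on.

```python
import numpy as np

def self_attention(E, W_q, W_k, W_v):
    # Step 1: give every word its Query, Key, and Value vectors.
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    # Step 2: compare every Query with every Key (scaled dot products).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Step 3: softmax so each word's attention weights sum to 1.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 4: combine the Values according to those weights.
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
E = rng.normal(size=(3, d_model))        # embeddings for "How", "are", "you"
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out = self_attention(E, W_q, W_k, W_v)
print(out.shape)                         # (3, 4): one context-aware vector per word
```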


Stay tuned for Part 2, where we dive deeper into the inner workings of this remarkable mechanism!


