Understanding Transformers: A Deep Dive with PyTorch
Transformers, since their inception in 2017 with the paper "Attention Is All You Need" by Vaswani et al., have sparked a renaissance in the world of Natural Language Processing (NLP). They've set new benchmarks and given birth to models like BERT, GPT, and T5. But what makes them so special?
1. Overview of Transformers: Transformers use a mechanism called self-attention to process an entire sequence in parallel. This architectural innovation allows every element of the input sequence to be related to every other element directly, enabling the model to learn contextual relationships between words in a sentence, or elements in a sequence, regardless of their positional distances from each other.
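To make the idea concrete, here is a minimal sketch of scaled dot-product attention on toy tensors. It is illustrative only: the names q, k, v and the sizes are arbitrary choices for the example, not part of any particular library's API.
import torch
import torch.nn.functional as F
# Toy batch: 1 sequence of 4 tokens, each embedded in 8 dimensions.
q = torch.randn(1, 4, 8)   # queries
k = torch.randn(1, 4, 8)   # keys
v = torch.randn(1, 4, 8)   # values
# Every token attends to every other token in a single matrix multiply.
scores = q @ k.transpose(-2, -1) / (8 ** 0.5)   # (1, 4, 4) similarity matrix
weights = F.softmax(scores, dim=-1)             # each row sums to 1
context = weights @ v                           # (1, 4, 8) context vectors
Each output row is a weighted mix of all value vectors, which is exactly how distant tokens can influence each other in one step.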
The parallel processing capability of Transformers not only makes them significantly faster than their RNN and LSTM predecessors (which handle text data sequentially) but also more effective at capturing complex, long-range dependencies within the data. This efficiency is further augmented by the ability to scale Transformers horizontally, meaning they can be trained on vast datasets with large numbers of parameters, leveraging modern GPU architectures to their fullest.
2. Key Components: The architecture is built from a small set of repeating pieces: multi-head self-attention, position-wise feed-forward networks, residual connections with layer normalization, and learned token and positional embeddings, organized into an encoder stack and a decoder stack. Together these enable complex sequence-to-sequence tasks, such as language translation, by effectively learning relationships between elements in the input and output sequences.
3. Building a Simple Transformer with PyTorch:
import torch
import torch.nn as nn

# Defines the self-attention mechanism used in Transformer blocks.
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Fully connected layers for projecting the inputs.
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into 'heads' pieces for multi-head attention.
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        # Project values, keys, and queries per head.
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Dot-product attention between queries and keys: (N, heads, query_len, key_len).
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # Apply the mask if provided (useful for hiding padding or future tokens).
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k) (the per-head dimension) and softmax to get attention weights.
        attention = torch.nn.functional.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # Apply the attention weights to the values and merge the heads back together.
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        # Final fully connected layer.
        return self.fc_out(out)
# A Transformer block that combines self-attention and position-wise feedforward layers.
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out
# The Encoder: embeds the input sequence and passes it through a stack of Transformer blocks.
class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout, forward_expansion) for _ in range(num_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            # In the encoder, value, key, and query are all the same input.
            out = layer(out, out, out, mask)
        return out
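# The DecoderBlock used by the Decoder below: masked self-attention over the target,
# followed by a TransformerBlock that attends to the encoder output. This class is not
# shown in the original listing; the version here is a standard reconstruction inferred
# from how the Decoder calls it, i.e. layer(x, enc_out, enc_out, src_mask, trg_mask).
class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        # Masked self-attention over the target sequence (causal mask).
        attention = self.attention(x, x, x, trg_mask)
        query = self.dropout(self.norm(attention + x))
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        out = self.transformer_block(value, key, query, src_mask)
        return out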
# The Decoder: embeds the target sequence and generates the output based on the encoder
# output and the previously generated tokens.
class Decoder(nn.Module):
    def __init__(self, trg_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length):
        super(Decoder, self).__init__()
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList([
            DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
            for _ in range(num_layers)
        ])
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)
        out = self.fc_out(x)
        return out
# Defines the complete Transformer model including the encoder and decoder.
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=256,
        num_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0.1,
        device="cuda",
        max_length=100,
    ):
        super(Transformer, self).__init__()
        self.encoder = Encoder(
            src_vocab_size,
            embed_size,
            num_layers,
            heads,
            device,
            forward_expansion,
            dropout,
            max_length,
        )
        self.decoder = Decoder(
            trg_vocab_size,
            embed_size,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            device,
            max_length,
        )
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    # Creates a mask for the source sequence to prevent attention to padding tokens.
    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        return src_mask.to(self.device)

    # Creates a causal mask so the decoder can only attend to previous target tokens.
    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )
        return trg_mask.to(self.device)

    # Forward pass of the model.
    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out
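To check the model end to end, a minimal smoke test like the one below can be run. It is only a sketch: the vocabulary sizes, padding index, and token IDs are arbitrary values chosen for illustration.
# Minimal usage sketch with arbitrary vocab sizes and dummy token IDs (illustrative only).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Transformer(src_vocab_size=10, trg_vocab_size=10, src_pad_idx=0, trg_pad_idx=0, device=device).to(device)

src = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0]], device=device)   # source token IDs (0 = padding)
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0]], device=device)      # target token IDs

out = model(src, trg[:, :-1])   # teacher forcing: predict each next target token
print(out.shape)                # torch.Size([1, 7, 10]) -> (batch, trg_len - 1, trg_vocab_size)
During training, the output logits would be compared against trg[:, 1:] with a cross-entropy loss that ignores the padding index.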
4. Why Transformers Matter: The parallel processing power of Transformers combined with the self-attention mechanism has made them the state-of-the-art model for many NLP tasks. From translation services to chatbots, they've revolutionized the way machines understand language.
The Power of Parallel Processing
One of the groundbreaking features of Transformer models is their ability to process entire sequences of data simultaneously. Unlike their predecessors, such as RNNs and LSTMs, which process data sequentially and therefore are limited by longer processing times for longer sequences, Transformers leverage parallel processing. This means that they can handle sequences in their entirety, without needing to process one element at a time. This characteristic drastically reduces training times and allows for the handling of larger datasets more efficiently.
Mastery of Self-Attention
The self-attention mechanism is at the heart of the Transformer's success. It enables the model to dynamically weigh the relevance of different parts of the input data. For example, in a sentence, the model can learn to pay more attention to subjects when processing verbs, allowing it to understand context and nuances in language with remarkable effectiveness. This ability to understand and generate contextually relevant text has made Transformers exceptionally good at a range of tasks from summarization to content creation.
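As a rough illustration of this weighting, the snippet below uses made-up similarity scores (not learned values) to show how the softmax turns raw scores into attention weights that favour one token over the others.
import torch
# Made-up similarity scores of the verb "eats" against the tokens ["cat", "eats", "fish"].
scores = torch.tensor([2.0, 0.5, 1.0])
weights = torch.softmax(scores, dim=0)
print(weights)   # approximately [0.63, 0.14, 0.23]: the subject "cat" gets the most weight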
Versatility Across Domains
While Transformers originated in the field of NLP, their influence has spread to other areas of deep learning. Variants like Vision Transformers (ViTs) have shown impressive results in image classification tasks, demonstrating the architecture's versatility. The core principles of parallel processing and attention mechanisms have proven valuable across different types of data, making Transformers a go-to model for a wide array of machine learning challenges.
State-of-the-Art Performance
Transformers have consistently set new benchmarks for a variety of NLP tasks. Models like GPT (Generative Pretrained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) have achieved unprecedented accuracy in tasks such as machine translation, question-answering, and text classification. Their ability to capture deep linguistic structures and context has dramatically improved the quality of machine-generated text, making technologies like chatbots and virtual assistants more reliable and human-like.
Revolutionizing Machine Understanding and Interaction
The advancements brought by Transformers have significantly improved machine understanding of language, leading to more natural and effective human-computer interactions. Translation services have become more accurate and nuanced, making global communication easier. Chatbots and virtual assistants can provide more relevant and contextually appropriate responses, enhancing user experiences. Furthermore, the ability of these models to generate coherent and context-aware text has opened up new possibilities in content creation, from writing assistance to generating entirely new creative works.
Conclusion
The introduction and evolution of the Transformer architecture have undoubtedly revolutionized the field of deep learning. By enabling more efficient processing, deeper understanding of context, and versatile application across domains, Transformers have become an indispensable tool in the machine learning toolkit. As we continue to explore the limits and potential of these models, they promise to drive further innovation in NLP and beyond, making our interactions with machines more seamless and intuitive. Whether for translating languages, generating text, or even interpreting complex data in new ways, the Transformer architecture stands as a pillar of modern AI research and application.
#DeepLearning #Transformers #PyTorch #NLP