What are Transformers?
With all the buzz around Generative AI tools like ChatGPT, Gemini, DALL-E 2, AlphaCode, etc., which use Large Language Models (LLMs) such as GPT, BERT, Cohere, LLAMA, and Mistral, it is crucial to look at the work that influenced it all.
Background: The Pre-Transformer Era
Before Transformers, NLP models heavily relied on Recurrent Neural Networks (RNNs) and their more sophisticated siblings, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
These models were capable of processing sequential data (which means they can process text one word at a time) with a degree of context awareness, an important point to keep in mind.
While RNNs and LSTMs had their respective moments of glory, these models had their own limitations:
1. Sequential processing: tokens must be handled one at a time, which prevents parallelization and makes training slow.
2. Long-range dependencies: information from early tokens tends to fade (vanishing/exploding gradients), making distant context hard to capture.
From LSTMs to LLMs, we have witnessed major advancements in the domain of sequence-to-sequence learning.
Before diving into transformers, it’s important to note that the origin of Transformers was seeded by an improvement to the encoder-decoder architecture proposed by Ilya Sutskever and his team in their paper “Sequence to Sequence Learning with Neural Networks” (2014).
What is “Attention is All You Need”?
At the heart of the LLM breakthrough lies the key paper “Attention Is All You Need” by Vaswani et al., a group of researchers at Google Brain, published in 2017. Despite its deceptively straightforward title, this paper completely changed the approach used for machine learning tasks involving sequential data.
What is the Transformer?
The transformer is a stack of neural network layers consisting of an encoder and a decoder with self-attention capabilities, tossing aside the limitations of RNNs and their variants.
Instead of processing words sequentially (one per timestep), transformers can handle entire sentences or documents at once by processing them in parallel. This approach not only made them faster but also more accurate in capturing the context of words in a sentence (we will have a detailed discussion around this in later articles).
Breaking Down the Transformer Architecture
1. Input Embedding
First, the input sequence of text is converted into fixed-size vectors, or input embeddings, capturing the lexical and syntactic features of the text.
This layer maps each token to a high-dimensional embedding space where semantically similar tokens stay closer together.
Consider the sentence: “Transformers enhance LLM capabilities”. Here the tokens “Transformers,” “enhance,” “LLM,” and “capabilities” are transformed into embeddings, where “Transformers” and “LLM” will lie closer together.
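To make this concrete, here is a minimal PyTorch sketch of an embedding layer; the token ids and layer sizes are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary ids for: "Transformers enhance LLM capabilities"
token_ids = torch.tensor([[12, 431, 87, 905]])  # shape: (batch=1, seq_len=4)

vocab_size, d_model = 1000, 512  # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)

x = embedding(token_ids)  # shape: (1, 4, 512) -- one vector per token
print(x.shape)            # torch.Size([1, 4, 512])
```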
2. Positional Encoding
Since transformers process the entire sentence at once, they need a way to remember the order of words. Positional encoding is added to the token embeddings to provide information about the position of each token in the sequence.
Note: This also helps the model distinguish between tokens that share the same embedding but appear at different positions.
As shown in the illustration, point-wise positional encoding is added to the respective token embedding to help the model better understand the sequence order.
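For illustration, below is a sketch of the sinusoidal positional encoding scheme proposed in “Attention Is All You Need”; the sizes in the usage comment are illustrative:

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# Added point-wise to the token embeddings:
# x = embedding(token_ids) + positional_encoding(seq_len=4, d_model=512)
```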
3. Encoder-Decoder Structure
The Transformer model follows an encoder-decoder architecture (a minimal sketch follows below):
1. The encoder reads the entire input sequence and produces a contextual representation, one vector per token.
2. The decoder generates the output sequence token by token, attending both to its own previously generated tokens and to the encoder’s representations.
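As a rough sketch, PyTorch’s built-in nn.Transformer wires these two components together; all sizes here are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch using PyTorch's built-in Transformer.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.rand(1, 4, 512)   # encoder input:  4 source-token embeddings
tgt = torch.rand(1, 3, 512)   # decoder input:  3 target-token embeddings
out = model(src, tgt)         # (1, 3, 512) -- one contextual vector per target position
```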
4. Attention Layers
At the core of the transformer resides the attention mechanism, which enhances the encoder-decoder architecture by enabling the model to focus on different parts of the input sequence dynamically.
There are three types of attention mechanisms in the transformer model:
1. Self-attention in the encoder, where each input token attends to every other input token.
2. Masked self-attention in the decoder, where each token attends only to itself and the tokens before it.
3. Encoder-decoder (cross) attention, where each decoder position attends to the encoder’s contextual embeddings.
Here, in our case, “Transformers” attends to “enhance,” “LLM,” and “capabilities” to understand its contextual importance (i.e., how it relates to these words).
While predicting “capabilities,” the decoder might focus on the encoder’s contextual embeddings for “Transformers,” “enhance,” and “LLM”, thereby attending to the relevant parts of the input sequence.
An auto-regressive model is a self-predictive model: it predicts a token, then that token is used to predict the next one, and so on until the specified number of tokens is reached.
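Putting the three variants together, here is a minimal sketch of scaled dot-product attention, the building block behind all of them; the tensor sizes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # masked self-attention
    weights = F.softmax(scores, dim=-1)                        # rows sum to 1
    return weights @ v

# Self-attention: q, k, v all come from the same sequence.
x = torch.rand(1, 4, 64)                 # 4 tokens, d_k = 64
out = scaled_dot_product_attention(x, x, x)

# Causal mask for the decoder: each position sees only itself and the past.
causal = torch.tril(torch.ones(4, 4))
out_masked = scaled_dot_product_attention(x, x, x, mask=causal)
```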
5. Feed-Forward Networks
After the attention mechanisms, the model passes the information through position-wise feed-forward networks, which apply fully connected layers independently to each position in the sequence, enabling the model to capture complex non-linear relationships between tokens.
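As a rough sketch, a position-wise feed-forward network is just two linear layers with a non-linearity in between (512 and 2048 are the sizes used in the original paper):

```python
import torch.nn as nn

# Applied identically and independently at every position in the sequence.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
# out = ffn(x)  # x: (batch, seq_len, 512) -> (batch, seq_len, 512)
```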
6. Layer Normalization and Residual Connections
The “Add & Norm” operation in a Transformer adds a sub-layer’s input to its output (a residual connection) and then normalizes the combined result. This process helps stabilize training and promotes effective information (gradient) flow through the network.
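A minimal sketch of the “Add & Norm” step, assuming the standard post-norm arrangement from the original paper:

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        # Add the sub-layer's input to its output, then normalize the sum.
        return self.norm(x + sublayer_output)

# add_norm = AddAndNorm()
# x = add_norm(x, ffn(x))  # wraps the feed-forward (or attention) sub-layer
```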
7. Linear Layer
The normalized sequence of vectors from the last decoder layer, capturing one contextualized representation for each position in the input sequence, is passed through a linear layer.
Architecturally, the linear layer is a fully-connected NN layer that applies a linear transformation to the input using a weight matrix and a bias vector.
8. Softmax Function
After the linear transformation, a softmax function is applied on the output to produce a probability distribution over the vocabulary for each position in the sequence.
The softmax function is a common activation function that converts the logits into probabilities. It ensures that the output values sum to 1, so the most likely token can be selected as output.
This probability distribution represents the model’s confidence in each possible token for the given position being the next word in the output sequence.
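To illustrate steps 7 and 8 together, here is a minimal sketch of the linear projection followed by softmax; the vocabulary and model sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 512          # illustrative sizes
to_logits = nn.Linear(d_model, vocab_size)

decoder_out = torch.rand(1, 3, d_model)  # last decoder layer, 3 positions
logits = to_logits(decoder_out)          # (1, 3, vocab_size)
probs = F.softmax(logits, dim=-1)        # each position's row sums to 1

next_token = probs[0, -1].argmax()       # greedy pick at the final position
```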
9. Output Prediction
During training, the model uses the teacher forcing method, where the true previous token is fed into the decoder at each step.
During inference, the model can select the most probable token at each step (greedy search), sample from the probability distribution, or use more advanced techniques like beam search to generate the next token in the sequence in an auto-regressive manner.
The predicted output token is fed back into the decoder as input for the next time step, along with previously generated tokens and the encoder’s hidden states.
This process is repeated iteratively until an end-of-sequence token (e.g., <eos>) is generated or a predetermined maximum length is reached.
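As a rough sketch of this loop, assuming a hypothetical `model(src, tgt)` that returns per-position vocabulary logits:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Auto-regressive greedy decoding sketch; `model` is assumed to map
    (src, tgt) token ids to per-position vocabulary logits."""
    tgt = torch.tensor([[bos_id]])                  # start with the <bos> token
    for _ in range(max_len):
        logits = model(src, tgt)                    # (1, tgt_len, vocab_size)
        next_id = logits[0, -1].argmax().item()     # most probable next token
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                       # stop at end-of-sequence
            break
    return tgt
```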
Why Were Transformers Created?
Transformers are the backbone of many state-of-the-art NLP models, including BERT, GPT, T5, etc., as they offer:
1. Parallel processing of entire sequences, making training dramatically faster than recurrent models.
2. Better handling of long-range dependencies, since attention connects any two positions directly.
3. Scalability to very large models and datasets.
4. Strong transfer learning, where pre-trained models can be fine-tuned for many downstream tasks.
Real World Applications of Transformers
Transformers have found their way into numerous machine and deep learning applications, transforming how we interact with technology nowadays.
Challenges and Future Directions
While Transformers have achieved remarkable success, they’re not without their challenges:
1. High computational cost: training Transformers requires significant time and resources.
2. Low interpretability: these “black box” models make it hard to understand how they reach their decisions.
3. Ensuring fairness and reducing bias in Transformer models is a critical area of ongoing research.
4. Scalability becomes increasingly challenging as parameter counts grow. Techniques like model pruning, quantization, and knowledge distillation are being explored to address this issue.
Conclusion
In a nutshell, transformers have marked a turning point in the field of NLP. Entirely based on the attention mechanism, they offer speed, accuracy, and versatility that were previously unimaginable. They’ve become the foundation for many cutting-edge Gen-AI applications, from language understanding to image processing, and much more.
And that wraps it up; today we’ve just scratched the surface of the Transformer architecture. With continued research and innovation, the future holds many more exciting possibilities.