What are Transformers?
With all the buzz around Generative AI tools like ChatGPT, Gemini, DALL-E 2, AlphaCode, etc., which use Large Language Models (LLMs) such as GPT, BERT, Cohere, LLAMA, and Mistral, it is crucial to look at the work that influenced it all.
Background: The Pre-Transformer Era
Before Transformers, NLP models heavily relied on Recurrent Neural Networks (RNNs) and their more sophisticated siblings, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
These models were capable of processing sequential data (which means they can process text one word at a time) with a degree of context awareness, an important point to keep in mind.
While RNNs and LSTMs had their respective moments of glory, these models had their own limitations:
1. Sequential processing: tokens must be handled one at a time, which prevents parallelization and makes training slow.
2. Long-range dependencies: information from early tokens tends to fade (vanishing/exploding gradients), making distant context hard to capture.
From LSTMs to LLMs, we have witnessed major advancements in the domain of sequence-to-sequence learning.
Before diving into transformers, it’s important to note that the origin of Transformers was seeded by an improvement to the encoder-decoder architecture proposed by Ilya Sutskever and his team in their paper “Sequence to Sequence Learning with Neural Networks” (2014).
What is “Attention is All You Need”?
At the heart of the LLM breakthrough lies the key paper “Attention Is All You Need” by Vaswani et al., a group of researchers at Google Brain, published in 2017. Despite its deceptively straightforward title, this paper completely changed the approach used for machine learning tasks involving sequential data.
What is the Transformer?
The transformer is a stack of neural network layers consisting of an encoder and a decoder with self-attention capabilities, tossing aside the limitations of RNNs and their variants.
Instead of processing words sequentially (one per timestep), transformers can handle entire sentences or documents at once by processing them in parallel. This approach not only made them faster but also more accurate in capturing the context of words in a sentence (we will have a detailed discussion around this in later articles).
Breaking Down the Transformer Architecture
1. Input Embedding
First, the input sequence of text is converted into fixed-size vectors, or input embeddings, capturing the lexical and syntactic features of the text.
This layer maps each token to a high-dimensional embedding space where semantically similar tokens stay closer together.
Consider the sentence: “Transformers enhance LLM capabilities”. Here the tokens “Transformers,” “enhance,” “LLM,” and “capabilities” are transformed into embeddings, where “Transformers” and “LLM” will lie closer together.
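To make this concrete, here is a minimal PyTorch sketch of an embedding layer; the token ids and layer sizes are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary ids for: "Transformers enhance LLM capabilities"
token_ids = torch.tensor([[12, 431, 87, 905]])  # shape: (batch=1, seq_len=4)

vocab_size, d_model = 1000, 512  # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)

x = embedding(token_ids)  # shape: (1, 4, 512) -- one vector per token
print(x.shape)            # torch.Size([1, 4, 512])
```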
2. Positional Encoding
Since transformers process the entire sentence at once, they need a way to remember the order of words. Positional encoding is added to the token embeddings to provide information about the position of each token in the sequence.
Note: This also helps the model distinguish between tokens that share the same embedding but appear at different positions.
As shown in the illustration, point-wise positional encoding is added to the respective token embedding to help the model better understand the sequence order.
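For illustration, below is a sketch of the sinusoidal positional encoding scheme proposed in “Attention Is All You Need”; the sizes in the usage comment are illustrative:

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# Added point-wise to the token embeddings:
# x = embedding(token_ids) + positional_encoding(seq_len=4, d_model=512)
```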
3. Encoder-Decoder Structure
The Transformer model follows an encoder-decoder architecture (a minimal sketch follows below):
1. The encoder reads the entire input sequence and produces a contextual representation, one vector per token.
2. The decoder generates the output sequence token by token, attending both to its own previously generated tokens and to the encoder’s representations.
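As a rough sketch, PyTorch’s built-in nn.Transformer wires these two components together; all sizes here are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch using PyTorch's built-in Transformer.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.rand(1, 4, 512)   # encoder input:  4 source-token embeddings
tgt = torch.rand(1, 3, 512)   # decoder input:  3 target-token embeddings
out = model(src, tgt)         # (1, 3, 512) -- one contextual vector per target position
```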
4. Attention Layers
At the core of the transformer resides the attention mechanism, which enhances the encoder-decoder architecture by enabling the model to focus on different parts of the input sequence dynamically.
There are three types of attention mechanisms in the transformer model:
1. Self-attention in the encoder, where each input token attends to every other input token.
2. Masked self-attention in the decoder, where each token attends only to itself and the tokens before it.
3. Encoder-decoder (cross) attention, where each decoder position attends to the encoder’s contextual embeddings.
Here, in our case, “Transformers” attends to “enhance,” “LLM,” and “capabilities” to understand its contextual importance (i.e., how it relates to these words).
While predicting “capabilities,” the decoder might focus on the encoder’s contextual embeddings for “Transformers,” “enhance,” and “LLM”, thereby attending to the relevant parts of the input sequence.
An auto-regressive model is a self-predictive model: it predicts a token, then that token is used to predict the next one, and so on until the specified number of tokens is reached.
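Putting the three variants together, here is a minimal sketch of scaled dot-product attention, the building block behind all of them; the tensor sizes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # masked self-attention
    weights = F.softmax(scores, dim=-1)                        # rows sum to 1
    return weights @ v

# Self-attention: q, k, v all come from the same sequence.
x = torch.rand(1, 4, 64)                 # 4 tokens, d_k = 64
out = scaled_dot_product_attention(x, x, x)

# Causal mask for the decoder: each position sees only itself and the past.
causal = torch.tril(torch.ones(4, 4))
out_masked = scaled_dot_product_attention(x, x, x, mask=causal)
```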
5. Feed-Forward Networks
After the attention mechanisms, the model passes the information through position-wise feed-forward networks, which apply fully connected layers independently to each position in the sequence, enabling the model to capture complex non-linear relationships between tokens.
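As a rough sketch, a position-wise feed-forward network is just two linear layers with a non-linearity in between (512 and 2048 are the sizes used in the original paper):

```python
import torch.nn as nn

# Applied identically and independently at every position in the sequence.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
# out = ffn(x)  # x: (batch, seq_len, 512) -> (batch, seq_len, 512)
```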
6. Layer Normalization and Residual Connections
The “Add & Norm” operation in a Transformer adds a sub-layer’s input to its output (a residual connection) and then normalizes the combined result. This process helps stabilize training and promotes effective information (gradient) flow through the network.
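A minimal sketch of the “Add & Norm” step, assuming the standard post-norm arrangement from the original paper:

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        # Add the sub-layer's input to its output, then normalize the sum.
        return self.norm(x + sublayer_output)

# add_norm = AddAndNorm()
# x = add_norm(x, ffn(x))  # wraps the feed-forward (or attention) sub-layer
```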
7. Linear Layer
The normalized sequence of vectors from the last decoder layer, capturing one contextualized representation for each position in the input sequence, is passed through a linear layer.
Architecturally, the linear layer is a fully-connected NN layer that applies a linear transformation to the input using a weight matrix and a bias vector.
8. Softmax Function
After the linear transformation, a softmax function is applied on the output to produce a probability distribution over the vocabulary for each position in the sequence.
The softmax function is a common activation function that converts the logits into probabilities. It ensures that the output values sum to 1, so the most likely token can be selected as output.
This probability distribution represents the model’s confidence in each possible token for the given position being the next word in the output sequence.
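To illustrate steps 7 and 8 together, here is a minimal sketch of the linear projection followed by softmax; the vocabulary and model sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 512          # illustrative sizes
to_logits = nn.Linear(d_model, vocab_size)

decoder_out = torch.rand(1, 3, d_model)  # last decoder layer, 3 positions
logits = to_logits(decoder_out)          # (1, 3, vocab_size)
probs = F.softmax(logits, dim=-1)        # each position's row sums to 1

next_token = probs[0, -1].argmax()       # greedy pick at the final position
```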
9. Output Prediction
During training, the model uses the teacher forcing method, where the true previous token is fed into the decoder at each step.
During inference, the model can select the most probable token at each step (greedy search), sample from the probability distribution, or use more advanced techniques like beam search to generate the next token in the sequence in an auto-regressive manner.
The predicted output token is fed back into the decoder as input for the next time step, along with previously generated tokens and the encoder’s hidden states.
This process is repeated iteratively until an end-of-sequence token (e.g., <eos>) is generated or a predetermined maximum length is reached.
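As a rough sketch of this loop, assuming a hypothetical `model(src, tgt)` that returns per-position vocabulary logits:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Auto-regressive greedy decoding sketch; `model` is assumed to map
    (src, tgt) token ids to per-position vocabulary logits."""
    tgt = torch.tensor([[bos_id]])                  # start with the <bos> token
    for _ in range(max_len):
        logits = model(src, tgt)                    # (1, tgt_len, vocab_size)
        next_id = logits[0, -1].argmax().item()     # most probable next token
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                       # stop at end-of-sequence
            break
    return tgt
```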
Why Were Transformers Created?
Transformers are the backbone of many state-of-the-art NLP models, including BERT, GPT, T5, etc., as they offer:
1. Parallel processing of entire sequences, making training dramatically faster than recurrent models.
2. Better handling of long-range dependencies, since attention connects any two positions directly.
3. Scalability to very large models and datasets.
4. Strong transfer learning, where pre-trained models can be fine-tuned for many downstream tasks.
Real World Applications of Transformers
Transformers have found their way into numerous machine and deep learning applications, transforming how we interact with technology nowadays.
Challenges and Future Directions
While Transformers have achieved remarkable success, they’re not without their challenges:
1. High computational cost: training Transformers requires significant time and resources.
2. Low interpretability: these “black box” models make it hard to understand how they reach their decisions.
3. Ensuring fairness and reducing bias in Transformer models is a critical area of ongoing research.
4. Scalability becomes increasingly challenging as parameter counts grow. Techniques like model pruning, quantization, and knowledge distillation are being explored to address this issue.
Conclusion
In a nutshell, transformers have marked a turning point in the field of NLP. Entirely based on the attention mechanism, they offer speed, accuracy, and versatility that were previously unimaginable. They’ve become the foundation for many cutting-edge Gen-AI applications, from language understanding to image processing, and much more.
And that wraps it up; today we’ve just scratched the surface of the Transformer architecture. With continued research and innovation, the future holds many more exciting possibilities.