Unveiling Transformers: The Fascinating Architecture Shaping the Future
Rania FatmaZohra Rezkellah
Computer Science Engineer @ESI | IASD MSc Student @PSL | AI intern @CERN | SWE trainee @A2SV
ChatGPT, born from GPT-3.5, revolutionized tech by offering tailored assistance and clearing academic hurdles with ease. Large Language Models (LLMs) fuel this transformation, reshaping AI interactions and decision-making. Behind these marvels lies the Transformer architecture, the unsung hero enabling nuanced conversations and intricate problem-solving. Stay tuned to explore this foundational structure redefining AI capabilities.
Beneath the Surface: Exploring the Wonders of Transformers Architecture
The concept of the Transformer was introduced in a 2017 paper titled “Attention Is All You Need” by Ashish Vaswani, Noam Shazeer, Niki Parmar, and five other authors.
Let’s begin by viewing this architecture as a black box. In a machine translation (MT) task, it takes a sentence in one language and outputs its translation in another.
Deep inside this black box, we find an encoder component, a decoder component, and connections between them.
The encoding component is a stack of encoders (the original paper stacks six encoders on top of each other). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure. Each one is broken down into two sub-layers:
The encoder begins by processing its inputs through a self-attention layer. This pivotal layer enables the encoder to examine other words within the input sentence while encoding a particular word. Later in this post, we'll delve deeper into the mechanics of self-attention.
Next, the outputs of the self-attention layer pass through a feed-forward neural network. The exact same feed-forward network is applied independently at each position, completing the encoder's processing of the input sequence.
Note that the encoders don’t share weights.
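To make this more concrete, here is a minimal PyTorch sketch of one such encoder layer (not the authors' reference implementation). The 512 model size, 8 attention heads, and 2048 feed-forward size follow the original paper, and the residual connections and layer normalization around each sub-layer are part of the original design as well.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention followed by a position-wise feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # every position looks at every other position
        x = self.norm1(x + attn_out)            # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))          # feed-forward applied at each position
        return x

# The encoding component stacks six such layers, each with its own weights.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(1, 10, 512)                     # a batch of one 10-word sentence
print(encoder(x).shape)                         # torch.Size([1, 10, 512])
```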
The decoder incorporates both of these layers, with an encoder-decoder attention layer positioned between them. This layer helps the decoder focus on relevant segments of the input sentence, functioning much like the attention mechanism in sequence-to-sequence models.
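Building on the encoder sketch above, a decoder layer of this kind might look as follows; again this is only an illustrative sketch, and the masking the decoder needs during training is omitted for brevity.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Self-attention, then encoder-decoder attention, then a feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, y, enc_out):
        a, _ = self.self_attn(y, y, y)              # attend over the target words produced so far
        y = self.norms[0](y + a)
        # Encoder-decoder attention: queries come from the decoder,
        # keys and values come from the encoder's output.
        a, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norms[1](y + a)
        return self.norms[2](y + self.ff(y))

enc_out = torch.randn(1, 10, 512)                   # encoder output for a 10-word source sentence
tgt = torch.randn(1, 7, 512)                        # embeddings of 7 target words
print(DecoderLayer()(tgt, enc_out).shape)           # torch.Size([1, 7, 512])
```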
Digging deeper into tensors
Having explored the key constituents of the model, let's now delve into the flow of vectors/tensors across these components, transforming the input of a trained model into an output.
In typical NLP applications, each input word undergoes conversion into a numerical vector through an embedding algorithm.
In the Transformer, the embedding happens only in the bottom-most encoder, where every word is turned into a vector of size 512. The encoders above it share the same abstraction: each receives a list of 512-dimensional vectors. The length of this list is a hyperparameter we can set, typically the length of the longest sentence in the training dataset.
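As a small illustration, here is roughly what that embedding step looks like in PyTorch; the vocabulary size and token ids below are made up for the example.

```python
import torch
import torch.nn as nn

d_model = 512
vocab_size = 10_000                            # hypothetical vocabulary size for this sketch
embedding = nn.Embedding(vocab_size, d_model)

# Pretend these integer ids came from a tokenizer for a three-word sentence.
token_ids = torch.tensor([[71, 1245, 5309]])   # shape: (batch=1, seq_len=3)
x = embedding(token_ids)
print(x.shape)                                 # torch.Size([1, 3, 512]): one 512-d vector per word
```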
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
This is the initial glimpse into a fundamental aspect of the Transformer model, where individual words in each position chart their unique paths within the encoder. While there exist interdependencies among these paths in the self-attention layer, the feed-forward layer operates independently. This autonomy enables the diverse paths to run concurrently as they progress through the feed-forward layer.
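A quick sketch makes this independence concrete: the feed-forward network gives the same result whether it processes the whole sequence at once or each position on its own (the dimensions below are only illustrative).

```python
import torch
import torch.nn as nn

ff = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

x = torch.randn(5, 512)                                # 5 positions, each a 512-d vector
whole = ff(x)                                          # process all positions in one call
per_pos = torch.stack([ff(x[i]) for i in range(5)])    # process each position separately

print(torch.allclose(whole, per_pos, atol=1e-6))       # True: positions don't interact in this layer
```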
Switching to a shorter example sentence, let's now look at what happens in each sub-layer of the encoder.
Self-attention at a higher level
Consider this sentence: "The cat didn't eat its food because it was sick."
Now, ponder over the reference of "it" in this sentence. Does it point to the food or the cat? A seemingly simple question for a human but a complex one for an algorithm.
As the model processes the word "it," self-attention enables it to associate "it" with "cat."
With every word processing stage (each position in the input sequence), self-attention empowers the model to survey other positions within the sequence, seeking clues that enhance the encoding of the word under consideration.
If you're acquainted with Recurrent Neural Networks (RNNs), consider how maintaining a hidden state enables an RNN to amalgamate its interpretation of previous words/vectors with the current one it's processing. Self-attention is the Transformer's method to infuse the "understanding" of other pertinent words into the one currently being processed.
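For readers who want to see the mechanics, here is a minimal PyTorch sketch of the scaled dot-product attention computation at the heart of this idea. The projection weights are random, so the attention pattern itself is meaningless; the point is the shape of the computation.

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 512, 64
W_q = nn.Linear(d_model, d_k, bias=False)   # query projection
W_k = nn.Linear(d_model, d_k, bias=False)   # key projection
W_v = nn.Linear(d_model, d_k, bias=False)   # value projection

x = torch.randn(10, d_model)                # embeddings of a 10-word sentence

Q, K, V = W_q(x), W_k(x), W_v(x)            # project each word into query / key / value vectors
scores = Q @ K.T / math.sqrt(d_k)           # how strongly each word attends to every other word
weights = torch.softmax(scores, dim=-1)     # each row is a distribution over the sequence
out = weights @ V                           # each output is a weighted blend of all value vectors
print(out.shape)                            # torch.Size([10, 64])
```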
For further detail on the self-attention mechanism, check out my article on Medium.
Other key components of this fancy architecture
Two other key components of the Transformer architecture are positional encodings and the multi-head attention mechanism.
Positional encodings serve the purpose of embedding the sequential order of input within a given sequence. This innovative approach allows for the parallel processing of words in a sentence, departing from the conventional sequential feeding into the neural network. This is actually done by adding a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.
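The original paper builds this pattern from sine and cosine functions of different frequencies (learned positional embeddings are a common alternative). A compact sketch:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len).unsqueeze(1)                                       # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))  # frequencies
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

x = torch.randn(30, 512)                           # embeddings of a 30-word sentence
x = x + sinusoidal_positional_encoding(30)         # inject position information by simple addition
```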
Multi-head self-attention allows the model to weigh the importance of different tokens in the input when making predictions for a particular token. The “multi-head” aspect allows the model to learn different relationships between tokens at different positions and levels of abstraction.
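Conceptually, multi-head attention runs the attention computation described above several times in parallel with different learned projections and combines the results. PyTorch's built-in module illustrates the shapes involved (a sketch, not the paper's reference code):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8                       # 8 heads, each working on 64-dimensional projections
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, 10, d_model)                 # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)                # self-attention: queries, keys, and values are all x
print(out.shape)                                # torch.Size([1, 10, 512])
print(attn_weights.shape)                       # torch.Size([1, 10, 10]), averaged over heads by default
```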
The final stage: The Linear and Softmax Layers
The output of the decoder stack is a vector of floating-point numbers. But how do we turn this into an actual word? That is the job of the final Linear layer, followed by a Softmax layer.
The Linear layer is a simple fully connected neural network that maps the vector produced by the stack of decoders into a much larger vector called a logits vector.
Assuming our model has learned from a training dataset containing 10,000 distinct English words (termed as our model's "output vocabulary"), the logits vector would be 10,000 cells wide. Each cell within this vector corresponds to the score of a unique word, presenting the interpretation of the model's output post the Linear layer.
Following the Linear layer, the softmax layer turns these scores into probabilities, ensuring all values are positive and collectively sum to 1.0. The word associated with the cell holding the highest probability is selected (using the argmax function) and becomes the output for that time step.
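Here is a brief sketch of this final step, reusing the hypothetical 10,000-word vocabulary from the example above.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10_000
to_logits = nn.Linear(d_model, vocab_size)      # the final Linear layer

decoder_out = torch.randn(1, d_model)           # decoder output for one time step
logits = to_logits(decoder_out)                 # shape (1, 10000): one score per vocabulary word
probs = torch.softmax(logits, dim=-1)           # positive values that sum to 1.0
next_word_id = probs.argmax(dim=-1)             # greedily pick the most probable word
print(next_word_id.shape)                       # torch.Size([1])
```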
References