Navigating the GenAI Frontier: Transformers, GPT, and the Path to Accelerated Innovation
Matala Sushma
Aspiring Data Scientist | Student at Nadimpalli Satyanarayana Raju Institute of Technology
What is Generative AI?
Think of Generative AI as a super-powered AI that doesn't just pull answers, procedures, and much more from the web almost instantaneously. This technology goes beyond copying existing work: it steps into the creative realm of producing entirely new things, such as beautiful music or extraordinary stories.
Unlike ordinary AI that simply memorizes facts, Generative AI digs deep into text, code, and even images, learning their underlying structure to pick up the secret sauce of content generation.
How Does it Work?
Picture Generative AI being handed a huge fruit bowl full of text (articles, books, code, with no sentence left out). It analyzes the ingredients: which words are chosen, what kinds of sentences appear, and how ideas follow one another. From this analysis, the model learns the key ingredients for producing new, creative content.
So why should we rely on this technology to create for us?
The possibilities of generative AI are endless! Here are just a few ways it can be your creative sidekick: drafting stories, writing and explaining code, composing music, and much more.
1. Historical Context: the Seq2Seq Paper and the "Neural Machine Translation by Jointly Learning to Align and Translate" Paper
The Sequence-to-Sequence (Seq2Seq) model is a neural network architecture designed for sequence transduction tasks, where an input sequence is mapped to an output sequence. It was introduced in the paper "Sequence to Sequence Learning with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, published in 2014.
The Seq2Seq model consists of two main components: an encoder and a decoder. These components are typically implemented using recurrent neural networks (RNNs) or their variants like Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells.
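To make the idea concrete, here is a minimal sketch of a Seq2Seq model in PyTorch using GRU cells; the class name, layer sizes, and batch-first layout are illustrative choices, not details taken from the paper:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder compresses the source sequence
    into its final hidden state, which then initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # src_ids, tgt_ids: (batch, time) integer token ids
        _, hidden = self.encoder(self.src_emb(src_ids))          # (1, batch, hid_dim)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), hidden)  # (batch, time, hid_dim)
        return self.out(dec_out)                                   # (batch, time, tgt_vocab)
```

The single fixed-size hidden state handed from encoder to decoder is exactly the bottleneck that the attention mechanism, described next, was designed to remove.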
The paper "Neural Machine Translation by Jointly Learning to Align and Translate" by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, published in 2014, introduced a breakthrough approach to neural machine translation (NMT) by incorporating an attention mechanism into the Seq2Seq framework.
Here's an overview of the key contributions and innovations of the paper:
- An attention mechanism that lets the decoder look back at all encoder states and build a context vector as their weighted sum, instead of squeezing the whole source sentence into a single fixed-length vector.
- Alignment and translation are learned jointly: the attention weights act as a soft alignment between source and target words.
- A bidirectional RNN encoder, so each source word's annotation summarizes both its left and right context.
- Noticeably better translation of long sentences, which the fixed-length encoding of the original Seq2Seq model handled poorly.
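Below is a hedged sketch of the additive (Bahdanau-style) attention step. It assumes encoder states of shape (batch, time, hidden) and a single decoder state of shape (batch, hidden); all module and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores each encoder state against the current decoder state and
    returns the context vector as the attention-weighted sum of encoder states."""
    def __init__(self, hid_dim):
        super().__init__()
        self.W_enc = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_dec = nn.Linear(hid_dim, hid_dim, bias=False)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (B, H), enc_states: (B, T, H)
        scores = self.v(torch.tanh(self.W_enc(enc_states) +
                                   self.W_dec(dec_state).unsqueeze(1)))  # (B, T, 1)
        weights = F.softmax(scores, dim=1)                               # soft alignment over T
        context = (weights * enc_states).sum(dim=1)                      # (B, H)
        return context, weights
```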
2. Introduction to Transformers and How Their Components Work
The Transformer is a model introduced in the paper "Attention Is All You Need" in 2017. It is based solely on attention mechanisms, i.e., without recurrence or convolutions. On top of higher translation quality, the model is faster to train, by up to an order of magnitude. Today, Transformers (with variations) are the de facto standard not only for sequence-to-sequence tasks but also for language modeling and pretraining, which we touch on later in this article.
The Transformer introduced a new modeling paradigm: in contrast to previous models, where processing within the encoder and decoder was done with recurrence or convolutions, the Transformer operates using only attention.
1. Self-Attention
Self-attention is one of the key components of the model. The difference between attention and self-attention is that self-attention operates between representations of the same nature: e.g., all encoder states in some layer.
Self-attention is the part of the model where tokens interact with each other. Each token "looks" at other tokens in the sentence with an attention mechanism, gathers context, and updates the previous representation of "self".
Query, Key, and Value in Self-Attention
Formally, this intuition is implemented with query-key-value attention. Each input token in self-attention receives three representations corresponding to the roles it can play:
- Query: used when the token "asks" other tokens for information;
- Key: used when other tokens ask this token for information, i.e., when it responds to a query;
- Value: the information the token hands over once a query matches its key.
The query of each token is compared against the keys of all tokens, and the values are combined with weights proportional to how well each key matches.
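In the Transformer this takes the form of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal single-head sketch in PyTorch; the projection matrices are passed in explicitly for clarity, and the function name is illustrative:

```python
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over token representations x of shape (B, T, d).
    W_q, W_k, W_v are (d, d_k) projection matrices: every token is projected
    into a query, a key, and a value."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v                         # each (B, T, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))    # (B, T, T)
    weights = torch.softmax(scores, dim=-1)                     # each token attends to all tokens
    return weights @ v                                           # (B, T, d_k)
```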
2. Masked Self-Attention
In the decoder, self-attention is a bit different from the one in the encoder. While the encoder receives all tokens at once and every token can look at all tokens in the input sentence, the decoder generates one token at a time: during generation, we don't know which tokens we'll generate in the future. To keep training consistent with this, the decoder's self-attention is masked so that each token can only attend to the tokens that come before it.
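A minimal sketch of this masking, reusing the single-head attention above: positions above the diagonal of the score matrix are set to negative infinity before the softmax, so each token ignores everything that comes after it (names are illustrative):

```python
import math
import torch

def masked_self_attention(x, W_q, W_k, W_v):
    """Decoder-style self-attention: position i may only attend to positions <= i."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))    # (B, T, T)
    T = x.size(1)
    causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))     # hide future tokens
    return torch.softmax(scores, dim=-1) @ v
```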
3. Multi-Head Attention
Usually, understanding the role of a word in a sentence requires understanding how it is related to different parts of that sentence. This is important not only when processing the source sentence but also when generating the target.
For example, in some languages, subjects determine verb inflection (e.g., gender agreement), verbs determine the case of their objects, and so on. The point is that each word takes part in many relations, and multi-head attention lets the model look at these different relations in parallel: the representation is split across several heads, each of which can focus on a different aspect of the sentence.
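PyTorch ships a multi-head attention module, so a quick usage sketch looks like this; the sizes follow the original paper's d_model = 512 with 8 heads, while the batch size and sequence length are arbitrary:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)      # (batch, tokens, d_model)
out, weights = mha(x, x, x)          # self-attention: query = key = value = x
print(out.shape, weights.shape)      # (2, 10, 512) and (2, 10, 10)
```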
3. Why Transformers?
Transformer: Model Architecture
Intuitively, the model does exactly what we discussed before: in the encoder, tokens communicate with each other and update their representations; in the decoder, a target token first looks at previously generated target tokens, then at the source, and finally updates its representation. This happens in several layers, six in the original model.
4. How Each Transformer Component Works
• Feed-forward blocks
In addition to attention, each layer has a feed-forward network (FFN) block: two linear layers with a ReLU non-linearity between them. After looking at other tokens via the attention mechanism, the model uses the FFN block to process this new information (attention: "look at other tokens and gather information"; FFN: "take a moment to think and process this information").
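A minimal sketch of such a block, using the original paper's sizes (d_model = 512 with an inner dimension of 2048); the class name is illustrative:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: two linear layers with a ReLU in between,
    applied to each token independently."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):            # x: (B, T, d_model)
        return self.net(x)
```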
• Residual connections
Residual connections are very simple (add a block's input to its output), but at the same time very useful: they ease the gradient flow through the network and allow stacking many layers.
In the Transformer, residual connections are used after each attention and FFN block. In the architecture diagram, they appear as arrows going around a block into the "Add & Norm" layer; the "Add" part stands for the residual connection.
• Layer Normalization
The "Norm" part in the "Add & Norm" layer denotes Layer Normalization. It independently normalizes vector representation of each example in batch - this is done to control "flow" to the next layer. Layer normalization improves convergence stability and sometimes even quality.
In the Transformer, you have to normalize vector representation of each token. Additionally, here LayerNorm has trainable parameters, scale and bias, which are used after normalization to rescale layer's outputs (or the next layer's inputs).
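Putting the last two pieces together, a minimal "Add & Norm" wrapper might look like this (a sketch, not the reference implementation; the class name is made up):

```python
import torch.nn as nn

class AddNorm(nn.Module):
    """'Add & Norm': add the sublayer's input to its output (residual connection),
    then apply LayerNorm over each token's feature vector."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # trainable scale and bias per feature

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)
```

In an encoder layer, such a wrapper would be applied once around the self-attention block and once around the FFN block.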
• Positional encoding
Note that since the Transformer contains neither recurrence nor convolution, it does not know the order of the input tokens. Therefore, we have to tell the model the positions of the tokens explicitly. For this, there are two sets of embeddings: one for tokens (as usual) and one for positions (the new ingredient this model needs). The input representation of a token is then the sum of the two embeddings: token plus positional.
The positional embeddings can be learned, but the authors found that using fixed (sinusoidal) ones does not hurt quality.
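A sketch of those fixed sinusoidal encodings: even dimensions use a sine and odd dimensions a cosine, at geometrically spaced frequencies (the function name is illustrative, and d_model is assumed to be even):

```python
import math
import torch

def sinusoidal_positions(max_len, d_model):
    """Fixed positional encodings: pe[p, 2i] = sin(p / 10000^(2i/d_model)),
    pe[p, 2i+1] = cos(p / 10000^(2i/d_model))."""
    pos = torch.arange(max_len).unsqueeze(1).float()                  # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                         # (max_len, d_model)

# Input representation = token embedding + positional encoding, e.g.:
# x = token_emb(ids) + sinusoidal_positions(ids.size(1), d_model)
```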
5. How GPT-1 Is Trained from Scratch
1. Data Collection: Gather a large corpus of text data from various sources such as books, articles, websites, etc. The data should be diverse and representative of the language GPT-1 is intended to understand and generate.
2. Preprocessing: Clean and preprocess the text data to remove any noise, irrelevant information, or formatting inconsistencies. This might involve tasks like tokenization, lowercasing, removing special characters, and splitting the text into smaller units like sentences or paragraphs.
3. Tokenization: Convert the text data into a format that the model can understand, typically a sequence of tokens or words. Each unique token represents a word or subword in the vocabulary.
4. Model Architecture: Design the architecture of GPT-1, which typically consists of a transformer-based neural network. GPT-1 specifically uses a transformer decoder architecture.
5. Initialization: Initialize the parameters of the model. In the case of GPT-1, the parameters are initialized randomly, since there are no earlier pre-trained weights to start from, unlike later versions of the model.
6. Training Objective: Define the training objective, which is typically to minimize a loss function such as cross-entropy loss. The model is trained to predict the next token in a sequence given the previous tokens.
7. Training Procedure: Train the model on the preprocessed text data using backpropagation and a stochastic gradient method (GPT-1 used the Adam optimizer). Training iteratively updates the model parameters to minimize the defined loss function; a minimal sketch of one such training step appears after this list.
8. Hyperparameter Tuning: Tune various hyperparameters such as learning rate, batch size, and model architecture to optimize the performance of the model on a validation set.
9. Evaluation: Evaluate the performance of the trained model on a held-out test set to assess its ability to generate coherent and contextually relevant text.
10. Fine-Tuning (Optional): Optionally, fine-tune the trained model on specific tasks or domains by continuing training on task-specific data with a task-specific objective.
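To tie steps 6 and 7 together, here is a minimal, hedged sketch of one language-modeling training step. Here `model` stands for any decoder that maps (batch, time) token ids to (batch, time, vocab) logits; none of the names or hyperparameters come from the GPT-1 paper:

```python
import torch
import torch.nn as nn

def train_step(model, batch_ids, optimizer):
    """One next-token-prediction step: predict token t+1 from tokens <= t."""
    inputs, targets = batch_ids[:, :-1], batch_ids[:, 1:]    # shift by one position
    logits = model(inputs)                                    # (B, T-1, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                  # flatten to (B*(T-1), vocab)
        targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Running this step over many batches of tokenized text, with a tuned learning rate and batch size, is the core of the pre-training loop described above.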