Navigating the GenAI Frontier: Transformers, GPT, and the Path to Accelerated Innovation
SOURCE: https://ai.plainenglish.io/llm-tutorial-10-t5-text-to-text-transfer-transformer-a464b38e5366


Historical Context: The Seq2Seq Paper and the NMT by Jointly Learning to Align & Translate Paper

Before 2013, various neural network architectures such as Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) gained popularity for processing different types of data - tabular, image, and sequential data (like text), respectively. While these Deep Neural Networks (DNNs) performed well with large labeled training sets, they encountered challenges including long training times, difficulty with long-term dependencies, and the inability to map sequences to sequences. To address these challenges, in 2014, three researchers from Google - Ilya Sutskever, Oriol Vinyals, and Quoc V. Le - proposed a solution in their paper "Sequence to Sequence Learning with Neural Networks."

Sequence to Sequence Learning with Neural Networks

  1. The paper introduces Seq2Seq models, which are neural network architectures designed to map input sequences to output sequences.
  2. Unlike traditional models that rely on fixed-length input-output mappings, Seq2Seq models can accommodate variable-length sequences, making them suitable for tasks such as machine translation, summarization, and question answering.
  3. At the heart of the Seq2Seq model lies the encoder-decoder architecture.
  4. The encoder processes the input sequence while retaining the hidden state and generates a fixed-length representation, often referred to as a context vector. This context vector encapsulates the representation of the entire sentence.
  5. Subsequently, the decoder utilizes this representation to generate the output sequence token by token.
  6. Both the encoder and decoder employ RNN/LSTM cells due to their proficiency in capturing sequential dependencies.
  7. This architecture performs well on shorter sentences but degrades as sequences grow longer (a minimal sketch follows this list).
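
To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch of a Seq2Seq model. It is illustrative only: the vocabulary sizes, dimensions, and single-layer LSTMs are assumptions, not the configuration used in the paper.

```python
# Minimal encoder-decoder sketch (illustrative; dimensions, vocabulary sizes,
# and the toy inputs are assumptions, not the paper's setup).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # The final (hidden, cell) pair acts as the fixed-length context vector.
        _, (hidden, cell) = self.lstm(self.embed(src))
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden, cell):
        # Every output token is conditioned on the encoder's context vector.
        output, (hidden, cell) = self.lstm(self.embed(tgt), (hidden, cell))
        return self.out(output), hidden, cell

src = torch.randint(0, 1000, (1, 7))   # a 7-token source "sentence"
tgt = torch.randint(0, 1000, (1, 5))   # a 5-token target prefix
hidden, cell = Encoder()(src)
logits, _, _ = Decoder()(tgt, hidden, cell)
print(logits.shape)                     # torch.Size([1, 5, 1000])
```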

Limitation: Although the model handles variable-length input and output sequences, it relies on a single fixed-length context vector to represent the entire input sequence, which can cause information loss, particularly for longer sequences.

To address the above limitation, three researchers - Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio - proposed the "Neural Machine Translation by Jointly Learning to Align and Translate" paper in 2015.

Neural Machine Translation by Jointly Learning to Align and Translate

  1. This paper introduced the concept of the Attention Mechanism.
  2. In contrast to traditional Neural Machine Translation (NMT) models, which encode the entire source sentence into a fixed-length context vector, the attention mechanism empowers the model to dynamically focus on different segments of the source sentence while generating the translation.
  3. The Attention Mechanism also tackles the challenge of learning alignment between input and output sequences. It allows the model to assign varying levels of importance to each word in the source sentence during translation.
  4. By adapting the attention weights dynamically, the model can prioritize relevant words and disregard irrelevant ones, leading to more precise translations.
  5. At each decoding step, the context vector is computed dynamically, indicating which encoder positions are expected to influence the current output the most.
  6. Simply put, the context vector is the weighted sum of the encoder's hidden states, and these weights are referred to as attention weights (see the sketch after this list).
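
The following sketch shows one additive (Bahdanau-style) attention step: alignment scores are computed between the current decoder state and every encoder hidden state, turned into attention weights with a softmax, and used to form the context vector as a weighted sum. All dimensions and layer names are illustrative assumptions.

```python
# Sketch of one additive attention step: the context vector is a weighted sum
# of encoder hidden states. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 128
W_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)   # projects encoder states
W_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)   # projects decoder state
v = nn.Linear(hidden_dim, 1, bias=False)                # scores each source position

encoder_states = torch.randn(1, 7, hidden_dim)   # 7 source positions
decoder_state = torch.randn(1, hidden_dim)       # current decoder hidden state

# Alignment scores: v^T tanh(W_dec * s + W_enc * h_j) for every position j
scores = v(torch.tanh(W_dec(decoder_state).unsqueeze(1) + W_enc(encoder_states)))
attn_weights = F.softmax(scores.squeeze(-1), dim=-1)            # (1, 7), sums to 1
context = torch.bmm(attn_weights.unsqueeze(1), encoder_states)  # (1, 1, 128)
print(attn_weights.sum(), context.shape)
```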

Limitation: While the attention mechanism improved translation quality for long input sentences, it did not resolve a significant underlying issue: the sequential nature of RNN computation, which keeps training slow and hard to parallelize.

Despite addressing the sequence-to-sequence challenge with the Seq2Seq architecture and attention mechanism, issues such as high training times and long-term dependencies persist.

To solve these issues, Transformers were introduced.

Introduction to Transformers (Paper: Attention Is All You Need)

Google introduced Transformers in 2017 as a sequence-to-sequence model primarily designed to tackle machine translation challenges. It comprises two key components: an encoder-decoder architecture and the attention mechanism.

  1. The Encoder is responsible for ingesting raw text, breaking it down into its fundamental elements, converting them into vectors, and employing self-attention to grasp the text's context.
  2. On the other hand, the Decoder specializes in text generation, utilizing a variant of attention known as cross-attention to predict the most likely next token (a small sketch of cross-attention follows this list).
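
A small sketch of cross-attention, assuming PyTorch's nn.MultiheadAttention: queries come from the decoder's states while keys and values come from the encoder's output, so each generated token can look back at the source sequence. The shapes below are illustrative.

```python
# Illustrative cross-attention: decoder queries attend over encoder output.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
encoder_output = torch.randn(1, 7, 64)    # 7 encoded source tokens
decoder_states = torch.randn(1, 3, 64)    # 3 target tokens generated so far

# Each decoder position queries the full encoder output before predicting.
out, weights = cross_attn(query=decoder_states, key=encoder_output, value=encoder_output)
print(out.shape, weights.shape)           # (1, 3, 64), (1, 3, 7)
```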

Transformers undergo training to address a specific NLP task known as Language Modeling.

How each Transformer component works

The Transformer architecture consists of both an encoder and a decoder. The Encoder excels at comprehending text, while the Decoder is proficient at generating text. The Transformer primarily depends on self-attention mechanisms and feed-forward neural networks.

Attention is a pivotal mechanism within the Transformer framework. It allocates varying weights to different segments of the input, enabling the model to prioritize crucial information during tasks such as translation or summarization. This dynamic allocation lets the model focus on the most relevant elements of the input, consequently enhancing its performance.

Source: the "Attention Is All You Need" paper

  1. Positional Encoding: To preserve the positional relationships between words in the input sequence without resorting to recurrence, the model introduces positional encodings. These encodings are added to the input embeddings, furnishing information about the position of each word within the sequence.
  2. Self-Attention Mechanism: The cornerstone innovation of the Transformer lies in its self-attention mechanism, permitting each word in the input sequence to attend to all other words within the sequence. This capability facilitates the capture of global dependencies and obviates the necessity for recurrent connections.
  3. Multi-Head Attention: The Transformer integrates multi-head attention mechanisms, wherein attention is computed multiple times concurrently with distinct learned linear projections. This approach empowers the model to simultaneously focus on diverse segments of the input sequence, thereby augmenting its capacity to apprehend varied patterns.
  4. Feed-Forward Network: The primary objective of the feed-forward network is to apply nonlinear transformations to input representations, aiding the model in capturing intricate patterns and relationships within the input sequences. This process enhances the richness of word or token representations in the input sequence.
  5. Skip/Residual Connections: Skip connections act as shortcuts that let information bypass certain layers, so crucial information from earlier layers is preserved and remains accessible to subsequent layers. This makes complex patterns easier to learn and optimize. (A compact encoder-block sketch tying these components together follows this list.)
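
The sketch below ties these components together in a single encoder block: sinusoidal positional encodings are added to the token embeddings, and multi-head self-attention and a feed-forward network each sit inside a skip connection followed by layer normalization. The sizes (model dimension, number of heads, sequence length) are illustrative assumptions, not the paper's configuration.

```python
# Compact sketch of one Transformer encoder block (illustrative sizes).
import math
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: every position attends to every other position.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)        # skip connection 1
        x = self.norm2(x + self.ff(x))      # skip connection 2
        return x

def positional_encoding(seq_len, d_model):
    # Sinusoidal position signal added to the embeddings.
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

tokens = torch.randn(1, 10, 64)              # 10 already-embedded tokens
x = tokens + positional_encoding(10, 64)     # inject position information
print(EncoderBlock()(x).shape)               # torch.Size([1, 10, 64])
```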

How is GPT-1 Trained from Scratch?

Need of GPT: The need for GPT arises from the wide array of tasks within natural language understanding, such as textual entailment, question answering, semantic similarity assessment, and document classification. Despite the abundance of large unlabeled text corpora, there is a shortage of labeled data for training models on these specific tasks. This scarcity makes it difficult for discriminatively trained models to achieve satisfactory performance. In response, OpenAI introduced GPT-1.


About GPT: GPT, short for Generative Pre-Trained Transformer, operates as an autoregressive model. This means it utilizes previously predicted tokens as input to predict subsequent tokens. GPT employs the decoder block of the Transformer architecture to forecast the next token in a sequence, enabling it to generate coherent text.

During each iteration, GPT starts with an initial sequence and predicts the next most probable token for it. Following this prediction, the sequence and the predicted token are combined and forwarded as input to predict the subsequent token, and so forth. This iterative process continues until either the model predicts the [end] token or the maximum input size is reached.
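
Here is a minimal sketch of that loop. The `model` and `tokenizer` objects are hypothetical stand-ins (their names and interfaces are assumptions, not GPT-1's actual API); the loop itself just illustrates greedy autoregressive decoding.

```python
# Sketch of autoregressive generation with a hypothetical model and tokenizer.
import torch

def generate(model, tokenizer, prompt, max_len=50, end_token_id=0):
    tokens = tokenizer.encode(prompt)                 # initial sequence
    while len(tokens) < max_len:
        logits = model(torch.tensor([tokens]))        # scores for every position
        next_id = int(logits[0, -1].argmax())         # most probable next token
        if next_id == end_token_id:                   # stop at the [end] token
            break
        tokens.append(next_id)                        # feed the prediction back in
    return tokenizer.decode(tokens)
```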

Training of GPT: GPT-1 underwent training using an extensive corpus of text sourced from diverse genres, comprising over 7000 unique unpublished books. The raw text was cleaned and standardized for punctuation and whitespace using the ftfy library. Additionally, the spaCy tokenizer was employed for further preprocessing of the data.

Following preprocessing, this substantial dataset was utilized to train a 12-layer decoder-only transformer model. The transformer utilized masked self-attention heads to enhance its learning capabilities. For optimization during training, the Adam optimizer was employed with a maximum learning rate set at 2.5e-4. Furthermore, the activation function utilized in the model was the Gaussian Error Linear Unit (GELU).
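
For reference, the hyperparameters mentioned above, collected into a small configuration sketch (the actual training loop, learning-rate schedule, and data pipeline are omitted):

```python
# Configuration sketch summarizing the GPT-1 training details described above.
from dataclasses import dataclass

@dataclass
class GPT1Config:
    n_layers: int = 12                      # 12-layer decoder-only Transformer
    attention: str = "masked self-attention"
    optimizer: str = "Adam"
    max_learning_rate: float = 2.5e-4
    activation: str = "GELU"                # Gaussian Error Linear Unit

print(GPT1Config())
```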

References

  1. https://towardsdatascience.com/large-language-models-gpt-1-generative-pre-trained-transformer-7b895f296d3b
  2. I suggest checking out Kanav Bansal's GitHub repo, which assisted me in crafting this blog post.


I'd like to extend my gratitude to Innomatics Research Labs for providing me with an enriching internship experience. Additionally, special thanks to Kanav Bansal for his invaluable mentorship on Generative AI topics, which has been immensely beneficial for students.

