The Transformer Tree and Its Prime Yield: LLM (Large Language Model)

Introduction

Imagine a giant tree standing tall in a vast field. This tree represents the Transformer architecture, a groundbreaking innovation in artificial intelligence. Now, picture the most delicious, juicy fruit hanging from its branches. This fruit symbolizes Large Language Models (LLMs), the best product of the Transformer tree. In this blog, we'll explore what LLMs and Transformers are, why the Transformer architecture is so vital for LLMs, and how these technologies are shaping our world with real-life examples.

What is a Transformer?

A Transformer is a specific type of neural network architecture that has revolutionized how LLMs are built. Introduced in a 2017 paper titled "Attention is All You Need" by Vaswani et al., the Transformer architecture has become the backbone of many state-of-the-art language models.

Why Transformer?

Consider the process of reading a complex novel. Instead of tackling one sentence at a time, you'd examine the entire page for a comprehensive understanding. This mirrors the operation of Transformers in machine learning—examining entire sequences in data input as a whole. This holistic approach allows Transformers to grasp the context and relationships between words more effectively, leading to superior performance in language-related tasks.

Unlike RNNs, GRUs, and LSTMs that process data sequentially (word by word or moment by moment), Transformers simultaneously assess the entire data sequence, providing a richer understanding of the context.

For example, consider the task of translating these sentences:

  1. Sentence 1: The bank of the river.
  2. Sentence 2: Money in the bank.

An RNN, processing word by word, might mistake which sense of "bank" is meant, a riverbank or a financial institution, which can affect the accuracy of the translation. In contrast, a Transformer processes the entire sentence at once and understands that in the second sentence the token "bank" refers to a financial institution because of the surrounding words (such as "Money"), leading to a more accurate translation.

LLMs such as GPT, LLaMA, and BART excel in such tasks because they understand context across the entire sequence of words, resulting in more accurate and contextually relevant translations.

RNNs can be slow with long sequences, struggling to retain distant information. Transformers, however, are faster and more efficient, making them superior for handling large datasets and complex tasks like text generation, language translation, and paraphrasing text.
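To make the "bank" example concrete, here is a minimal sketch that extracts the contextual vector a Transformer produces for "bank" in each sentence and compares them. It assumes the Hugging Face `transformers` library, PyTorch, and the `bert-base-uncased` checkpoint, none of which is prescribed by this post; it is an illustration, not the method of any particular LLM.

```python
# Minimal sketch: the same word "bank" gets different vectors in different contexts.
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("The bank of the river.")
v_money = bank_vector("Money in the bank.")
similarity = torch.cosine_similarity(v_river, v_money, dim=0).item()
print(f"Cosine similarity between the two 'bank' vectors: {similarity:.2f}")
```

A similarity noticeably below 1.0 indicates that the model encodes the two occurrences of "bank" differently, which is exactly the context sensitivity described above.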


Now let's briefly understand the architecture of the Transformer.

Transformer Architecture

If you’ve seen the architecture of a transformer model, you may have jumped in awe like I did the first time I saw it; it looks quite complicated! However, when you break it down into its most important parts, it’s not so bad. The transformer has five main parts:

  • Tokenization
  • Embedding
  • Positional encoding
  • Transformer block (several of these)
  • Softmax

The fourth one, the transformer block, is the most complex of all. Many of these can be concatenated, and each one contains two main parts: the attention component and the feedforward component.

The architecture of the transformer

Let’s study these parts one by one.

Tokenization

Tokenization is the most basic step. The model relies on a large library (vocabulary) of tokens, covering words, word pieces, punctuation signs, and so on. The tokenization step takes every word, prefix, suffix, and punctuation sign in the input and maps it to a known token from this library.

Tokenization: Turning words into tokens

For example, if the sentence is “Write a story.”, then the 4 corresponding tokens will be <Write>, <a>, <story>, and <.>.
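For illustration, here is a toy, word-level tokenizer in Python with a hypothetical four-entry vocabulary. Real LLMs use subword tokenizers (such as BPE or WordPiece) with vocabularies of tens of thousands of tokens, so treat this only as a sketch of the idea.

```python
# Toy word-level tokenizer, for illustration only.
vocab = {"Write": 0, "a": 1, "story": 2, ".": 3}   # hypothetical miniature vocabulary

def tokenize(text):
    # Separate the final period and map every piece to a known token id.
    pieces = text.replace(".", " .").split()
    return [(piece, vocab[piece]) for piece in pieces]

print(tokenize("Write a story."))
# [('Write', 0), ('a', 1), ('story', 2), ('.', 3)]
```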

Embedding

Once the input has been tokenized, it’s time to turn words into numbers. For this, we use an embedding. A text embedding sends every piece of text to a vector (a list) of numbers. If two pieces of text are similar, then the numbers in their corresponding vectors are similar to each other (componentwise, meaning each pair of numbers in the same position is similar). Conversely, if two pieces of text are different, then the numbers in their corresponding vectors are different.

For example, if the sentence we are considering is “Write a story.” and the tokens are <Write>, <a>, <story>, and \<.>, then each one of these will be sent to a long vector, and we’ll have four vectors.

In general, the embedding sends every word/token to a list of numbers.
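In code, an embedding is essentially a lookup table: a matrix with one row (one vector) per token in the vocabulary. Here is a minimal NumPy sketch continuing the toy example above; the sizes are made up for illustration, and a real model learns this matrix during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 4, 8                                  # toy sizes for illustration
embedding_matrix = rng.normal(size=(vocab_size, d_model))   # learned in a real model

token_ids = [0, 1, 2, 3]                    # <Write>, <a>, <story>, <.>
embeddings = embedding_matrix[token_ids]    # one 8-dimensional vector per token
print(embeddings.shape)                     # (4, 8)
```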

Positional encoding

Once we have the vectors corresponding to each of the tokens in the sentence, the next step is to turn all of these into one vector to process. The most common way to turn a bunch of vectors into one vector is to add them componentwise, meaning we add each coordinate separately. For example, if the vectors (of length 2) are [1,2] and [3,4], their sum is [1+3, 2+4], which equals [4, 6].

This can work, but there’s a small caveat. Addition is commutative, meaning that if you add the same numbers in a different order, you get the same result. In that case, the sentence “I’m not sad, I’m happy” and the sentence “I’m not happy, I’m sad” will result in the same vector, given that they have the same words, only in a different order. This is not good. Therefore, we must come up with a method that gives us a different vector for the two sentences.

Several methods work, and we’ll go with one of them: positional encoding. Positional encoding consists of adding a sequence of predefined vectors to the embedding vectors of the words. This ensures we get a unique vector for every sentence, and sentences with the same words in a different order will be assigned different vectors. In the example below, the vectors corresponding to the words “Write”, “a”, “story”, and “.” become modified vectors that carry information about their position, labeled “Write (1)”, “a (2)”, “story (3)”, and “. (4)”.
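One common choice of predefined vectors is the sinusoidal encoding from the original paper. Here is a short sketch, continuing the toy 4-token, 8-dimensional example from above (those sizes are an assumption of this post, not a requirement of the method):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sine/cosine positional encoding from "Attention Is All You Need".
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions
    return encoding

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))            # toy embeddings for the four tokens
position_aware = embeddings + positional_encoding(4, 8)   # componentwise addition
print(position_aware.shape)                     # (4, 8)
```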

Now that we know we have a unique vector corresponding to the sentence, and that this vector carries the information on all the words in the sentence and their order, we can move to the next step.

Transformer block

Let’s recap what we have so far. The words come in and get turned into tokens (tokenization), the tokens are turned into numbers (embeddings), and then the order gets taken into account (positional encoding). This gives us a vector for every token that we input to the model. Now, the next step is to predict the next word in the sentence. This is done with a very large neural network, which is trained precisely with the goal of predicting the next word in a sentence.

We can train such a large network, but we can vastly improve it by adding a key step: the attention component. Introduced in the seminal paper “Attention is All You Need”, it is one of the key ingredients in transformer models and one of the reasons they work so well. A full treatment of attention is beyond this post; for now, imagine it as a way to add context to each word in the text.

The attention component is added to every block of the feedforward network. Therefore, if you imagine a large feedforward neural network whose goal is to predict the next word, formed by several blocks of smaller neural networks, an attention component is added to each one of these blocks. Each of these units, called a transformer block, is then formed by two main parts:

  • The attention component.
  • The feedforward component.

The transformer is a concatenation of many transformer blocks.
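Here is a stripped-down NumPy sketch of a transformer block: single-headed attention and no layer normalization (both of which real models use), just to show the attention-then-feedforward structure with residual connections. All sizes and weights are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention: every token attends to every other token.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def feed_forward(x, W1, W2):
    # Position-wise feedforward network, applied to each token independently.
    return np.maximum(0, x @ W1) @ W2

def transformer_block(x, p):
    # Attention followed by feedforward, each with a residual (skip) connection.
    x = x + self_attention(x, p["Wq"], p["Wk"], p["Wv"])
    return x + feed_forward(x, p["W1"], p["W2"])

d_model, d_ff, seq_len = 8, 32, 4                 # toy sizes for illustration
rng = np.random.default_rng(0)
p = {"Wq": rng.normal(size=(d_model, d_model)),
     "Wk": rng.normal(size=(d_model, d_model)),
     "Wv": rng.normal(size=(d_model, d_model)),
     "W1": rng.normal(size=(d_model, d_ff)),
     "W2": rng.normal(size=(d_ff, d_model))}

x = rng.normal(size=(seq_len, d_model))           # position-aware token vectors
for _ in range(3):                                # a stack of 3 blocks (same toy weights reused)
    x = transformer_block(x, p)
print(x.shape)                                    # (4, 8)
```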

The Softmax Layer

Now that you know that a transformer is formed by many layers of transformer blocks, each containing attention and a feedforward layer, you can think of it as a large neural network that predicts the next word in a sentence. The transformer outputs scores for all the words, where the highest scores are given to the words that are most likely to be next in the sentence.

The last step of a transformer is a softmax layer, which turns these scores into probabilities (that add up to 1), where the highest scores correspond to the highest probabilities. We can then sample from these probabilities to pick the next word. In the example below, the transformer gives the highest probability of 0.5 to “Once”, and probabilities of 0.3 and 0.2 to “Somewhere” and “There”. Once we sample, the word “Once” is selected, and that’s the output of the transformer.

The softmax layer turns the scores into probabilities, and these are used to pick the next word in the text.
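A quick numerical sketch of that last step; the raw scores are made up so that the probabilities come out to roughly the 0.5 / 0.3 / 0.2 of the example:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())     # subtract the max for numerical stability
    return e / e.sum()

candidates = ["Once", "Somewhere", "There"]
scores = np.array([2.1, 1.6, 1.2])        # hypothetical raw scores from the transformer
probs = softmax(scores)                   # ~[0.50, 0.30, 0.20]
next_word = np.random.default_rng(0).choice(candidates, p=probs)
print(dict(zip(candidates, probs.round(2))), "->", next_word)
```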

Now what? Well, we repeat the step. We now input the text “Write a story. Once” into the model, and most likely, the output will be “upon”. Repeating this step again and again, the transformer will end up writing a story, such as “Once upon a time, there was a …”.
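In pseudo-Python, the whole generation loop looks like the sketch below, where `predict_next_word` is a hypothetical stand-in for the tokenize → embed → positional-encode → transformer blocks → softmax-and-sample pipeline described above, not a real API.

```python
def generate(prompt, predict_next_word, max_new_words=20):
    # Repeatedly feed the growing text back into the model, one word at a time.
    text = prompt
    for _ in range(max_new_words):
        text += " " + predict_next_word(text)
    return text

# generate("Write a story.", predict_next_word)
# -> "Write a story. Once upon a time, there was a ..."
```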


Conclusion

This blog post provides a primer on Large Language Models (LLMs) and the Transformer architecture that powers LLMs like GPT. LLMs have revolutionized natural language processing by generating coherent and fluent text, leveraging massive pre-training on vast text datasets. The Transformer architecture is the cornerstone of modern LLMs, enabling models like GPT to produce accurate and contextually relevant output.

With capabilities in text generation, summarization, and question-answering, LLMs such as GPT are opening new possibilities for human-machine interaction and communication. In summary, LLMs represent a significant advancement in natural language processing, promising to enhance human-machine interaction in exciting and transformative ways.


References

1. Vaswani et al. (Google), “Attention Is All You Need”, 2017

2. Hugging Face, “How do Transformers work?”

3. Cohere, “Transformer Models”
