Understanding why Text Generators work using Decoder-only Transformers
Nitin Aggarwal
CEO & Senior Research Scientist at PradhVrddhi Research | Data Science & AI Solutions Provider | Helping Businesses Automate & Innovate
Disclaimer
This article is based on my own attempts to understand transformers, or more specifically text generators. I am not claiming that my knowledge or understanding is perfect. This is an attempt to open a discussion on text generators and to collectively understand them well enough that research can be done to improve them.
What is a text generator?
Imagine that you are having a conversation with someone. Every word you speak describes a piece of information that you are about to give. If you stop mid-sentence, the listener is going to ask a question (if they are actually listening). If one squints hard enough, eyes about 70% open, they will see a query in the form of the information being given. That query is essentially asking for information that is already known, so the generator is left with two tasks: identify the index, or key, where the query matches within the context and/or knowledge, and then return the value being sought. The retrieved value becomes the next word to be generated. Perfect! There is a catch, however: we don't want to build a query generator, a key-query matcher, a value retriever and parser, and lastly a next-word generator by hand. That's too much work! This is the job taken care of beautifully by decoder-only transformers.
Let's start at the end.
Why don't we build a model that takes as input a sequence of tokens and produces, as output, a generator for the next token?
What is a next token generator?
To have a distribution over tokens is to have a certain weight (a logit) attached to every known token, so that after normalization a new word can be sampled from the resulting distribution. We are essentially looking for a vector of weights whose size equals the size of our vocabulary of tokens; we will call this size the vocabulary size from now on. These vectors are what we will call next token generators from now on.
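As a rough illustration, here is a minimal sketch of that idea, assuming a made-up six-word vocabulary and arbitrary logit values: one weight per token, normalized with softmax, then sampled.

```python
import numpy as np

# A minimal sketch with a hypothetical six-token vocabulary and made-up logits.
# The "next token generator" is just a vector of weights, one per token;
# normalizing it turns it into a probability distribution we can sample from.
vocab = ["the", "cat", "sat", "on", "mat", "."]      # vocabulary size = 6
logits = np.array([1.2, 0.3, 2.5, -0.7, 0.1, 0.9])   # one weight per token

probs = np.exp(logits - logits.max())                # softmax normalization
probs /= probs.sum()

next_token = np.random.choice(vocab, p=probs)        # sample the next word
print(next_token)                                    # e.g. "sat"
```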
Maximum sequence length for next token generator
Let's rewind: we need a generator for a sequence of tokens of any length!? But do we actually need it for all lengths? We can settle for some maximum, right? If that is not enough, we can always build a bulkier model. Let us call this maximum the maximum sequence length, why not? Also, as humans we are not capable of remembering an arbitrary number of words from an ongoing conversation. We have a limit as well, and the little transformer guys are going to have an even more severe limit. On the other hand, as a conversation moves forward we keep processing the part that has concluded. So we are not really remembering an infinite sequence of words; we are folding the earlier part of the conversation into our knowledge. This particular functionality is what we choose to ignore for now. In short, input sequences are going to be capped at the maximum sequence length.
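A small sketch of what that cap means in practice, assuming tokens are plain integer indices and a hypothetical limit of 8:

```python
# Hypothetical model limit; real models use much larger values.
max_seq_len = 8

def clip_context(token_ids: list[int]) -> list[int]:
    """Keep only the most recent max_seq_len tokens of the conversation."""
    return token_ids[-max_seq_len:]

history = list(range(13))        # 13 tokens so far, more than the model accepts
print(clip_context(history))     # -> [5, 6, 7, 8, 9, 10, 11, 12]
```

The earlier tokens are simply dropped rather than being "processed into knowledge" the way a human would, which is exactly the functionality we chose to ignore above.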
All sequence lengths from 1 to maximum sequence length
If we want the text generator to work for input sequence lengths from one up to the maximum sequence length, then we will have to come up with a model that handles them all. How about an output where, if the input is a sequence of token indices, the model predicts the next word for every starting subsequence (prefix) of that input: for the sequence of one token (the first token) it produces a generator of the second token, for the sequence of the first and second tokens it produces a generator of the third token, for the sequence of the first to third tokens it produces a generator of the fourth token, and so on, until the maximum sequence length is reached. So, if an input of a certain length is given, the last output is the generator we need for that input length; we can conveniently ignore the previous generators (we already know those words).
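In terms of shapes, this is a minimal sketch of the idea, with a stub standing in for the real model: for an input of a given sequence length, the output is one row of logits per prefix, and only the last row matters when generating.

```python
import numpy as np

vocabulary_size = 1000   # made-up sizes for illustration

def model(token_ids: np.ndarray) -> np.ndarray:
    """Stub: pretend to produce logits for every prefix of the input.
    Row i is the generator for the token that follows tokens 0..i."""
    return np.random.randn(len(token_ids), vocabulary_size)

token_ids = np.array([42, 7, 256, 3])     # sequence length = 4
all_logits = model(token_ids)              # shape (4, vocabulary_size)
next_token_logits = all_logits[-1]         # the only generator we need at inference
print(all_logits.shape, next_token_logits.shape)
```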
How do we produce these logits?
If we consider what a generator is supposed to represent, it is a distribution over every available token, so that the best token can be found. The ingredients here are the universal (trained) knowledge about all available tokens in the vocabulary, as well as the knowledge gathered about the next word from the context given by the preceding sequence of tokens. This is basically a similarity question: we first define all the properties that the next token is supposed to have, and then find out how well each available token, based on its training, satisfies or resembles those properties.
What if we imagine each token having some fixed number of imaginary properties, all of which are convertible to numbers? This would embed every available token into an imaginary hidden space of a certain high dimension; the dimension of this hidden space will be called the hidden size from now on. Tokens are basically embedded as points in this hidden space. If we derive, through some logical deduction or some method, roughly where the target point should be in this high-dimensional space, we can create our logits by using cosine similarity or, better, a simple dot product. The problem then simplifies to multiplying a matrix of dimension sequence length times hidden size (one hidden-space point per prefix) by a big matrix of dimension hidden size times vocabulary size (one column per token), producing a generator at the end of every subsequence. We just take the last row of this product to get the required logits.
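A minimal sketch of that matrix product, with made-up sizes and random matrices standing in for learned values (in a real model the token-embedding matrix is trained and the hidden states come from the transformer body):

```python
import numpy as np

hidden_size, vocabulary_size, seq_len = 16, 1000, 4   # illustrative sizes

hidden_states = np.random.randn(seq_len, hidden_size)        # one point per prefix
unembedding = np.random.randn(hidden_size, vocabulary_size)  # one column per vocabulary token

logits = hidden_states @ unembedding   # (seq_len, vocabulary_size): dot product against every token
next_token_logits = logits[-1]         # last row = generator for the token after the full input
print(logits.shape, next_token_logits.shape)
```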
Representation of next word
Now the problem clearly turns into finding the properties of the required next word, so that it can be converted into a vector in the hidden space of trained knowledge. This is the job of a decoder-only transformer, which decodes the properties of the next word from the sequence of words given to it. The inner workings of the decoder-only transformer are to be studied later, but at its core it is an engine that associates with each query multiple key-value pairs. If the requirement of the next word, expressed through a sequence of tokens, is converted into a query, then matching that query against a key returns a value that also lives in the hidden space of knowledge. This process is repeated in parts and in sequence multiple times to ensure that the contextual knowledge vector for the next word is properly built in the hidden space. The reason it is executed in parts and in sequence multiple times is that there are constraints imposed by the language on the inter-relationships between tokens, as well as by the progress of the conversation in which information about various things was given. What decoder-only transformers do is expose these inter-relationships, allowing us to discover the association between the queries made and the keys to the information sought, enabling us to retrieve that information and so obtain the properties of the next word in the hidden space.
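A rough sketch of one such query-key-value matching step, with random matrices standing in for learned projections; this is only one small piece of what the body repeats "in parts and in sequence multiple times":

```python
import numpy as np

seq_len, hidden_size = 4, 16
x = np.random.randn(seq_len, hidden_size)        # stand-in token representations

# Random stand-ins for the learned query/key/value projections.
W_q, W_k, W_v = (np.random.randn(hidden_size, hidden_size) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

scores = q @ k.T / np.sqrt(hidden_size)          # how well each query matches each key
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                           # each position may only look backwards

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the allowed keys
retrieved = weights @ v                          # values pulled in for each query
print(retrieved.shape)                           # (4, 16): updated points in the hidden space
```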
Head and Body of Text Generators
There are two parts of a text generator studied in this article. The part that embeds the whole vocabulary into the hidden space is the head of the text generator; it essentially stores the information, or trained knowledge, about the individual tokens absorbed by the text generator. The body is the engine that studies the inter-relationships between the tokens appearing in a conversation and produces, in the hidden space of knowledge, a vector representation of the next word to be generated. The bodies showing the most prominent results are mostly decoder-only transformers, which we are going to study in detail in the next article.
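Tying the pieces together, here is a closing sketch of the head/body split, with a stub standing in for the decoder-only transformer body and the head being just the unembedding matrix from the earlier sketch:

```python
import numpy as np

hidden_size, vocabulary_size = 16, 1000
head = np.random.randn(hidden_size, vocabulary_size)   # trained knowledge about each token

def body(token_ids: np.ndarray) -> np.ndarray:
    """Stub body: would return one hidden-space vector per prefix of the input."""
    return np.random.randn(len(token_ids), hidden_size)

def generate_next(token_ids: np.ndarray) -> int:
    hidden = body(token_ids)            # body: context -> points in the hidden space
    logits = hidden[-1] @ head          # head: similarity against every vocabulary token
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(vocabulary_size, p=probs))

print(generate_next(np.array([42, 7, 256, 3])))
```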