Understanding why Text Generators work using Decoder-only Transformers
Nitin Aggarwal
CEO & Senior Research Scientist at PradhVrddhi Research | Data Science & AI Solutions Provider | Helping Businesses Automate & Innovate
Disclaimer
This article is based on my own attempts to understand transformers, or more specifically text generators. I am not claiming that my knowledge or understanding is perfect. This is an attempt to open a discussion on text generators and to collectively understand them well enough that research can be done to improve them.
What is a text generator?
Imagine that you are having a conversation with someone. Every word you speak describes a piece of information that you are about to give. If you stop mid-sentence, the listener is going to ask a question (if they are actually listening). If one squints hard enough, eyes about 70% open, they will see a query in the form of the information being given. That query is essentially asking for information that is already known, so the generator is left with two tasks: identify the index, or key, where the query matches within the context and/or knowledge, and then return the value being sought. The retrieved value becomes the next word to be generated. Perfect! There is a catch, however: we don't want to build a query generator, a key-query matcher, a value retriever and parser, and lastly a next-word generator by hand. That's too much work! This is the job taken care of beautifully by decoder-only transformers.
Let's start at the end.
Why don't we build a model that takes as input a sequence of tokens and produces, as output, a generator for the next token?
What is a next token generator?
To have a distribution over tokens is to have a certain weight (a logit) attached to every known token, so that after normalization a new word can be sampled from the resulting distribution. We are essentially looking for a vector of weights whose size equals the size of our vocabulary of tokens; we will call this size the vocabulary size from now on. These vectors are what we will call next token generators from now on.
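As a rough illustration, here is a minimal sketch of that idea, assuming a made-up six-word vocabulary and arbitrary logit values: one weight per token, normalized with softmax, then sampled.

```python
import numpy as np

# A minimal sketch with a hypothetical six-token vocabulary and made-up logits.
# The "next token generator" is just a vector of weights, one per token;
# normalizing it turns it into a probability distribution we can sample from.
vocab = ["the", "cat", "sat", "on", "mat", "."]      # vocabulary size = 6
logits = np.array([1.2, 0.3, 2.5, -0.7, 0.1, 0.9])   # one weight per token

probs = np.exp(logits - logits.max())                # softmax normalization
probs /= probs.sum()

next_token = np.random.choice(vocab, p=probs)        # sample the next word
print(next_token)                                    # e.g. "sat"
```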
Maximum sequence length for next token generator
Let's rewind: we need a generator for a sequence of tokens of any length!? But do we actually need it for all lengths? We can settle for some maximum, right? If that is not enough, we can always build a bulkier model. Let us call this maximum the maximum sequence length, why not? Also, as humans we are not capable of remembering an arbitrary number of words from an ongoing conversation. We have a limit as well, and the little transformer guys are going to have an even more severe limit. On the other hand, as a conversation moves forward we keep processing the part that has concluded. So we are not really remembering an infinite sequence of words; we are folding the earlier part of the conversation into our knowledge. This particular functionality is what we choose to ignore for now. In short, input sequences are going to be capped at the maximum sequence length.
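A small sketch of what that cap means in practice, assuming tokens are plain integer indices and a hypothetical limit of 8:

```python
# Hypothetical model limit; real models use much larger values.
max_seq_len = 8

def clip_context(token_ids: list[int]) -> list[int]:
    """Keep only the most recent max_seq_len tokens of the conversation."""
    return token_ids[-max_seq_len:]

history = list(range(13))        # 13 tokens so far, more than the model accepts
print(clip_context(history))     # -> [5, 6, 7, 8, 9, 10, 11, 12]
```

The earlier tokens are simply dropped rather than being "processed into knowledge" the way a human would, which is exactly the functionality we chose to ignore above.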
All sequence lengths from 1 to maximum sequence length
If we want the text generator to work for input sequence lengths from one up to the maximum sequence length, then we will have to come up with a model that handles them all. How about an output where, if the input is a sequence of token indices, the model predicts the next word for every starting subsequence (prefix) of that input: for the sequence of one token (the first token) it produces a generator of the second token, for the sequence of the first and second tokens it produces a generator of the third token, for the sequence of the first to third tokens it produces a generator of the fourth token, and so on, until the maximum sequence length is reached. So, if an input of a certain length is given, the last output is the generator we need for that input length; we can conveniently ignore the previous generators (we already know those words).
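In terms of shapes, this is a minimal sketch of the idea, with a stub standing in for the real model: for an input of a given sequence length, the output is one row of logits per prefix, and only the last row matters when generating.

```python
import numpy as np

vocabulary_size = 1000   # made-up sizes for illustration

def model(token_ids: np.ndarray) -> np.ndarray:
    """Stub: pretend to produce logits for every prefix of the input.
    Row i is the generator for the token that follows tokens 0..i."""
    return np.random.randn(len(token_ids), vocabulary_size)

token_ids = np.array([42, 7, 256, 3])     # sequence length = 4
all_logits = model(token_ids)              # shape (4, vocabulary_size)
next_token_logits = all_logits[-1]         # the only generator we need at inference
print(all_logits.shape, next_token_logits.shape)
```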
How do we produce these logits?
If we consider what a generator is supposed to represent, it is a distribution over every available token, so that the best token can be found. The ingredients here are the universal (trained) knowledge about all available tokens in the vocabulary, as well as the knowledge gathered about the next word from the context given by the preceding sequence of tokens. This is basically a similarity question: we first define all the properties that the next token is supposed to have, and then find out how well each available token, based on its training, satisfies or resembles those properties.
What if we imagine each token having some fixed number of imaginary properties, all of which are convertible to numbers? This would embed every available token into an imaginary hidden space of a certain high dimension; the dimension of this hidden space will be called the hidden size from now on. Tokens are basically embedded as points in this hidden space. If we derive, through some logical deduction or some method, roughly where the target point should be in this high-dimensional space, we can create our logits by using cosine similarity or, better, a simple dot product. The problem then simplifies to multiplying a matrix of dimension sequence length times hidden size (one hidden-space point per prefix) by a big matrix of dimension hidden size times vocabulary size (one column per token), producing a generator at the end of every subsequence. We just take the last row of this product to get the required logits.
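A minimal sketch of that matrix product, with made-up sizes and random matrices standing in for learned values (in a real model the token-embedding matrix is trained and the hidden states come from the transformer body):

```python
import numpy as np

hidden_size, vocabulary_size, seq_len = 16, 1000, 4   # illustrative sizes

hidden_states = np.random.randn(seq_len, hidden_size)        # one point per prefix
unembedding = np.random.randn(hidden_size, vocabulary_size)  # one column per vocabulary token

logits = hidden_states @ unembedding   # (seq_len, vocabulary_size): dot product against every token
next_token_logits = logits[-1]         # last row = generator for the token after the full input
print(logits.shape, next_token_logits.shape)
```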
Representation of next word
Now the problem clearly turns into finding the properties of the required next word, so that it can be converted into a vector in the hidden space of trained knowledge. This is the job of a decoder-only transformer, which decodes the properties of the next word from the sequence of words given to it. The inner workings of the decoder-only transformer are to be studied later, but at its core it is an engine that associates with each query multiple key-value pairs. If the requirement of the next word, expressed through a sequence of tokens, is converted into a query, then matching that query against a key returns a value that also lives in the hidden space of knowledge. This process is repeated in parts and in sequence multiple times to ensure that the contextual knowledge vector for the next word is properly built in the hidden space. The reason it is executed in parts and in sequence multiple times is that there are constraints imposed by the language on the inter-relationships between tokens, as well as by the progress of the conversation in which information about various things was given. What decoder-only transformers do is expose these inter-relationships, allowing us to discover the association between the queries made and the keys to the information sought, enabling us to retrieve that information and so obtain the properties of the next word in the hidden space.
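A rough sketch of one such query-key-value matching step, with random matrices standing in for learned projections; this is only one small piece of what the body repeats "in parts and in sequence multiple times":

```python
import numpy as np

seq_len, hidden_size = 4, 16
x = np.random.randn(seq_len, hidden_size)        # stand-in token representations

# Random stand-ins for the learned query/key/value projections.
W_q, W_k, W_v = (np.random.randn(hidden_size, hidden_size) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

scores = q @ k.T / np.sqrt(hidden_size)          # how well each query matches each key
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                           # each position may only look backwards

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the allowed keys
retrieved = weights @ v                          # values pulled in for each query
print(retrieved.shape)                           # (4, 16): updated points in the hidden space
```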
Head and Body of Text Generators
There are two parts of a text generator studied in this article. The part that embeds the whole vocabulary into the hidden space is the head of the text generator; it essentially stores the information, or trained knowledge, about the individual tokens absorbed by the text generator. The body is the engine that studies the inter-relationships between the tokens appearing in a conversation and produces, in the hidden space of knowledge, a vector representation of the next word to be generated. The bodies showing the most prominent results are mostly decoder-only transformers, which we are going to study in detail in the next article.
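Tying the pieces together, here is a closing sketch of the head/body split, with a stub standing in for the decoder-only transformer body and the head being just the unembedding matrix from the earlier sketch:

```python
import numpy as np

hidden_size, vocabulary_size = 16, 1000
head = np.random.randn(hidden_size, vocabulary_size)   # trained knowledge about each token

def body(token_ids: np.ndarray) -> np.ndarray:
    """Stub body: would return one hidden-space vector per prefix of the input."""
    return np.random.randn(len(token_ids), hidden_size)

def generate_next(token_ids: np.ndarray) -> int:
    hidden = body(token_ids)            # body: context -> points in the hidden space
    logits = hidden[-1] @ head          # head: similarity against every vocabulary token
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(vocabulary_size, p=probs))

print(generate_next(np.array([42, 7, 256, 3])))
```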