How exactly does an LLM generate text?
Ivan Reznikov
PhD, Principal Data Scientist || TEDx/PyCon/GITEX Speaker || University Lecturer || LangChain, Large Language Models (LLMs) and Generative AI || 30K+ followers
This article won't discuss transformers or how large language models are trained. Instead, we will concentrate on using a pre-trained model.
Let's look at the text generation overview.
Today, we'll concentrate on the third step: decoding and generating text. If you're interested in the first two steps, leave a comment, and I'll consider covering those topics too.
Decoding the outputs
Let's dive a bit deeper. Say we want to generate the continuation of the phrase "Paris is the city ...". The model (we'll be using the Bloom-560m model; link to code in the comments) outputs logits for all the tokens in its vocabulary (if you don't know what logits are, consider them scores). Using the softmax function, these logits can be converted to probabilities of each token being selected for generation.
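To make the logits-to-probabilities step concrete, here is a minimal sketch of softmax in plain Python. The tokens and logit values are invented for illustration; they are not actual Bloom-560m outputs.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate tokens and logits (assumed values, not model output)
tokens = ["of", "that", "where", "in", "with"]
logits = [4.0, 2.5, 2.0, 1.5, 1.0]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token!r}: {p:.3f}")
```

The token with the highest logit also gets the highest probability; softmax only rescales the scores so they can be read as a distribution.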
If you look at the top 5 output tokens, they all make sense. We can generate the following phrases that sound legit:
The challenge now is to select the appropriate token. And there are several strategies for that.
Greedy sampling
Simply put, in a greedy strategy, the model always chooses the token it believes is the most probable at each step — it doesn't consider other possibilities or explore different options. The model selects the token with the highest probability and continues generating text based on the selected choice.
Using a greedy strategy is computationally efficient and straightforward, but it comes at the cost of occasionally repetitive or overly deterministic outputs. Since the model only considers the most probable token at each step, it may not capture the full diversity of the context and language or produce the most creative responses. This short-sightedness means the model focuses solely on the most probable token at each step and disregards the overall impact on the entire sequence.
Generated output: Paris is the city of the future. The
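A single greedy step can be sketched in a few lines. The probabilities below are made up for illustration; in practice they would come from the model's softmax output.

```python
def greedy_pick(tokens, probs):
    """Return the single most probable token; no randomness is involved."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return tokens[best_index]

# Hypothetical next-token probabilities for "Paris is the city ..."
tokens = ["of", "that", "where", "in", "with"]
probs = [0.60, 0.15, 0.12, 0.08, 0.05]

next_token = greedy_pick(tokens, probs)
print(next_token)
```

Because the choice is a plain argmax, running this twice on the same probabilities always returns the same token, which is where the deterministic behavior comes from.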
Beam search
Beam search is another strategy used in text generation. In beam search, instead of just considering the most likely token at each step, the model keeps the top "k" most probable candidate sequences. Each of these k candidates is called a "beam."
At each step of text generation, the model expands every beam with its possible next tokens, keeps track of the cumulative probability of each resulting sequence, and retains only the k most probable ones.
This process continues until the generated text's desired length is reached or an "end" token is encountered for each beam. The model selects the sequence with the highest overall probability from all the beams as the final output.
From an algorithmic perspective, creating beams is like expanding a k-ary tree. After the beams are created, you select the branch with the highest overall probability.
Generated output: Paris is the city of history and culture.
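Here is a rough sketch of beam search over a toy lookup table. TOY_MODEL is a hypothetical stand-in for a real language model, and its probabilities are invented; scores are summed as log-probabilities, which is a common choice but an assumption here.

```python
import math

# Hypothetical toy "model": maps the last token to next-token probabilities
TOY_MODEL = {
    "city": {"of": 0.5, "where": 0.4, "that": 0.1},
    "of": {"the": 0.4, "light": 0.3, "love": 0.3},
    "where": {"dreams": 0.9, "people": 0.1},
    "that": {"never": 1.0},
}

def beam_search(start, beam_width, steps):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            next_probs = TOY_MODEL.get(seq[-1])
            if next_probs is None:
                candidates.append((seq, score))  # no continuation: keep as is
                continue
            for token, p in next_probs.items():
                candidates.append((seq + [token], score + math.log(p)))
        # Prune: only the most probable sequences survive to the next step
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_seq, best_score = beam_search("city", beam_width=2, steps=2)[0]
print(" ".join(best_seq), round(math.exp(best_score), 2))
```

With beam_width=1 this reduces to the greedy strategy and ends at "city of the" (probability 0.2). With beam_width=2 the second beam survives the first step and wins overall with "city where dreams" (probability 0.36), even though "where" was not the top token initially.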
Normal random sampling or direct use of probability
The idea is straightforward: you select the next word by choosing a random value and mapping it to the token that gets picked. Imagine it as spinning a wheel, where the area of each token is defined by its probability. The higher the probability, the greater the chance the token will be selected. It is a relatively cheap computational solution, and due to the high degree of randomness, the sentences (or token sequences) will most probably be different every time.
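The wheel analogy can be sketched as follows: draw a random number between 0 and 1 and walk through the cumulative probabilities until you pass it. The probabilities are illustrative, and the fixed seed is there only to make the sketch reproducible.

```python
import random

def sample_token(tokens, probs, rng):
    """Spin the 'wheel': draw r in [0, 1) and walk the cumulative probabilities."""
    r = rng.random()
    cumulative = 0.0
    for token, p in zip(tokens, probs):
        cumulative += p
        if r < cumulative:
            return token
    return tokens[-1]  # guard against floating-point rounding

# Hypothetical next-token probabilities (assumed values)
tokens = ["of", "that", "where", "in", "with"]
probs = [0.60, 0.15, 0.12, 0.08, 0.05]

rng = random.Random(0)  # seeded only so the sketch is reproducible
samples = [sample_token(tokens, probs, rng) for _ in range(1000)]
for token in tokens:
    print(token, samples.count(token))
```

Over many draws the counts roughly follow the probabilities, but any single draw can land on an unlikely token.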
Random sampling with Temperature
As you might recall, we've been using the softmax function to convert logits to probabilities. And here, we introduce temperature — a hyperparameter that affects the randomness of the text generation. Let's compare the activation functions to understand better how temperature affects our probability calculations.
As you may notice, the difference is the division by T: each logit is divided by the temperature before exponentiation. Higher values of temperature (e.g., 1.0) make the output more diverse, while lower values (e.g., 0.1) make it more focused and deterministic. In fact, T = 1 gives back the softmax function we used initially.
Top-k sampling
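A small sketch of softmax with temperature shows the effect on a fixed set of logits. The logit values are assumed for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)"""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits (assumed values, not model output)
logits = [4.0, 2.5, 2.0, 1.5, 1.0]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability mass collapses onto the top token, while a higher temperature flattens the distribution and gives the other tokens a real chance.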
We can now shift probabilities with temperature. Another enhancement is to use the top k tokens rather than all of them. This will increase the stability of the text generation without decreasing creativity too much. Basically, it's now random sampling with temperature over only the top k tokens. The only possible issue is choosing the number k, and here is how we can improve on that.
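Top-k sampling with temperature might look like the sketch below: keep the k highest-scoring tokens, apply the temperature-scaled softmax to them, and sample. Token names, logits, and the helper name top_k_sample are all illustrative.

```python
import math
import random

def top_k_sample(tokens, logits, k, temperature, rng):
    """Sample one token from the k highest-scoring tokens only."""
    # Rank tokens by logit and keep the best k
    ranked = sorted(zip(tokens, logits), key=lambda pair: pair[1], reverse=True)[:k]
    kept_tokens = [token for token, _ in ranked]
    # Temperature-scaled softmax weights over the kept tokens
    scaled = [logit / temperature for _, logit in ranked]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    # random.choices normalizes the weights for us
    return rng.choices(kept_tokens, weights=weights, k=1)[0]

# Hypothetical tokens and logits (assumed values)
tokens = ["of", "that", "where", "in", "with"]
logits = [4.0, 2.5, 2.0, 1.5, 1.0]

rng = random.Random(0)  # seeded only so the sketch is reproducible
picks = [top_k_sample(tokens, logits, k=3, temperature=1.0, rng=rng) for _ in range(10)]
print(picks)
```

Tokens outside the top k can never be picked, no matter how high the temperature is. With k=1 the function falls back to the greedy choice.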
Nucleus sampling or top-p sampling
The distribution of token probabilities can vary greatly from step to step, which can bring some unexpected results during text generation.
Nucleus sampling is designed to address some limitations of different sampling techniques. Instead of specifying a fixed number of "k" tokens to consider, a probability threshold "p" is used. This threshold represents the cumulative probability that you want to include in the sampling. The model calculates the probabilities of all possible tokens at each step and then sorts them in descending order.
The model keeps adding tokens to the candidate set until the sum of their probabilities reaches the specified threshold, and then samples the next token from that set only. The advantage of nucleus sampling is that it allows for more dynamic and adaptive token selection based on the context. The number of tokens selected at each step can vary depending on the probabilities of the tokens in that context, which can lead to more diverse and higher-quality outputs.
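The sketch below shows how the size of the candidate set adapts to the shape of the distribution. Both distributions are invented for illustration, and the helper names nucleus and top_p_sample are hypothetical.

```python
import random

def nucleus(tokens, probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept

def top_p_sample(tokens, probs, p, rng):
    """Sample one token from the nucleus, weighted by the original probabilities."""
    kept = nucleus(tokens, probs, p)
    kept_tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(kept_tokens, weights=weights, k=1)[0]

tokens = ["of", "that", "where", "in", "with"]
peaked = [0.90, 0.04, 0.03, 0.02, 0.01]  # hypothetical: model is very sure
flat = [0.25, 0.22, 0.20, 0.18, 0.15]    # hypothetical: model is unsure

print(len(nucleus(tokens, peaked, 0.8)))  # 1 token is enough
print(len(nucleus(tokens, flat, 0.8)))    # 4 tokens are needed
print(top_p_sample(tokens, flat, 0.8, random.Random(0)))
```

With the same p = 0.8, the peaked distribution keeps a single token while the flat one keeps four. A fixed k cannot adapt this way.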
Conclusion
Decoding strategies are crucial in text generation, primarily when used with pre-trained language models. If you think about it, we have several ways to define probabilities, several ways to use those probabilities, and at least two ways to define how many tokens to take into account. I'm leaving a summary table below to wrap up the knowledge.
Temperature controls the randomness of token selection during decoding. Higher temperature boosts creativity, whereas lower temperature is more about coherence and structure. While embracing creativity allows for fascinating linguistic adventures, tempering it with stability ensures the elegance of the generated text.
I would appreciate your support if you've enjoyed the illustrations made and the article content. Until next time!