How exactly does an LLM generate text?

This article won't discuss transformers or how large language models are trained. Instead, we will concentrate on using a pre-trained model.

Let's look at an overview of text generation.

  1. The input text is passed to a tokenizer, which splits it into tokens and maps each token to a token_id, a unique numerical identifier.
  2. The tokenized input is passed to the Encoder part of the pre-trained model. The Encoder processes the input and generates a feature representation that encodes its meaning and context. The Encoder was trained on large amounts of data, which we benefit from.
  3. The Decoder takes the feature representation from the Encoder and generates new text based on that context, token by token. It uses the previously generated tokens to produce each new token.

Today, we'll concentrate on the third step: decoding and generating text. If you're interested in the first two steps, leave a comment and I'll consider covering those topics as well.

Decoding the outputs

Let's dive a bit deeper. Say we want to generate the continuation of the phrase "Paris is the city ...". The model (we'll be using the Bloom-560m model; link to code in the comments) outputs logits for every token in its vocabulary (if you don't know what logits are, think of them as scores). Using the softmax function, these logits can be converted to the probabilities of each token being selected for generation.
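To make the logits-to-probabilities step concrete, here is a minimal sketch of softmax in plain Python. The token list and logit values are invented for illustration and are not actual Bloom-560m outputs.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest logit keeps exp() from overflowing
    # and does not change the resulting probabilities.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate next tokens (assumed values).
tokens = ["of", "that", "where", "with", "in"]
logits = [4.0, 3.2, 2.9, 2.5, 2.1]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:>6}: {p:.3f}")
```

The ranking of the tokens is unchanged by softmax; only the scale changes, so the scores can be read as a probability distribution.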

If you look at the top 5 output tokens, they all make sense. We can generate the following phrases that sound legit:

  • Paris is the city of love.
  • Paris is the city that never sleeps.
  • Paris is the city where art and culture flourish.
  • Paris is the city with iconic landmarks.
  • Paris is the city in which history has a unique charm.

The challenge now is to select the appropriate token. And there are several strategies for that.

Greedy sampling

Simply put, in a greedy strategy, the model always chooses the token it believes is the most probable at each step — it doesn't consider other possibilities or explore different options. The model selects the token with the highest probability and continues generating text based on the selected choice.

Using a greedy strategy is computationally efficient and straightforward, but it can produce repetitive or overly deterministic outputs. Since the model only considers the most probable token at each step, it may not capture the full diversity of the context and language or produce the most creative responses. The strategy is also short-sighted: it focuses solely on the most probable token at each step and disregards the overall impact on the entire sequence.

Generated output: Paris is the city of the future. The
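The greedy loop can be sketched with a toy stand-in for the model. Everything in `TOY_MODEL` is hypothetical: it is a hand-written lookup table that depends only on the last token, whereas a real model such as Bloom-560m conditions on the whole sequence.

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
# The tokens and numbers are invented for illustration.
TOY_MODEL = {
    "city": {"of": 0.5, "that": 0.3, "where": 0.2},
    "of": {"the": 0.6, "love": 0.4},
    "the": {"future": 0.55, "city": 0.45},
    "future": {"<eos>": 1.0},
    "love": {"<eos>": 1.0},
}

def greedy_decode(prompt_tokens, max_new_tokens=5):
    """Repeatedly append the single most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_probs = TOY_MODEL.get(tokens[-1])
        if next_probs is None:
            break  # the toy model has no continuation for this token
        # Greedy choice: take the token with the highest probability.
        next_token = max(next_probs, key=next_probs.get)
        if next_token == "<eos>":
            break  # end-of-sequence marker: stop generating
        tokens.append(next_token)
    return tokens

result = greedy_decode(["Paris", "is", "the", "city"])
print(" ".join(result))
```

Because there is no randomness anywhere in the loop, running it twice on the same prompt always gives the same text.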

Beam search

Beam search is another strategy used in text generation. Instead of keeping only the most likely token at each step, the model keeps track of the "k" most probable candidate sequences. Each of these candidate sequences is called a "beam", and k is the beam width.

At each step of text generation, the model expands every beam with possible next tokens, scores the resulting sequences by their overall probability, and keeps only the k most probable ones.

This process continues until the generated text's desired length is reached or an "end" token is encountered for each beam. The model selects the sequence with the highest overall probability from all the beams as the final output.

From an algorithmic perspective, creating beams is like expanding a k-ary tree. After the beams are created, you select the branch with the highest overall probability.

Generated output: Paris is the city of history and culture.
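Here is a minimal sketch of beam search over a toy model. `TOY_MODEL` is a hypothetical lookup table with invented probabilities, and sequence scores are summed log-probabilities without length normalization, which is a simplifying assumption of this sketch.

```python
import math

# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "city": {"of": 0.5, "that": 0.4, "where": 0.1},
    "of": {"love": 0.4, "light": 0.35, "history": 0.25},
    "that": {"never": 0.9, "always": 0.1},
    "never": {"sleeps": 1.0},
    "sleeps": {"<eos>": 1.0},
    "love": {"<eos>": 1.0},
    "light": {"<eos>": 1.0},
    "history": {"<eos>": 1.0},
}

def beam_search(prompt_tokens, beam_width=2, max_new_tokens=4):
    """Keep the `beam_width` most probable sequences at every step."""
    # Each beam is (tokens, summed log-probability, finished flag).
    beams = [(list(prompt_tokens), 0.0, False)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score, finished in beams:
            next_probs = None if finished else TOY_MODEL.get(tokens[-1])
            if not next_probs:
                # Finished beams (or dead ends) are carried over unchanged.
                candidates.append((tokens, score, True))
                continue
            for token, p in next_probs.items():
                new_score = score + math.log(p)
                if token == "<eos>":
                    candidates.append((tokens, new_score, True))
                else:
                    candidates.append((tokens + [token], new_score, False))
        # Prune: keep only the highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(finished for _, _, finished in beams):
            break
    best_tokens, best_score, _ = beams[0]
    return best_tokens, math.exp(best_score)

best_tokens, best_prob = beam_search(["Paris", "is", "the", "city"])
print(" ".join(best_tokens), f"(probability {best_prob:.2f})")
```

In this toy setup, a greedy strategy would start with "of" (0.5) and end at "of love" with an overall probability of 0.2, while a beam width of 2 keeps "that" alive long enough to find "that never sleeps" at 0.36.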

Normal random sampling or direct use of probability

The idea is straightforward: you select the next token by drawing a random value and mapping it to the token whose probability range it falls into. Imagine it as spinning a wheel, where the area of each token is defined by its probability. The higher the probability, the more likely the token is to be selected. It is computationally cheap, and because of the high degree of randomness, the sentences (or token sequences) will most probably be different every time.
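The wheel analogy can be sketched as a walk over cumulative probabilities. The distribution below is invented for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

def sample_token(probs, rng):
    """Spin the wheel: return the token whose slice contains the random value."""
    r = rng.random()  # uniform value in [0, 1)
    cumulative = 0.0
    chosen = None
    for token, p in probs.items():
        chosen = token
        cumulative += p
        if r < cumulative:
            break
    # If rounding leaves r just above the final cumulative sum,
    # the loop ends naturally and the last token is returned.
    return chosen

# Hypothetical next-token probabilities (assumed values).
probs = {"of": 0.40, "that": 0.25, "where": 0.15, "with": 0.12, "in": 0.08}

rng = random.Random(0)  # seeded only for repeatability
counts = Counter(sample_token(probs, rng) for _ in range(10_000))
print(counts.most_common())
```

Over many draws, the counts approach the underlying probabilities, but any single draw can still land on an unlikely token.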

Random sampling with Temperature

As you might recall, we've been using the softmax function to convert logits to probabilities. Here, we introduce temperature, a hyperparameter that affects the randomness of the text generation. Let's compare the standard softmax with its temperature-scaled version to better understand how temperature affects our probability calculations.

The only difference is that each logit is divided by T before exponentiation: the probability of token i becomes exp(z_i / T) / sum_j exp(z_j / T), where z_i is the token's logit. Higher values of temperature (e.g., 1.0) make the output more diverse, while lower values (e.g., 0.1) make it more focused and deterministic. With T = 1, the formula reduces to the standard softmax we used initially.
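A short sketch shows how dividing the logits by T reshapes the distribution. The logits are invented for illustration, and `softmax_with_temperature` is a name chosen for this example only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate tokens (assumed values).
logits = [4.0, 3.2, 2.9, 2.5, 2.1]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At T = 0.1, almost all of the probability mass sits on the top token, so sampling behaves nearly like the greedy strategy; at T = 2.0, the distribution flattens and less likely tokens get picked more often.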

Top-k sampling

We can now shift probabilities with temperature. Another enhancement is to sample from only the top k tokens rather than all of them. This makes text generation more stable without reducing creativity too much. In essence, it is random sampling with temperature restricted to the k most probable tokens. The main difficulty is choosing the value of k, and the next strategy addresses that.
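A minimal sketch of top-k sampling, assuming the probabilities have already been computed (with or without temperature). The distribution and the helper names `top_k_filter` and `sample_top_k` are invented for this example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

def sample_top_k(probs, k, rng):
    """Sample the next token from the k most probable candidates only."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token probabilities (assumed values).
probs = {"of": 0.40, "that": 0.25, "where": 0.15, "with": 0.12, "in": 0.08}

filtered = top_k_filter(probs, 3)
print(filtered)

rng = random.Random(0)  # seeded only for repeatability
print([sample_top_k(probs, 3, rng) for _ in range(5)])
```

With k = 3, the tokens "with" and "in" can never be selected, however many times we sample.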

Nucleus sampling or top-p sampling

The shape of the token probability distribution can vary a lot from step to step, which can lead to unexpected results during text generation.

Nucleus sampling is designed to address some limitations of the other sampling techniques. Instead of specifying a fixed number of "k" tokens to consider, a probability threshold "p" is used. This threshold represents the cumulative probability that you want to include in the sampling. The model calculates the probabilities of all possible tokens at each step and then sorts them in descending order.

The model keeps adding tokens from this sorted list to the candidate set until the sum of their probabilities reaches the specified threshold, and then samples the next token from that set only. The advantage of nucleus sampling is that it allows for more dynamic and adaptive token selection based on the context. The number of tokens selected at each step can vary depending on the probabilities of the tokens in that context, which can lead to more diverse and higher-quality outputs.
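The following sketch shows how the size of the candidate set adapts to the distribution. Both distributions are invented for illustration, and the stopping rule used here (stop once the cumulative probability reaches or exceeds p) is one common convention, assumed for this example.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # threshold reached: stop adding tokens
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

# Hypothetical distributions (assumed values): one peaked, one flat.
peaked = {"of": 0.85, "that": 0.06, "where": 0.04, "with": 0.03, "in": 0.02}
flat = {"of": 0.30, "that": 0.25, "where": 0.20, "with": 0.15, "in": 0.10}

print(list(top_p_filter(peaked, 0.8)))
print(list(top_p_filter(flat, 0.8)))
```

With the same p = 0.8, the peaked distribution keeps a single token, while the flat one keeps four. A fixed k could not adapt in this way.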

Conclusion

Decoding strategies are crucial in text generation, particularly when used with pre-trained language models. If you think about it, we have several ways to define probabilities, several ways to use those probabilities, and at least two ways to define how many tokens to take into account. I'm leaving a summary table below to wrap up the knowledge.

Temperature controls the randomness of token selection during decoding. Higher temperature boosts creativity, whereas lower temperature is more about coherence and structure. While embracing creativity allows for fascinating linguistic adventures, tempering it with stability ensures the elegance of the generated text.

I would appreciate your support if you've enjoyed the illustrations and the article content. Until next time!

#largelanguagemodels #largelanguagemodel #llm #decoding #chatgpt #generativeai #textgeneration #machinelearning #datascience

Ramanathan Chokkalingam

Data Engineer @ A.P. Moller - Maersk | Integrated logistics| Data Analytics | Certified Agile Coach | PSM2 | IIMM | Lifelong Learner

11 months

Thanks, Ivan Reznikov for this great content and explanation

Tobias Häberlein

Professor für generative künstliche Intelligenz & Leiter Departement Informatik FFHS

1 year

Great article - thanks for sharing!

Sumit Ranjan

Data Science Manager | Author of Best Selling Book | AI Researcher | Developing Enterprise GenAI / LLM Products

1 year

Temperature does the magic. Great article