How exactly does an LLM generate text?

Ivan Reznikov

PhD, Principal Data Scientist || TEDx/PyCon/GITEX Speaker || University Lecturer || LangChain, Large Language Models (LLMs) and Generative AI || 30K+ followers

Published: July 27, 2023

This article won't discuss transformers or how large language models are trained. Instead, we will concentrate on using a pre-trained model.

Let's look at the text generation overview:

1. The input text is passed to a tokenizer, which produces token_id outputs: each token is assigned a unique numerical identifier.
2. The tokenized input text is passed to the Encoder part of the pre-trained model. The Encoder processes the input and generates a feature representation that encodes the meaning and context of the input. The Encoder was trained on large amounts of data, and we benefit from that training.
3. The Decoder takes the feature representation from the Encoder and generates new text based on that context, token by token. It uses previously generated tokens to create new tokens.

Today, we'll concentrate on the third step: decoding and generating text. If you're interested in the first two steps, leave a comment and I'll consider covering those topics too.

Decoding the outputs

Let's dive a bit deeper. Say we want to generate the continuation of the phrase "Paris is the city ...". The model (we'll be using the Bloom-560m model; link to code in the comments) outputs logits for every token in its vocabulary. If you don't know what logits are, think of them as scores. Using the softmax function, these scores can be converted to probabilities of each token being selected for generation.
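The logits-to-probabilities step can be sketched in a few lines of plain Python. This is only an illustration: the five logit values below are made up for the example and are not the actual Bloom-560m outputs.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max logit keeps exp() numerically stable
    # and does not change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate tokens (illustrative values only).
tokens = ["of", "that", "where", "with", "in"]
logits = [3.0, 2.0, 1.5, 1.0, 0.5]

probs = softmax(logits)
for tok, p in zip(tokens, probs):
    print(f"{tok:>6}: {p:.3f}")
```

The token with the highest logit always ends up with the highest probability; softmax only rescales the scores so they can be treated as a distribution.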

If you look at the top 5 output tokens, they all make sense. We can generate the following phrases that sound legit:

Paris is the city of love.
Paris is the city that never sleeps.
Paris is the city where art and culture flourish.
Paris is the city with iconic landmarks.
Paris is the city in which history has a unique charm.

The challenge now is to select the appropriate token. And there are several strategies for that.

Greedy sampling

Simply put, in a greedy strategy, the model always chooses the token it believes is the most probable at each step — it doesn't consider other possibilities or explore different options. The model selects the token with the highest probability and continues generating text based on the selected choice.

Using a greedy strategy is computationally efficient and straightforward, but it can produce repetitive or overly deterministic outputs. Since the model only considers the most probable token at each step, it may not capture the full diversity of the context and language or produce the most creative responses. The strategy is short-sighted: it focuses solely on the most probable token at each step, disregarding the overall impact on the entire sequence.

Generated output: Paris is the city of the future. The
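The greedy strategy can be sketched as a simple loop. Note that TOY_MODEL below is a hypothetical lookup table standing in for a real language model: it maps the last token to made-up next-token probabilities, whereas a real model conditions on the whole context.

```python
# Hypothetical next-token probabilities keyed by the last token.
# A toy stand-in for a real model; the numbers are invented.
TOY_MODEL = {
    "city": {"of": 0.5, "that": 0.3, "where": 0.2},
    "of": {"the": 0.6, "love": 0.4},
    "the": {"future": 0.7, "world": 0.3},
    "future": {".": 1.0},
}

def greedy_decode(prompt_tokens, max_new_tokens=5):
    """Always append the single most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = TOY_MODEL.get(tokens[-1])
        if dist is None:  # the toy model has nothing to say: stop
            break
        next_token = max(dist, key=dist.get)  # argmax over probabilities
        tokens.append(next_token)
        if next_token == ".":  # treat the period as an end token
            break
    return tokens

print(" ".join(greedy_decode(["Paris", "is", "the", "city"])))
```

Because there is no randomness anywhere in the loop, the same prompt always yields the same continuation.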

Beam search

Beam search is another strategy used in text generation. In beam search, instead of keeping only the single most likely token at each step, the model keeps the top "k" most probable partial sequences. Each of these k candidate sequences is called a "beam."

At each step of text generation, the model expands every beam with its possible next tokens, scores the resulting sequences by their cumulative probability, and keeps only the k best.

This process continues until the generated text's desired length is reached or an "end" token is encountered for each beam. The model selects the sequence with the highest overall probability from all the beams as the final output.

From an algorithmic perspective, creating beams is expanding a k-ary tree. After the beams are created, you select the branch with the highest overall probability.

Generated output: Paris is the city of history and culture.
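Here is a minimal beam search sketch under simplifying assumptions. TOY_MODEL is again a hypothetical lookup table with invented probabilities, scores are summed log-probabilities, and there is no length normalization, which practical implementations often add.

```python
import math

# Hypothetical next-token probabilities keyed by the last token (toy data).
TOY_MODEL = {
    "city": {"of": 0.6, "that": 0.4},
    "of": {"love": 0.5, "light": 0.5},
    "that": {"never": 0.9, "always": 0.1},
    "love": {".": 1.0},
    "light": {".": 1.0},
    "never": {".": 1.0},
    "always": {".": 1.0},
}

def beam_search(prompt_tokens, beam_width=2, max_new_tokens=3):
    """Keep the beam_width most probable partial sequences at each step."""
    # Each beam is (cumulative log-probability, token list).
    beams = [(0.0, list(prompt_tokens))]
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            dist = TOY_MODEL.get(tokens[-1])
            if dist is None:  # finished sequence: carry it over unchanged
                candidates.append((score, tokens))
                continue
            for tok, p in dist.items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Prune: keep only the best beam_width candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]  # highest-scoring sequence

prompt = ["Paris", "is", "the", "city"]
print(" ".join(beam_search(prompt, beam_width=2)))
print(" ".join(beam_search(prompt, beam_width=1)))  # width 1 behaves greedily
```

In this toy setup the greedy choice "of" leads to a sequence with probability 0.6 × 0.5 = 0.30, while the initially less likely "that" leads to 0.4 × 0.9 = 0.36, so a beam width of 2 finds a sequence that the greedy strategy misses.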

Normal random sampling or direct use of probability

The idea is straightforward: you select the next token by drawing a random value and mapping it to the token it lands on. Imagine spinning a wheel where the area of each token is defined by its probability. The higher the probability, the greater the chance that the token gets selected. It is a computationally cheap solution, and because of the high degree of randomness, the sentences (or token sequences) will most probably be different every time.
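The wheel analogy maps directly onto a cumulative sum over the probabilities. The sketch below uses invented probabilities and a seeded random generator so the run is reproducible; the function name sample_token is just a label for this example.

```python
import random

def sample_token(probs, rng):
    """Spin the wheel: pick a token with chance proportional to its probability."""
    r = rng.random()  # uniform value in [0, 1)
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if r < cumulative:
            return token
    return token  # guard against floating-point round-off

# Illustrative probabilities (not real model outputs).
probs = {"of": 0.5, "that": 0.2, "where": 0.15, "with": 0.1, "in": 0.05}

rng = random.Random(0)
counts = {t: 0 for t in probs}
for _ in range(10_000):
    counts[sample_token(probs, rng)] += 1
print(counts)
```

Over many draws the counts approach the underlying probabilities, but any single draw can land on an unlikely token.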

Random sampling with Temperature

As you might recall, we've been using the softmax function to convert logits to probabilities. And here, we introduce temperature — a hyperparameter that affects the randomness of the text generation. Let's compare the activation functions to understand better how temperature affects our probability calculations.
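For reference, the two functions are usually written as follows, where z_i is the logit of token i and T is the temperature (the notation is assumed here for illustration):

```latex
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
\qquad\qquad
\mathrm{softmax}_T(z)_i = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}
```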

As you may notice, the difference is that each logit is divided by T inside the exponent. Higher values of temperature (e.g., 1.0) make the output more diverse, while lower values (e.g., 0.1) make it more focused and deterministic. In fact, T = 1 gives back the softmax function we used initially.
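To see the effect numerically, here is a small sketch that applies temperature to the same made-up logits at three settings. The function name softmax_with_temperature and the values are assumptions for the example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits (not real model outputs).
logits = [3.0, 2.0, 1.5, 1.0, 0.5]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T = 0.1 almost all of the probability mass sits on the top token, which approaches greedy behavior, while at T = 2.0 the distribution is noticeably flatter.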

Top-k sampling

We can now shift probabilities with temperature. Another enhancement is to sample from only the top k tokens rather than all of them. This increases the stability of text generation without decreasing creativity too much. Basically, it's now random sampling with temperature over only the top k tokens. The only possible issue is choosing the number k, and here is how we can do better.
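A minimal sketch of top-k sampling, assuming the probabilities have already been computed (with or without temperature). The probabilities are invented and top_k_sample is a name chosen for this example.

```python
import random

def top_k_sample(probs, k, rng):
    """Sample the next token from only the k most probable tokens."""
    # Rank tokens by probability and keep the top k.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [p for _, p in top]
    # choices() treats the weights as relative, so no explicit
    # renormalization is needed after dropping the tail.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Illustrative probabilities (not real model outputs).
probs = {"of": 0.5, "that": 0.2, "where": 0.15, "with": 0.1, "in": 0.05}

rng = random.Random(0)
draws = [top_k_sample(probs, k=2, rng=rng) for _ in range(1_000)]
print(sorted(set(draws)))
```

With k = 2 only "of" and "that" can ever be selected, no matter how many times we sample; the low-probability tail is cut off entirely.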

Nucleus sampling or top-p sampling

The distribution of token probabilities can vary greatly from step to step, which can lead to unexpected results during text generation.

Nucleus sampling is designed to address some limitations of other sampling techniques. Instead of specifying a fixed number of "k" tokens to consider, a probability threshold "p" is used. This threshold represents the cumulative probability that you want to include in the sampling. The model calculates the probabilities of all possible tokens at each step and then sorts them in descending order.

The model then adds tokens, from most to least probable, to the candidate set until the sum of their probabilities surpasses the specified threshold, and the next token is sampled from this set. The advantage of nucleus sampling is that it allows for more dynamic and adaptive token selection based on the context. The number of tokens selected at each step can vary depending on the probabilities of the tokens in that context, which can lead to more diverse and higher-quality outputs.
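The adaptive behavior is easiest to see by comparing a peaked distribution with a flat one. In this sketch the helper names nucleus and top_p_sample, as well as both distributions, are assumptions made up for the example.

```python
import random

def nucleus(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept

def top_p_sample(probs, p, rng):
    """Sample the next token from the nucleus only."""
    kept = nucleus(probs, p)
    tokens = [t for t, _ in kept]
    weights = [w for _, w in kept]  # relative weights; choices() rescales them
    return rng.choices(tokens, weights=weights, k=1)[0]

# Two illustrative distributions (not real model outputs).
peaked = {"of": 0.9, "that": 0.05, "where": 0.03, "with": 0.02}
flat = {"of": 0.3, "that": 0.25, "where": 0.25, "with": 0.2}

print(len(nucleus(peaked, 0.7)), len(nucleus(flat, 0.7)))
print(top_p_sample(flat, 0.7, random.Random(0)))
```

With the same threshold, the peaked distribution keeps a single token while the flat one keeps three, which is exactly the flexibility that a fixed k cannot offer.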

Conclusion

Decoding strategies are crucial in text generation, particularly when used with pre-trained language models. If you think about it, we have several ways to define probabilities, several ways to use those probabilities, and at least two ways to define how many tokens to take into account. I'm leaving a summary table below to wrap up the knowledge.

Temperature controls the randomness of token selection during decoding. Higher temperature boosts creativity, whereas lower temperature is more about coherence and structure. While embracing creativity allows for fascinating linguistic adventures, tempering it with stability ensures the elegance of the generated text.

I would appreciate your support if you've enjoyed the illustrations made and the article content. Until next time!

#largelanguagemodels #largelanguagemodel #llm #decoding #chatgpt #generativeai #textgeneration #machinelearning #datascience
