Top K, Top P, Temperature
Anuradha Mohapatra
Generative AI Consultant at HCL | Data Science | NLP | Machine Learning | Six Sigma Black Belt certified (ASQ)
Do you remember how, during the 90s in Indian households, we used to tune the TV antenna to get a better picture or sound? Keeping that nostalgic thought in mind, let's discuss a few parameters that influence LLM output.
While working with LLMs in playgrounds such as Gemini, Azure, and Hugging Face, you may have noticed controls/sliders for adjusting the output, labelled with parameters like max_tokens, top_k, or top_p. An LLM produces its answer by generating one next token at a time, so these parameters shape every word of the completion. Let's see how they influence the output the model generates.
Please don't confuse these with the model's training parameters (its weights), which are learned during training. These configuration parameters are set at inference time and give you control over things like how creative the output can be and the maximum number of tokens in the completion.
Max_new_tokens is probably the simplest of these parameters, and you can use it to limit the number of tokens that the model will generate.
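As a minimal sketch of how this looks in practice, here is an illustrative example using the Hugging Face transformers library; the model name and prompt are my own assumptions, not taken from any specific playground:

```python
# Illustrative sketch: limiting completion length with max_new_tokens.
# Model ("gpt2") and prompt are assumed purely for demonstration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The antenna on our old TV", return_tensors="pt")

# max_new_tokens caps how many tokens the model adds beyond the prompt.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```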
Let's discuss Top_k and Top_p.
At each step, the model produces a probability distribution over its entire vocabulary. (This is the output of the transformer's softmax layer; to understand how that output is produced, please read the "Attention Is All You Need" paper.)
Say, for example, a few of the candidate next words and their probabilities are:
Apple: 0.2, Orange: 0.1, Blueberry: 0.02, Banana: 0.03
The straightforward approach is greedy decoding: the model always chooses the word with the highest probability. However, that tends to produce repetitive sequences of words. Random sampling can be used instead to introduce some randomness, but it has to be balanced so the model still produces sensible output. Two settings, top_k and top_p, are sampling techniques we can use to limit the random sampling and increase the chance that the output stays sensible.
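Here is a rough sketch of the difference, using the toy probabilities above (they are an illustrative subset of the vocabulary, so they are renormalized before sampling):

```python
import numpy as np

# Toy next-token probabilities from the example above.
words = ["Apple", "Orange", "Blueberry", "Banana"]
probs = np.array([0.2, 0.1, 0.02, 0.03])
probs = probs / probs.sum()  # renormalize the illustrative subset

# Greedy decoding: always pick the most probable word -> can become repetitive.
greedy_choice = words[int(np.argmax(probs))]

# Random sampling: draw a word according to its probability -> adds variety.
sampled_choice = np.random.choice(words, p=probs)

print(greedy_choice, sampled_choice)
```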
You can specify a top_k value, which instructs the model to choose only from the k tokens with the highest probability. If you set k to 3, you are telling the model to pick from the 3 most probable tokens. In our example, those are Apple: 0.2, Orange: 0.1, and Banana: 0.03.
This approach introduces some randomness while still emphasizing the most probable completion words, which makes the generated text more likely to sound reasonable.
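A minimal sketch of top_k sampling with the same toy distribution (k = 3, as in the example above):

```python
import numpy as np

words = ["Apple", "Orange", "Blueberry", "Banana"]
probs = np.array([0.2, 0.1, 0.02, 0.03])

k = 3
# Keep only the k most probable tokens, renormalize, and sample among them.
top_k_idx = np.argsort(probs)[::-1][:k]
top_k_probs = probs[top_k_idx] / probs[top_k_idx].sum()

next_word = np.random.choice(np.array(words)[top_k_idx], p=top_k_probs)
print(next_word)  # one of: Apple, Orange, Banana
```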
We can use a top_p setting as well. Here, to restrict the random sampling, the model chooses from the smallest set of highest-probability tokens whose combined probability reaches p. If you set p to 0.3, you are instructing the model to sample only from tokens whose cumulative probability adds up to 0.3. So, as per our example, the options are Apple: 0.2 and Orange: 0.1 (0.2 + 0.1 = 0.3).
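A similar sketch for top_p sampling, with p = 0.3 as in the example (again using the assumed toy probabilities):

```python
import numpy as np

words = ["Apple", "Orange", "Blueberry", "Banana"]
probs = np.array([0.2, 0.1, 0.02, 0.03])

p = 0.3
# Sort tokens by probability, keep the smallest prefix whose cumulative
# probability reaches p, then renormalize and sample within that set.
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])
cutoff = int(np.searchsorted(cumulative, p)) + 1  # first index reaching p

nucleus_idx = order[:cutoff]
nucleus_probs = probs[nucleus_idx] / probs[nucleus_idx].sum()

next_word = np.random.choice(np.array(words)[nucleus_idx], p=nucleus_probs)
print(next_word)  # Apple or Orange for p = 0.3
```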
One more parameter you can use to control the randomness of the model output is temperature. It influences the shape of the probability distribution the model calculates for the next token: a higher temperature flattens the distribution and increases randomness, while a lower temperature sharpens it and decreases randomness. The right setting depends on the objective of your task. If you are a writer and want to load your content with creativity, you will use a higher temperature setting. But if you are working on scientific content, a lower temperature, with less randomness and a more direct answer, is better for you.
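Here is a minimal sketch of how temperature reshapes the distribution before sampling; the raw logit values are made up purely for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before applying softmax.

    A low temperature sharpens the distribution (closer to greedy decoding);
    a high temperature flattens it (more random, more "creative").
    """
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp / exp.sum()

# Assumed raw scores for Apple, Orange, Blueberry, Banana (illustrative only).
logits = [2.0, 1.0, 0.1, 0.2]

print(softmax_with_temperature(logits, temperature=0.5))  # peaked: nearly greedy
print(softmax_with_temperature(logits, temperature=1.0))  # the model's own distribution
print(softmax_with_temperature(logits, temperature=2.0))  # flatter: more varied output
```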
I hope this briefing helps you to understand these concepts better. Please share your input and thoughts.