Course: Introduction to Large Language Models
What are tokens?
- [Instructor] Large language models generate text word by word, right? Not quite. They generate tokens. So what are tokens? Basically, each word is split into subwords, and one token corresponds to around four characters of text. Let's head over to the OpenAI website to get a good visual example of what tokens are. So this is the Tokenizer on the OpenAI website. Let me go ahead and scroll down a bit. Now I'm going to enter some text into the Tokenizer: tokenization is the process of splitting words into smaller chunks, or tokens. Each of the different colors corresponds to a token. In general, you can see that most words correspond to a single token, which includes the space in front of the word. There are a couple of exceptions. For example, the word tokenization is made up of two tokens, token and ization. The sentence I've typed has 12 words. This corresponds to 14 tokens, or 77 characters…
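If you'd like to reproduce this outside the browser, here is a minimal sketch using OpenAI's open-source tiktoken library (not shown in the video; installing it with pip install tiktoken is assumed). The "r50k_base" encoding roughly matches the GPT-3-era tokenizer the website demo used, so counts may differ with newer encodings.

# A minimal sketch, assuming the tiktoken library is installed
# (pip install tiktoken). "r50k_base" approximates the GPT-3-era
# tokenizer; newer encodings may split the text differently.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
text = "Tokenization is the process of splitting words into smaller chunks or tokens."

token_ids = enc.encode(text)                   # integer token IDs
tokens = [enc.decode([t]) for t in token_ids]  # each ID back to its text piece

print(tokens)          # e.g. ['Token', 'ization', ' is', ' the', ...]
print(len(token_ids))  # expected: 14 tokens
print(len(text))       # expected: 77 characters

Running this should reproduce the 12-word, 14-token, 77-character breakdown described above, including the space that attaches to the front of most tokens.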