Tokenization: How Tokens Shape AI Efficiency and Cost
Matias Undurraga Breitling
Enterprise Technologist @ AWS | Transformation, Strategic Tech Planning
"Not all tokenizers are created equal thus enter discussion"
In the diverse landscape of generative AI, understanding tokens and their influence on AI models and pricing is crucial. This article aims to shed some light on the concept of tokens, explore how they vary across different languages, models, and reveal their impact on the cost of using AI systems.
What is a Token?
In the context of generative AI, a token is a unit of text, be it a word, part of a word, or even a single character, used to break natural language down into manageable "chunks".
A rough rule of thumb for English is that 1 token corresponds to about 0.75 of a word; you may also have heard that 1 token is roughly 4 characters. The examples that follow will make this concrete.
When you input text into an AI model, the tokenizer breaks down the text into these manageable pieces, allowing the model to process and analyze the information more effectively and predict the next token.
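To make this concrete, here is a minimal sketch using tiktoken, an open-source tokenizer library from OpenAI. It is only an illustration: the models discussed later in this article each ship their own tokenizer, so the exact splits and counts will differ.

```python
import tiktoken

# Load a BPE encoding (assumption: cl100k_base, used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization shapes AI efficiency and cost."
token_ids = enc.encode(text)

print("Characters:", len(text))
print("Tokens:    ", len(token_ids))
# Decode each id individually to see the actual "chunks" the model receives.
print([enc.decode([t]) for t in token_ids])
```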
With these chunks in hand, the model's task is to predict the next token given the ones that came before.
Tokenizer in an AI Model
A tokenizer is a tool within AI models that splits text into tokens. This process is essential because it transforms raw text into a structured format that the AI can understand. Different AI models may use different tokenization methods depending on their architecture and the tasks they are designed to perform. For example, some models might tokenize at the word level, while others focus on subwords or characters, impacting how the AI interprets the input.
Here is an example comparing three models, from AI21 Labs, Amazon, and Meta. I used the same prompt for all three: "Hello, could you provide me with a short overview of Amazon's history?". The input token count across the three models varies from 10 to 29 tokens.
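The tokenizers behind those three Bedrock models are not all publicly downloadable, so as an illustrative stand-in, the sketch below runs the same prompt through two openly available Hugging Face tokenizers to show how the count changes from one tokenizer to another.

```python
from transformers import AutoTokenizer

prompt = "Hello, could you provide me with a short overview of Amazon's history?"

# Two publicly available tokenizers, used purely to illustrate the spread;
# the AI21 Labs, Amazon, and Meta models each use their own tokenizer.
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(prompt, add_special_tokens=False)
    print(f"{name:20s} -> {len(ids)} tokens")
```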
Next, I wanted to understand the impact of the umlaut ä on a word's token count. I tried the German word Häuser (meaning houses), and then Hauser, to isolate the effect of the ä. The token count changed by +1, but the context and the response were also way off, since the umlaut carries relevant meaning.
In German, you can substitute "ä" with "ae". This is a common practice when special characters like umlauts (ä, ö, ü) are not available on a keyboard, or in situations where only ASCII characters are permitted. With Haeuser, the token count stays similar to the umlaut version, but the response again lost all context.
Now the interesting part: instead of Häuser, I used the English word Houses. Both have six characters, yet the token count decreases as the language changes to English.
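If you want to reproduce this experiment yourself, a sketch along these lines works; tiktoken's cl100k_base encoding is assumed here purely as a stand-in for the model tokenizers I tested, so your counts may differ.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer

for word in ["Häuser", "Haeuser", "Houses"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word:8s} -> {len(ids)} token(s): {pieces}")
```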
Token Differentiation in Languages or Using Special Characters
Tokenization becomes particularly interesting when dealing with different languages or special characters. For instance, the word "über" in German is tokenized differently than "uber" in English due to the special character "ü." In some models, "über" might be split into more tokens than "uber," reflecting the complexity added by special characters or accents. This differentiation is crucial in multilingual AI systems, as it affects the model's ability to understand and generate text accurately across different languages.
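One reason for this is that many modern tokenizers operate on UTF-8 bytes, and "ü" alone occupies two bytes. The quick check below assumes tiktoken's cl100k_base encoding as a representative byte-level BPE tokenizer.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in byte-level BPE tokenizer

for word in ["über", "uber"]:
    n_bytes = len(word.encode("utf-8"))   # "ü" alone takes 2 bytes in UTF-8
    n_tokens = len(enc.encode(word))
    print(f"{word}: {n_bytes} bytes, {n_tokens} token(s)")
```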
Understanding Input and Output in AI Models
In discussions about generative AI models, terms like "context window", "input max window" or "output max window" frequently come up, referring to the maximum number of tokens a model can process in a single batch or the output of that process. This token count is crucial for determining how much text the model can handle at once. However, the efficiency of the tokenizer plays a significant role in how these numbers translate into actual performance. A highly efficient tokenizer might break down text into fewer tokens, making a 28k token window nearly equivalent in functional capacity to a less efficient tokenizer with a 32k window. Thus, when comparing models, it's important to consider not just the maximum token count but also how the tokenizer handles different types of text and language - as this can greatly affect the context window.
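A back-of-the-envelope calculation makes the point. The tokens-per-word ratios below are illustrative assumptions, not measured values for any particular model.

```python
def effective_words(window_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of English words that fit in the context window."""
    return int(window_tokens / tokens_per_word)

# Hypothetical model A: 28k window, efficient tokenizer (~1.2 tokens/word).
# Hypothetical model B: 32k window, less efficient tokenizer (~1.4 tokens/word).
print(effective_words(28_000, 1.2))  # ~23,300 words
print(effective_words(32_000, 1.4))  # ~22,800 words
```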
The Influence of Tokenization on Price Per Token
The cost of using a generative AI model often depends on the number of tokens processed. This pricing model means that the way text is tokenized can directly impact the cost of an operation. More complex tokenization that results in a higher token count could lead to higher costs for the same text. For businesses and developers, understanding this relationship is essential for optimizing expenses, especially when processing large volumes of text.
Price lists typically show a per-token rate for each model, but they never mention the efficiency of the tokenizer, which is exactly what would make the pricing comparable.
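A small worked example shows why the two numbers have to be read together. The model names, prices, and token counts below are placeholder assumptions, not real rates or measurements.

```python
# Placeholder prices in USD per 1,000 input tokens and hypothetical token
# counts for the same document under two different tokenizers.
scenarios = {
    "model_a": {"price_per_1k": 0.0008, "tokens_for_doc": 9_000},  # less efficient tokenizer
    "model_b": {"price_per_1k": 0.0010, "tokens_for_doc": 6_500},  # more efficient tokenizer
}

for name, s in scenarios.items():
    cost = s["tokens_for_doc"] / 1_000 * s["price_per_1k"]
    print(f"{name}: {s['tokens_for_doc']} tokens -> ${cost:.4f}")

# model_b has the higher per-token price but is cheaper for this document,
# because its tokenizer produces fewer tokens.
```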
Takeaway: Test Different Models, Simplify Input
To effectively manage costs and performance, it's beneficial to experiment with different AI models to see how they tokenize text. Simplifying the input text, for example by standardizing or removing special characters, removing stop words, or summarizing the input, will reduce token counts and, consequently, costs. But you might lose context, as we saw earlier with Häuser and Hauser.
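As a rough sketch of that trade-off, the snippet below strips a hand-picked stop-word list before counting tokens with tiktoken; both the stop-word list and the tokenizer are illustrative choices.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hand-picked stop words for illustration only; a real pipeline would need a
# proper list and a decision about which words are safe to drop.
STOP_WORDS = {"a", "an", "the", "of", "with", "could", "you", "me"}

prompt = "Hello, could you provide me with a short overview of Amazon's history?"
simplified = " ".join(w for w in prompt.split() if w.lower() not in STOP_WORDS)

print(len(enc.encode(prompt)), "tokens before:", prompt)
print(len(enc.encode(simplified)), "tokens after: ", simplified)
```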
You can also compress prompts, as demonstrated by LLMLingua.
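For reference, the snippet below follows the usage pattern from the LLMLingua README at the time of writing; argument names and return keys may differ in newer versions, so treat it as an outline rather than a definitive recipe.

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads a small compression model on first use

long_prompt = "... your long context here, e.g. retrieved documents or chat history ..."

result = compressor.compress_prompt(
    long_prompt,
    target_token=300,  # ask for roughly 300 tokens back
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```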
Or the model itself can ship with a "better" tokenizer, as Meta describes for Llama 3:
... Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language much more efficiently, which leads to substantially improved model performance...