Tokenization: How Tokens Shape AI Efficiency and Cost
[Cover image: abstract symbols of various languages and a glowing AI model against a gradient blue background.]




"Not all tokenizers are created equal thus enter discussion"


In the diverse landscape of generative AI, understanding tokens and their influence on AI models and pricing is crucial. This article aims to shed some light on the concept of tokens, explore how they vary across languages and models, and show how they affect the cost of using AI systems.


What is a Token?

In the context of generative AI, a token is a unit of text—a word, part of a word, or even a single character—used to break natural language down into manageable "chunks".

Here is an example of what tokens look like (you can try it yourself in OpenAI's tokenizer):

https://platform.openai.com/tokenizer

We are often told that, for English, 1 token is normally about 0.75 of a word; you might also have heard that 1 token is roughly 4 characters. The rules of thumb below, and the quick check right after them, make this concrete; clearer examples follow as we move forward.

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ of a word
  • 100 tokens ~= 75 words
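
As a quick sanity check of these rules of thumb, here is a short Python snippet using OpenAI's open-source tiktoken library. The cl100k_base encoding is just one example I chose; other models' tokenizers will give somewhat different ratios.

```python
# A quick sanity check of the rules of thumb above, using OpenAI's open-source
# tiktoken library (pip install tiktoken). cl100k_base is one specific encoding;
# other models' tokenizers will give somewhat different numbers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization shapes both the efficiency and the cost of generative AI."
tokens = enc.encode(text)

print("characters:", len(text))
print("words:     ", len(text.split()))
print("tokens:    ", len(tokens))
print("chars per token:", round(len(text) / len(tokens), 2))
print("words per token:", round(len(text.split()) / len(tokens), 2))
```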


When you input text into an AI model, the tokenizer breaks down the text into these manageable pieces, allowing the model to process and analyze the information more effectively and predict the next token.


You might want to see what we mean by next-token prediction:
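
Since an interactive demo cannot be embedded here, the sketch below uses the small, public GPT-2 model from Hugging Face (my own stand-in, chosen only because it is freely downloadable) to show the model's top candidates for the next token:

```python
# A minimal sketch of next-token prediction, using the small public GPT-2 model
# purely for illustration; larger proprietary models work on the same principle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The last position holds the model's score for every possible *next* token.
next_token_scores = logits[0, -1]
top = torch.topk(next_token_scores, k=5)

for token_id, score in zip(top.indices.tolist(), top.values.tolist()):
    print(repr(tokenizer.decode(token_id)), round(score, 2))
```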


Tokenizer in an AI Model

A tokenizer is a tool within AI models that splits text into tokens. This process is essential because it transforms raw text into a structured format that the AI can understand. Different AI models may use different tokenization methods depending on their architecture and the tasks they are designed to perform. For example, some models might tokenize at the word level, while others focus on subwords or characters, impacting how the AI interprets the input.
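
To make the word-level / subword / character distinction concrete, here is a small sketch; GPT-2's tokenizer is used only as a readily available example of a subword (BPE) tokenizer:

```python
# Illustrating the three granularities mentioned above on a single word.
# GPT-2's BPE tokenizer stands in for "subword"; word and character splits are plain Python.
from transformers import AutoTokenizer

word = "tokenization"

print("characters:", list(word))          # character-level: one token per character

bpe = AutoTokenizer.from_pretrained("gpt2")
print("subwords:  ", bpe.tokenize(word))  # subword-level: learned fragments

print("words:     ", word.split())        # word-level: the whole word is one token
```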


Here is an example comparing three models from AI21 Labs, Amazon, and Meta. I used the same prompt for all three: "Hello, could you provide me with a short overview of Amazon's history?". The input token count across the three models varies from 10 to 29 tokens.


https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/chat-playground
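
You can reproduce the spirit of this comparison with openly available tokenizers. The models below are my own stand-ins, not the three Bedrock models above, so treat the counts as illustrative of the spread rather than as the exact numbers:

```python
# Counting the same prompt with a few publicly available tokenizers.
# These are NOT the tokenizers behind the Bedrock models in the screenshot,
# so the absolute numbers will differ, but the spread makes the same point.
from transformers import AutoTokenizer

prompt = "Hello, could you provide me with a short overview of Amazon's history?"

for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} {len(tok.encode(prompt))} tokens")
```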


Now, I wanted to understand the impact of the umlaut ä on a word and on the token count in German. I tried the word Häuser, meaning houses, and also tried Hauser to understand the impact of the ä. The token count changed by +1, but the context and the response were also way off, since the umlaut provides relevant context.



In German, you can substitute "ä" with "ae." This is a common practice when special characters like umlauts (ä, ö, ü) are not available on a keyboard, or in situations where only ASCII characters are permitted. In this case the token count stays close to the umlaut version, but the response again lost all context.



Now the interesting part: instead of Häuser, I use the English word Houses. Both have six characters, yet the token count decreases as the language changes.

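If you want to reproduce this outside the Bedrock console, a tokenizer you can run locally shows the same effect. The cl100k_base encoding here is my stand-in, not the tokenizer of the models in the screenshots:

```python
# Token counts for the variants discussed above. tiktoken's cl100k_base encoding
# is used as a stand-in: the Bedrock models use their own tokenizers, so the
# absolute counts will differ, but the pattern (umlauts cost extra) is typical.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["Häuser", "Hauser", "Haeuser", "Houses"]:
    ids = enc.encode(word)
    print(f"{word:8s} -> {len(ids)} token(s): {ids}")
```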


Token Differentiation in Languages or Using Special Characters

Tokenization becomes particularly interesting when dealing with different languages or special characters. For instance, the word "über" in German is tokenized differently than "uber" in English due to the special character "ü." In some models, "über" might be split into more tokens than "uber," reflecting the complexity added by special characters or accents. This differentiation is crucial in multilingual AI systems, as it affects the model's ability to understand and generate text accurately across different languages.
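
A quick way to inspect this splitting in practice, again using tiktoken's cl100k_base encoding as an example rather than any particular product's tokenizer:

```python
# Peeking at the actual pieces a BPE tokenizer produces for "über" vs "uber".
# decode_single_token_bytes shows the raw bytes behind each token id.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["über", "uber"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(word, "->", ids, pieces)
```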


Understanding Input and Output in AI Models

In discussions about generative AI models, terms like "context window", "input max window" or "output max window" frequently come up, referring to the maximum number of tokens a model can process in a single request or produce as output. This token count is crucial for determining how much text the model can handle at once. However, the efficiency of the tokenizer plays a significant role in how these numbers translate into actual capacity. A highly efficient tokenizer might break text down into fewer tokens, making a 28k-token window nearly equivalent in functional capacity to a 32k window paired with a less efficient tokenizer. Thus, when comparing models, it's important to consider not just the maximum token count but also how the tokenizer handles different types of text and language, as this can greatly affect the effective context window.
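
A tiny back-of-the-envelope calculation makes this concrete; the window sizes and tokens-per-word ratios below are invented for illustration only:

```python
# Back-of-the-envelope comparison of two hypothetical models. The window sizes
# and tokens-per-word ratios are illustrative assumptions, not measured values.
models = {
    "A: 32k window, less efficient tokenizer": {"window": 32_000, "tokens_per_word": 1.45},
    "B: 28k window, more efficient tokenizer": {"window": 28_000, "tokens_per_word": 1.25},
}

for name, m in models.items():
    usable_words = m["window"] / m["tokens_per_word"]
    print(f"{name}: ~{usable_words:,.0f} words of usable context")
```

With these illustrative numbers, model B's smaller window actually holds slightly more text than model A's bigger one.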


The Influence of Tokenization on Price

The cost of using a generative AI model often depends on the number of tokens processed. This pricing model means that the way text is tokenized can directly impact the cost of an operation. More complex tokenization that results in a higher token count could lead to higher costs for the same text. For businesses and developers, understanding this relationship is essential for optimizing expenses, especially when processing large volumes of text.
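
Here is a small sketch of how this adds up per request; the prices are placeholders I made up, so substitute the real per-1K-token rates of your provider:

```python
# Rough per-request cost estimate. The prices are placeholders, not real list prices;
# plug in the actual per-1K-token rates of whatever model you use.
def estimate_cost(input_tokens, output_tokens,
                  price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    """Cost in USD for one request, given per-1K-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# The same prompt tokenized into 10 vs 29 input tokens (as in the Bedrock example)
# costs almost 3x more on the input side before the model even answers.
print(f"{estimate_cost(10, 200):.6f} USD")
print(f"{estimate_cost(29, 200):.6f} USD")
```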


Here is an example of per-token prices across models - but tokenizer efficiency is never mentioned, even though it is exactly what would make these prices truly comparable.

https://medium.com/@daniellefranca96/commercial-models-price-comparison-dc5837acc7b6


Takeaway: Test Different Models, Simplify Input

To effectively manage costs and performance, it's beneficial to experiment with different AI models to see how they tokenize text. Simplifying the input text - standardizing characters, removing special characters, removing stop words, or summarizing the input - will reduce token counts and, consequently, costs, but you might lose context, as we saw previously with the Häuser and Hauser example.
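
Here is a rough sketch of one such simplification, dropping a hand-picked list of stop words; it is deliberately crude, which is exactly why the caveat about lost context matters:

```python
# A crude illustration of "simplify the input": stripping a small, hand-picked set
# of stop words lowers the token count, but it can also strip meaning, exactly as
# the Häuser vs Hauser example showed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
STOP_WORDS = {"a", "an", "the", "of", "with", "could", "you", "me", "please"}

prompt = "Hello, could you provide me with a short overview of Amazon's history?"
simplified = " ".join(
    w for w in prompt.split() if w.lower().strip(",.?!") not in STOP_WORDS
)

print(len(enc.encode(prompt)), "tokens:", prompt)
print(len(enc.encode(simplified)), "tokens:", simplified)
```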


You can also compress prompts, for example with LLMLingua:
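
A minimal sketch of what that looks like with the llmlingua package; I am going from the project's README, so treat the exact parameter names and result keys as assumptions and verify against the current documentation:

```python
# A minimal sketch following LLMLingua's documented usage (pip install llmlingua).
# The parameter names and result keys below reflect the project's README at the
# time of writing and are assumptions on my part; check the current docs.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads its default compression model on first use

long_prompt = "..."  # your long context goes here
result = compressor.compress_prompt(long_prompt, target_token=200)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```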


Or the model itself can be improved, achieving better results by shipping with a "better" tokenizer, as Meta describes for Llama 3:

... Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language much more efficiently, which leads to substantially improved model performance...
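
You can check the quoted claim yourself by comparing the two tokenizers. The sketch below assumes you have access to Meta's gated Hugging Face repos (the exact repo names may change), and the vocabulary size is simply what the tokenizer reports:

```python
# Comparing the Llama 2 and Llama 3 tokenizers from the Hugging Face Hub.
# Both repos are gated, so this assumes you have accepted Meta's license and are
# logged in with `huggingface-cli login`.
from transformers import AutoTokenizer

sentence = "Tokenization efficiency directly affects context windows and cost."

for name in ["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size {len(tok)}, {len(tok.encode(sentence))} tokens")
```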


Jonathon Croydon

Insurance - Product Design - MIC Global

6 months ago

If you use agents and are calling the OpenAI API, look at the cost: work out what you should be paying from the OpenAI cost summary, make 10 API requests, and see how much they actually cost you. You may be paying extra tokens for agents.

Basant Kumar

SEO Specialist | Driving Online Visibility

7 months ago

I love diving deep into the nuances of tokenization methods. Let's keep exploring these fascinating differences. #AI #MachineLearning

Exciting insights. Can't wait to dive into the details of tokenization models.

Vincent Valentine

CEO at Cognitive.Ai | Building Next-Generation AI Services | Available for Podcast Interviews | Partnering with Top-Tier Brands to Shape the Future

7 months ago

Tokenization varies across models, creating intriguing differences. A token efficiency ratio could indeed enhance model comparison. What's your take on this fascinating subject? Matias Undurraga Breitling
