A New Approach to Tokenization
Image Source: Generated using Midjourney


“Tokens,” in the context of AI, are the individual units into which data is divided for processing. For example, when we type or speak to a transformer model such as Claude or LLaMA, our words are broken down into tokens representing words and parts of words. The AI then uses these tokens to understand and generate responses.
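To make this concrete, here is a minimal, purely illustrative sketch of how a sentence might be split into tokens. The tiny vocabulary and the greedy longest-match rule below are invented for demonstration; production models use learned schemes such as byte-pair encoding.

```python
# Illustrative only: a toy lookup-based tokenizer. Real models learn their
# vocabularies from data; the entries below are made up for this example.
TOY_VOCAB = {"token": 1, "ization": 2, "is": 3, "fun": 4, "<unk>": 0}

def toy_tokenize(text: str) -> list[int]:
    """Greedily match the longest known piece at each position of each word."""
    ids = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, fall back to <unk>.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in TOY_VOCAB:
                    ids.append(TOY_VOCAB[piece])
                    start = end
                    break
            else:
                ids.append(TOY_VOCAB["<unk>"])
                start += 1
    return ids

print(toy_tokenize("Tokenization is fun"))  # -> [1, 2, 3, 4]
```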

Traditional Large Language Model (LLM) training involves predicting the next token in a sequence, which requires processing every previous token individually. However, as AI models – particularly LLMs – grow in size and complexity, so do the computational resources required to train them. In other words, training cost climbs steeply with model size and the volume of training data, leading to high compute bills and slower iteration.
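For readers who want to see the mechanics, below is a hedged PyTorch sketch of the standard next-token objective: inputs and targets are the same sequence shifted by one position, and the loss is cross-entropy over the vocabulary. The `model` here is a placeholder for any causal language model that returns per-position logits; it is not code from the paper.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token objective: every position predicts the token after it.

    token_ids: (batch, seq_len) integer tensor.
    model:     any causal LM returning logits of shape (batch, seq_len, vocab).
    """
    inputs = token_ids[:, :-1]   # positions 0 .. T-2
    targets = token_ids[:, 1:]   # positions 1 .. T-1 (shifted by one)
    logits = model(inputs)       # (batch, T-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),
    )
```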

Now, as AI moves closer to the business core, the search for faster and more efficient training methods is at the center of a wide range of research. In today’s AI Atlas, I explore a recent paper that has excited AI researchers by introducing a new training method for LLMs that could halve the cost of training such models.


Overview of the research

The paper introduces “patch-level training,” which rethinks the unit of text a model is trained on. The choice of unit matters: whole words may be more suitable for general text generation, while sub-words and characters can be better for tasks requiring a fine-grained understanding of language nuances. The study found that the choice of token type can significantly influence the model's accuracy and efficiency on different tasks.
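As a quick illustration of the trade-off, the toy snippet below compares how long the same sentence becomes at word level versus character level; sub-word schemes sit between the two extremes. It is illustrative only and does not use the paper's tokenizer.

```python
text = "Patch-level training can halve the cost of pretraining"

word_units = text.split()  # coarse units: short sequence, large vocabulary needed
char_units = list(text)    # fine units: long sequence, tiny vocabulary

print(f"{len(word_units)} word-level units")
print(f"{len(char_units)} character-level units")
# Sub-word tokenizers (e.g. byte-pair encoding) land between these extremes,
# trading sequence length against vocabulary size and handling of rare words.
```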

Patch-level training compresses multiple tokens into a group, or “patch,” thus reducing the overall sequence length the model needs to process and drastically cutting the computational load. First, the model is fed sequences of patches and trained to predict the next patch. Then, the model continues training on the remaining data at the token level to align with how it will be used in practice. The researchers tested this method on a range of model sizes, from 370 million to 2.7 billion parameters, and demonstrated that patch-level training could reduce training costs by up to 50% without compromising performance.
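To show roughly how the patch-level stage works, here is a simplified PyTorch sketch. The details are my assumptions for illustration, not necessarily the paper's exact formulation: each patch embedding is taken as the mean of its K token embeddings, and each patch position is trained to predict every token of the following patch. The second stage then switches back to ordinary next-token training (as in the earlier snippet), initialized from these weights.

```python
import torch
import torch.nn.functional as F

def patch_level_loss(model, embed, token_ids: torch.Tensor, K: int = 4) -> torch.Tensor:
    """Sketch of the patch-level stage with patch size K.

    Assumptions for illustration (not necessarily the paper's exact setup):
    - a patch embedding is the mean of its K token embeddings;
    - each patch position predicts all K tokens of the *next* patch.

    token_ids: (batch, seq_len) integers; embed maps token ids -> hidden vectors;
    model maps (batch, patches, hidden) -> (batch, patches, vocab) logits.
    """
    B, T = token_ids.shape
    P = T // K                               # number of whole patches
    token_ids = token_ids[:, : P * K]        # drop any ragged tail

    tok_emb = embed(token_ids)                           # (B, P*K, H)
    patch_emb = tok_emb.reshape(B, P, K, -1).mean(dim=2) # mean-pool into (B, P, H)

    logits = model(patch_emb[:, :-1])                    # predictions for next patches
    targets = token_ids.reshape(B, P, K)[:, 1:]          # the K tokens of each next patch

    # One distribution per patch is scored against all K tokens it should cover.
    logits = logits.unsqueeze(2).expand(-1, -1, K, -1)   # (B, P-1, K, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Because the sequence the backbone sees is K times shorter, each training step touches K times more raw text for roughly the same compute, which is where the claimed cost savings come from.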


What is the significance of these findings, and what are their limitations?

Patch-level training offers a promising approach to reducing the computational costs of training LLMs without compromising on performance. Furthermore, by compressing sequences into patches, the method enables more efficient knowledge acquisition, making it a valuable tool for future AI development.

  • Efficiency: Understanding optimal token types enhances the speed and performance of AI models. For instance, using whole words rather than word parts can reduce computational overhead for tasks requiring rapid processing such as live transcription.
  • Accuracy: Tailoring token usage to specific tasks can also lead to more precise AI-generated responses. Relying on granular word parts, for example, can improve the model's handling of rare or complex words, enhancing text comprehension.
  • Usability: These findings can help develop more user-friendly AI applications by aligning them with specific needs. For instance, character-level tokens might be ideal for languages with complex structures, improving AI versatility.


However, these are early results from a controlled study, which may not fully represent real-world scenarios. Important considerations for further research include:

  • Scalability: Applying optimized tokenization strategies in large-scale applications might present challenges over time. For example, maintaining the effectiveness of token-level customizations across various applications and large datasets can be complex and resource-intensive.
  • Generalization: Models trained with patch-level tokenization on specific datasets might struggle with new types of data. For example, a model trained on medical images could perform poorly on satellite images because the features and patterns it learned are not applicable to the new context.
  • Complexity: Adapting AI models to utilize different tokens for specific tasks is technically challenging and requires specialized feature engineering, where important characteristics of the data are extracted for use in machine learning. When done poorly, this can actually have a negative impact on resource efficiency and inference speed.


Applying these learnings practically

Understanding how tokens work and their impact on AI performance can lead to more advanced and capable AI systems, with significant benefits in applications such as:

  • Customer service: Efficient token selection enables AI models to provide faster and more accurate responses to customer inquiries, improving overall customer satisfaction.
  • Data analysis: AI is capable of efficiently processing large volumes of data, empowering businesses to make better-informed decisions. Optimized token handling can improve the accuracy and speed of extracting insights from such vast datasets.
  • Language translation: Improved token handling leads to more accurate translations. By understanding and implementing the best token types for each language, AI can provide nuanced and reliable communication to enterprises that deal across borders.

Jennifer Jordan

Venture Partner, iGlobe Partners


I would add that while some methods to accelerate training will be software driven, there is also a need for fundamentally better processors that manage orders of magnitude more computations, solve memory to compute problems, and reduce energy/heat consumption. Murat Onen's EvaCorp here in Cambridge is one company in hot pursuit with a novel analog chip.

Matt Graham

CEO @ Rapid Dev


Great information, thank you!

Kevin Petrie

Vice President of Research at BARC


Thanks Rudina, this is very interesting. How is a patch different from a chunk? Data and AI teams have been employing chunking techniques for some time.
