Decoding Tokenization: The Building Block of Large Language Models (LLMs)
Today, let’s dive into one of the foundational aspects of LLMs: Tokenization.
Imagine taking a vast, complex puzzle (text) and breaking it into smaller, manageable pieces; these pieces are called tokens. Tokens are the fundamental units that LLMs process: they are the currency of a model's input and output layers. Let's explore the different methods of tokenization and their nuances:
1️⃣ Word Tokens
🔹 What it does: Splits text into individual words.
🔹 Example: "Tokenization is fun" → ["Tokenization", "is", "fun"].
🔹 Drawback: Struggles with out-of-vocabulary (OOV) words. For instance, it wouldn't know what to do with "AImazing" if that word wasn't in its vocabulary (see the short sketch below).
🔹 Fun fact: Word2Vec used this method back in the day!
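A minimal sketch of word-level tokenization in plain Python; the tiny vocabulary and the <unk> token here are made up purely for illustration:

# A toy word-level tokenizer: split on whitespace and map each word to an id.
vocab = {"tokenization": 0, "is": 1, "fun": 2, "<unk>": 3}  # hypothetical vocabulary

def word_tokenize(text: str) -> list[str]:
    return text.lower().split()

def encode(text: str) -> list[int]:
    # Any word missing from the vocabulary collapses to <unk> -- the OOV problem.
    return [vocab.get(word, vocab["<unk>"]) for word in word_tokenize(text)]

print(word_tokenize("Tokenization is fun"))   # ['tokenization', 'is', 'fun']
print(encode("Tokenization is AImazing"))     # [0, 1, 3] -- "aimazing" became <unk>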
2️⃣ Subword Tokens
🔹 What it does: Breaks text into partial words when needed.
🔹 Example: "AImazing" → ["AI", "##mazing"].
🔹 Advantage: Handles OOV words by breaking them into known subword components (the sketch below shows this with BERT's tokenizer).
🔹 Pro tip: A special symbol (e.g., "##") indicates that a token continues the previous token.
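To see the "##" continuation pieces in practice, here is a small sketch using the Hugging Face transformers library with the bert-base-uncased tokenizer (assuming transformers is installed; the exact pieces it produces may differ from the simplified example above):

from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (the vocabulary is downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# An out-of-vocabulary word is broken into known subword pieces;
# pieces prefixed with "##" continue the previous token.
print(tokenizer.tokenize("AImazing"))
print(tokenizer.tokenize("Tokenization is fun"))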
3️⃣ Character Tokens
🔹 What it does: Breaks text into individual characters.
🔹 Example: "Token" → ["T", "o", "k", "e", "n"].
🔹 Advantage: Handles any new word effortlessly.
🔹 Challenge: Sequences become much longer, so training and inference are computationally expensive (compare the token counts in the sketch below).
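Character tokenization is the simplest to sketch, and the comparison below also shows why its sequences get long:

# Character-level tokenization: every character (including spaces) becomes a token.
print(list("Token"))                 # ['T', 'o', 'k', 'e', 'n']

sentence = "Tokenization is fun"
print(len(sentence.split()))         # 3 word tokens
print(len(list(sentence)))           # 19 character tokens -- far longer sequences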
4️⃣ Byte Tokens
🔹 What it does: Splits text into bytes, the smallest units of data; modern LLMs typically build a Byte Pair Encoding (BPE) vocabulary on top of these bytes.
🔹 How it works: The most frequent pairs of adjacent tokens are merged iteratively until a fixed vocabulary size is reached (a toy version is sketched below).
🔹 Pros: Every string can be encoded, so there are effectively no OOV tokens, and the vocabulary stays at a fixed, manageable size.
🔹 Cons: Tokens don't always align with meaningful linguistic units, and text in less common languages or scripts often splits into many more tokens, making sequences longer and more expensive to process.
🔹 Used in: GPT-4, StarCoder2, and Llama2 for tasks ranging from natural language processing to code generation.
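Here is a heavily simplified sketch of that merge loop: count adjacent pairs, merge the most frequent one, repeat. It is a toy character-level illustration of the idea, not the exact algorithm or data any particular model uses:

from collections import Counter

def bpe_train(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from single characters (real byte-level BPE starts from bytes).
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Merge the most frequent pair everywhere it occurs.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# Frequent pieces like "lo" and "low" get merged into single tokens first.
print(bpe_train(["low", "lower", "lowest", "low"], num_merges=3))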
Tokenization in Action: LLM Highlights
BERT (2018)
🔹 Method: WordPiece (a subword method closely related to BPE).
🔹 Features: A vocabulary of roughly 30,000 tokens; special tokens such as [CLS], [SEP], and [MASK] wrap and mask the input; the uncased variant lowercases text before tokenizing (see the sketch below).
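A quick sketch of BERT's tokenizer in action (again via transformers); note the special tokens it wraps around the sequence:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("Tokenization is fun")
# Mapping the ids back to tokens shows the added [CLS] and [SEP] tokens,
# e.g. something like ['[CLS]', 'token', '##ization', 'is', 'fun', '[SEP]']
# (the exact subword split depends on the vocabulary).
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))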
GPT-4
🔹 Method: Byte-level BPE.
🔹 Features: A much larger vocabulary (on the order of 100K tokens); dedicated tokens for runs of whitespace, which makes it efficient on code; fill-in-the-middle code completion, as in the example below.
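For a quick look at GPT-4-style BPE in practice, here is a sketch using OpenAI's tiktoken library (assuming it is installed):

import tiktoken

# Load the BPE encoding associated with GPT-4.
enc = tiktoken.encoding_for_model("gpt-4")

ids = enc.encode("Tokenization is fun")
print(ids)                                # a short list of token ids
print([enc.decode([i]) for i in ids])     # the text each id maps back to

# Runs of whitespace (common in code) are covered by dedicated tokens,
# so indentation costs relatively few tokens.
print(len(enc.encode("    indented code")))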
Example: Give the model a prefix and a suffix and let it fill in the missing middle:
def fibonacci(n: int):
    """Return the Fibonacci series up to n."""
    # Everything up to here is the prefix...
    # <LLM generates the logic for the Fibonacci series here>
    # ...and everything below is the suffix
n = int(input("Enter a number: "))
print(fibonacci(n))
GPT-4 fills in the missing logic, like:
def fibonacci(n: int):
    """Return the Fibonacci series up to n."""
    sequence = [0, 1]
    while sequence[-1] + sequence[-2] < n:
        sequence.append(sequence[-1] + sequence[-2])
    return sequence
StarCoder2
🔹 Specializes in generating code!
🔹 Features: Trained primarily on source code; its tokenizer is built for code, encodes each digit as its own token, and includes special tokens for fill-in-the-middle and for metadata such as repository and file names (a prompt-layout sketch follows).
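As a sketch of how a fill-in-the-middle prompt is usually laid out for StarCoder-style models: the <fim_prefix>, <fim_suffix>, and <fim_middle> markers below follow the BigCode convention and are shown for illustration; check the model card for the exact special tokens:

# Build a fill-in-the-middle prompt: the model generates the code that
# belongs between the prefix and the suffix.
prefix = 'def fibonacci(n: int):\n    """Return the Fibonacci series up to n."""\n'
suffix = '\nn = int(input("Enter a number: "))\nprint(fibonacci(n))\n'

prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
print(prompt)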
LLaMA2
🔹 Features: A SentencePiece-based BPE tokenizer with a 32K-token vocabulary; numbers are split into individual digits, and a byte-level fallback covers characters that aren't in the vocabulary (sketched below).
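A small sketch of Llama 2's digit splitting via transformers; the meta-llama checkpoint is gated, so this assumes you have accepted the license and are logged in to the Hugging Face Hub:

from transformers import AutoTokenizer

# Llama 2 ships a SentencePiece-based BPE tokenizer with a 32K vocabulary.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Numbers are split into individual digits, e.g. "2024" -> '2', '0', '2', '4'.
print(tokenizer.tokenize("The year 2024"))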
Takeaway: Tokenization is the secret sauce of LLMs, converting text into a format they can understand while retaining meaning. It’s fascinating how each model tailors tokenization to its use case, from conversational AI to complex code generation.
What’s your favorite tokenization method? Or, which of these LLMs fascinates you the most? Let’s discuss in the comments!
Stay tuned for more deep dives into the world of LLMs!
#Tokenization #AI #LargeLanguageModels #GenerativeAI #NLP
Pictures and content are taken from Chapter 2 of Jay Alammar and Maarten Grootendorst's book, Hands-On Large Language Models.