Introducing TokenFormer: Redefining Efficient Scaling in Transformer Architectures
Navdeet Saini
Scaling large Transformer models has traditionally demanded enormous compute, because each larger model must be trained from scratch. The paper "TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters" presents a fresh approach: growing a Transformer without retraining it from scratch, enabling flexible, efficient, and reusable architectures. Here’s how TokenFormer achieves this and why it’s a game-changer in Transformer scaling!
What is TokenFormer?
TokenFormer introduces a Token-Parameter Attention (PAttention) layer in which model parameters are treated as tokens, just like the input data. This design reformulates every linear projection in the Transformer into a cross-attention layer, letting input tokens interact with model parameters more flexibly. The shift allows the model to grow incrementally, scaling efficiently from 124M to 1.4B parameters without retraining from scratch, which cuts costs and improves scalability.
How TokenFormer Works
1. Token-Parameter Attention (PAttention) Mechanism
TokenFormer employs Token-Parameter Attention (PAttention), in which model parameters are stored as learnable key-value tokens that the input tokens attend to directly. Because each former linear projection is now a cross-attention between input tokens and parameter tokens, capacity is determined by the number of parameter tokens rather than by fixed weight-matrix shapes, and the model can be enlarged without retraining from scratch.
This adjustment provides a structured way to introduce new parameters incrementally, allowing TokenFormer to scale from 124M to 1.4B parameters.
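A minimal PyTorch sketch of the idea follows. It is illustrative only, not the authors’ released implementation: the class and argument names (`PAttention`, `d_in`, `d_out`, `num_param_tokens`) are mine, and a plain scaled softmax stands in for the modified normalization described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAttention(nn.Module):
    """Sketch of a token-parameter attention layer: input tokens attend over
    learnable key/value parameter tokens instead of being multiplied by a
    fixed weight matrix."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Model parameters stored as tokens (rows), not as a dense projection matrix.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        scores = x @ self.key_params.t() / (x.size(-1) ** 0.5)  # (batch, seq_len, num_param_tokens)
        weights = F.softmax(scores, dim=-1)  # stand-in for the paper's modified normalization
        return weights @ self.value_params   # (batch, seq_len, d_out)
```

In a standard Transformer, capacity is fixed by the shapes of the projection matrices; here it is set by `num_param_tokens`, which is what makes incremental growth possible.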
2. Incremental Scaling through Zero Initialization
TokenFormer expands by appending new parameter tokens that are initialized to zero, preserving continuity with the previously trained model. This approach, shown in the figure below, lets the model grow by progressively appending key-value parameter pairs.
This design enables the model to reuse previously learned parameters and prevents model disruption, resulting in faster convergence.
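Under the same assumptions as the sketch above, growing a layer could look roughly like this (the `expand_pattention` helper is hypothetical, not code from the paper’s repository):

```python
import torch
import torch.nn as nn

def expand_pattention(layer: "PAttention", extra_tokens: int) -> None:
    """Append zero-initialized key/value parameter tokens to a PAttention layer.

    With the non-renormalizing score function described in the paper, the new
    tokens contribute nothing at first, so the enlarged layer initially behaves
    like the smaller one; with the plain softmax used in the sketch above this
    holds only approximately.
    """
    d_in = layer.key_params.size(1)
    d_out = layer.value_params.size(1)
    new_keys = torch.zeros(extra_tokens, d_in,
                           dtype=layer.key_params.dtype, device=layer.key_params.device)
    new_values = torch.zeros(extra_tokens, d_out,
                             dtype=layer.value_params.dtype, device=layer.value_params.device)
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys], dim=0))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values], dim=0))

# Example: grow a layer (the PAttention sketch above) from 1024 to 2048 parameter
# tokens, then resume training from the existing checkpoint rather than from scratch.
layer = PAttention(d_in=768, d_out=768, num_param_tokens=1024)
expand_pattention(layer, extra_tokens=1024)
```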
Performance and Results
1. Efficiency and Cost Savings
TokenFormer significantly reduces training costs by reusing parameters from its smaller versions. As shown in the table below, TokenFormer with parameter reuse performs on par with or better than scratch-trained Transformers of similar size, reaching lower perplexity on the OpenWebText dataset.
2. Model Scaling Cost Comparison
The figure below compares cumulative scaling costs for TokenFormer and traditional Transformers. Because each larger TokenFormer reuses its smaller predecessor, the cumulative cost of reaching a given size is far lower than training a Transformer independently at that size. At the 1.4B-parameter level, TokenFormer reached a perplexity of 11.77 versus 11.63 for a scratch-trained Transformer, while using only one-tenth of the training cost.
3. Zero-Shot Task Performance
TokenFormer’s zero-shot performance on NLP tasks, displayed in the table below, rivals that of larger models, achieving comparable accuracy across tasks like LAMBADA and PIQA, demonstrating its scalability and efficiency.
Visual Model Comparison
In the figure below, TokenFormer’s architecture is illustrated alongside the traditional Transformer structure, showing how the tokenized-parameter approach lets the model scale flexibly. By handling both token-token and token-parameter interactions through attention, TokenFormer replaces fixed linear projections and opens up new avenues for resource-efficient Transformer scaling.
Credits and Further Reading
This work is attributed to Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, and Bernt Schiele. For a deeper dive, access the full research paper and code here.