Introducing TokenFormer: Redefining Efficient Scaling in Transformer Architectures
Navdeet Saini
Scaling large Transformer models has traditionally demanded enormous compute, because each larger model must be trained from scratch. The paper "TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters" presents a fresh approach: growing a Transformer without retraining it from scratch, enabling flexible, efficient, and reusable architectures. Here’s how TokenFormer achieves this and why it’s a game-changer in Transformer scaling!
What is TokenFormer?
TokenFormer introduces a Token-Parameter Attention (PAttention) layer in which model parameters are treated as tokens, just like the input data. This design reformulates every linear projection in the Transformer into a cross-attention layer, letting input tokens interact with model parameters more flexibly. The shift allows the model to grow incrementally, scaling efficiently from 124M to 1.4B parameters without retraining from scratch, which cuts costs and improves scalability.
How TokenFormer Works
1. Token-Parameter Attention (PAttention) Mechanism
TokenFormer employs Token-Parameter Attention (PAttention), in which model parameters are stored as learnable key-value tokens that the input tokens attend to directly. Because each former linear projection is now a cross-attention between input tokens and parameter tokens, capacity is determined by the number of parameter tokens rather than by fixed weight-matrix shapes, and the model can be enlarged without retraining from scratch.
This adjustment provides a structured way to introduce new parameters incrementally, allowing TokenFormer to scale from 124M to 1.4B parameters.
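A minimal PyTorch sketch of the idea follows. It is illustrative only, not the authors’ released implementation: the class and argument names (`PAttention`, `d_in`, `d_out`, `num_param_tokens`) are mine, and a plain scaled softmax stands in for the modified normalization described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAttention(nn.Module):
    """Sketch of a token-parameter attention layer: input tokens attend over
    learnable key/value parameter tokens instead of being multiplied by a
    fixed weight matrix."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Model parameters stored as tokens (rows), not as a dense projection matrix.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        scores = x @ self.key_params.t() / (x.size(-1) ** 0.5)  # (batch, seq_len, num_param_tokens)
        weights = F.softmax(scores, dim=-1)  # stand-in for the paper's modified normalization
        return weights @ self.value_params   # (batch, seq_len, d_out)
```

In a standard Transformer, capacity is fixed by the shapes of the projection matrices; here it is set by `num_param_tokens`, which is what makes incremental growth possible.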
2. Incremental Scaling through Zero Initialization
TokenFormer expands by appending new parameter tokens that are initialized to zero, preserving continuity with the previously trained model. This approach, shown in the figure below, lets the model grow by progressively appending key-value parameter pairs.
This design enables the model to reuse previously learned parameters and prevents model disruption, resulting in faster convergence.
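Under the same assumptions as the sketch above, growing a layer could look roughly like this (the `expand_pattention` helper is hypothetical, not code from the paper’s repository):

```python
import torch
import torch.nn as nn

def expand_pattention(layer: "PAttention", extra_tokens: int) -> None:
    """Append zero-initialized key/value parameter tokens to a PAttention layer.

    With the non-renormalizing score function described in the paper, the new
    tokens contribute nothing at first, so the enlarged layer initially behaves
    like the smaller one; with the plain softmax used in the sketch above this
    holds only approximately.
    """
    d_in = layer.key_params.size(1)
    d_out = layer.value_params.size(1)
    new_keys = torch.zeros(extra_tokens, d_in,
                           dtype=layer.key_params.dtype, device=layer.key_params.device)
    new_values = torch.zeros(extra_tokens, d_out,
                             dtype=layer.value_params.dtype, device=layer.value_params.device)
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys], dim=0))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values], dim=0))

# Example: grow a layer (the PAttention sketch above) from 1024 to 2048 parameter
# tokens, then resume training from the existing checkpoint rather than from scratch.
layer = PAttention(d_in=768, d_out=768, num_param_tokens=1024)
expand_pattention(layer, extra_tokens=1024)
```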
Performance and Results
1. Efficiency and Cost Savings
TokenFormer significantly reduces training costs by reusing parameters from its smaller versions. As shown in the table below, TokenFormer with parameter reuse performs on par with or better than scratch-trained Transformers of similar size, reaching lower perplexity on the OpenWebText dataset.
2. Model Scaling Cost Comparison
The figure below compares cumulative scaling costs for TokenFormer and traditional Transformers. Because each larger TokenFormer reuses its smaller predecessor, the cumulative cost of reaching a given size is far lower than training a Transformer independently at that size. At the 1.4B-parameter level, TokenFormer reached a perplexity of 11.77 versus 11.63 for a scratch-trained Transformer, while using only one-tenth of the training cost.
3. Zero-Shot Task Performance
TokenFormer’s zero-shot performance on NLP tasks, displayed in the table below, rivals that of larger models, achieving comparable accuracy across tasks like LAMBADA and PIQA, demonstrating its scalability and efficiency.
Visual Model Comparison
In the figure below, TokenFormer’s architecture is illustrated alongside the traditional Transformer structure, showing how the tokenized-parameter approach lets the model scale flexibly. By handling both token-token and token-parameter interactions through attention, TokenFormer replaces fixed linear projections and opens up new avenues for resource-efficient Transformer scaling.
Credits and Further Reading
This work is attributed to Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, and Bernt Schiele. For a deeper dive, access the full research paper and code here.