Adaptive LLM: Transformer²
I came across an interesting paper titled "Transformer-Squared: Self-Adaptive LLMs" (link), published by SakanaAI, a Japanese AI company. (I have previously written an article about their method for enhancing model capabilities without training; feel free to check it out if you're interested.) SakanaAI's team includes Llion Jones, one of the authors of "Attention Is All You Need." This paper continues their tradition of focusing on algorithmic innovation rather than relying on extensive computing power (as of mid-2024, they had reportedly only just acquired their first 8x H100 GPUs). Their approach is highly creative.
The paper primarily introduces a novel fine-tuning method called Singular Value Fine-tuning (SVF) to address challenges in traditional Supervised Fine-tuning (SFT), particularly approaches based on LoRA (Low-Rank Adaptation). The main issue with traditional SFT, including LoRA, is that a single fine-tuned adapter cannot cleanly separate different downstream tasks. Additionally, because injecting new knowledge means modifying the original weight matrices, fine-tuning can inadvertently degrade performance on other tasks.
To mitigate these issues, the paper proposes working with the Singular Value Decomposition (SVD) of each weight matrix. SVD factorizes a matrix into the product of three matrices:

W = UΣVᵀ

where:

- U is a matrix whose columns are the left singular vectors of W,
- Σ is a diagonal matrix whose entries are the singular values of W, and
- Vᵀ is the transpose of V, whose columns are the right singular vectors of W.
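As a quick illustration, here is a minimal PyTorch sketch with an arbitrary toy matrix (not code from the paper):

```python
import torch

# Toy weight matrix standing in for one transformer projection (shape is arbitrary)
W = torch.randn(512, 256)

# Reduced SVD: W = U @ diag(S) @ Vh, where Vh = V^T
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
print(U.shape, S.shape, Vh.shape)  # (512, 256), (256,), (256, 256)

# The factors reconstruct W up to floating-point error
print(torch.allclose(W, U @ torch.diag(S) @ Vh, atol=1e-4))  # True
```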
The purpose of this decomposition is to facilitate SVF (Singular Value Fine-tuning).
What is SVF?
Instead of directly modifying the weight matrix W, SVF learns a vector z ∈ ℝʳ, which is then used to adjust the singular values of W, modifying its behavior in a structured manner.
For each weight matrix W, SVF learns a vector z, which independently adjusts each singular component, producing a new weight matrix:

W′ = UΣ′Vᵀ
where:
Σ′ = Σ ⊗ diag(z)
Here, diag(z) is a diagonal matrix whose diagonal elements are given by z, and ⊗ denotes element-wise multiplication, so each singular value of W is rescaled by the corresponding entry of z.
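In code, the adaptation step is tiny. A minimal sketch, assuming the weight matrix has already been decomposed (`svf_update` is an illustrative name of my own, not from the paper's code):

```python
import torch

def svf_update(U: torch.Tensor, S: torch.Tensor, Vh: torch.Tensor,
               z: torch.Tensor) -> torch.Tensor:
    """Return W' = U @ diag(S * z) @ Vh, i.e. singular values rescaled by z."""
    return U @ torch.diag(S * z) @ Vh

# Example: decompose a toy weight matrix, then rescale its singular components
W = torch.randn(512, 256)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

z = torch.ones_like(S)   # z = 1 everywhere leaves W unchanged
z[:8] = 1.5              # amplify the 8 strongest singular components
z[-8:] = 0.5             # dampen the 8 weakest ones

W_adapted = svf_update(U, S, Vh, z)
print(W_adapted.shape)   # (512, 256)
```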
Why is this approach effective?
Rather than directly altering W, SVF enables fine-grained control by scaling singular values. This technique allows optimization via reinforcement learning (RL), tuning parameters based on task performance without requiring massive datasets with explicit task explanations.
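For readers who want the shape of that objective: based on the paper's description, it is a REINFORCE-style, reward-weighted log-likelihood with a KL penalty that keeps the adapted model close to the base model. The notation below is my paraphrase, not copied from the paper:

```latex
% Reward-weighted log-likelihood with KL regularization (paraphrased):
% pi_{W'} is the model with adapted weights W', pi_W the frozen base model
\max_{z}\; J(z) =
  \mathbb{E}\!\left[\log \pi_{W'}(\hat{y} \mid x)\, r(\hat{y}, y)\right]
  - \lambda\, D_{\mathrm{KL}}\!\left(\pi_{W'} \,\|\, \pi_{W}\right)
```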
Intuition Behind SVF:
Simply put, SVF "splits" the weight matrix W into finer components. Within W, certain values may control mathematical reasoning, others may handle language understanding, and some may be responsible for historical knowledge.
During training, SVF learns a set of vectors z, where each downstream task corresponds to a specific z vector. Since Σ′ is computed from z, z essentially acts as a signal amplifier. For instance, an entry of z greater than 1 amplifies the contribution of the corresponding singular component, while an entry smaller than 1 suppresses it.
SVF utilizes reinforcement learning (RL) to learn these z vectors across a predefined set of downstream tasks.
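To make the RL step concrete, here is a self-contained toy in PyTorch: a single frozen linear map whose z vector is trained with plain REINFORCE on an invented classification "task". The task, reward, and hyperparameters are placeholders of mine and not the paper's actual setup:

```python
import torch

torch.manual_seed(0)

# Frozen "base model": one linear map standing in for a transformer weight matrix
W = torch.randn(4, 16)                      # 4 "answers", 16 input features
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# The only trainable parameters: one z entry per singular value
z = torch.nn.Parameter(torch.ones_like(S))
optimizer = torch.optim.Adam([z], lr=0.05)

# Invented "task": inputs with randomly assigned correct answers
X = torch.randn(256, 16)
y = torch.randint(0, 4, (256,))

for step in range(200):
    W_adapted = U @ torch.diag(S * z) @ Vh           # W' = U diag(S * z) V^T
    logits = X @ W_adapted.T
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                          # sampled "answers"
    reward = (actions == y).float() * 2 - 1          # +1 if correct, -1 otherwise

    loss = -(dist.log_prob(actions) * reward).mean() # REINFORCE estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("accuracy:", (logits.argmax(-1) == y).float().mean().item())
```

Note that only the r entries of z receive gradients; the SVD factors of the base weights stay frozen throughout.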
How does it work at inference time?
Once trained, inference proceeds in two passes. In the first pass, the model examines the incoming prompt to identify the task at hand; the paper explores several strategies for this step, from a simple classification prompt to a trained task classifier and a few-shot search over weighted combinations of the learned z vectors. In the second pass, the selected (or combined) z vector is used to recompute the adapted weights W′, and the model generates its final answer with those weights.
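Here is a toy sketch of that two-pass flow. The task names, expert vectors, and keyword-based classifier below are all invented placeholders, not the paper's implementation:

```python
import torch

# Toy registry of expert vectors learned per task (values are illustrative)
expert_z = {
    "math":   torch.tensor([1.4, 1.1, 0.9, 0.7]),
    "coding": torch.tensor([0.8, 1.3, 1.2, 0.6]),
}

def classify_task(prompt: str) -> str:
    """Stand-in for the first pass; the paper uses the LLM itself or a
    trained classifier here. This keyword check is purely illustrative."""
    return "math" if any(tok in prompt for tok in ("solve", "integral")) else "coding"

def two_pass_inference(prompt: str, U, S, Vh) -> torch.Tensor:
    # First pass: identify the task and pick the matching expert vector
    z = expert_z[classify_task(prompt)]
    # Second pass: adapt the weights; a real model would now generate with W'
    return U @ torch.diag(S * z) @ Vh

W = torch.randn(4, 16)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
print(two_pass_inference("solve this integral", U, S, Vh).shape)  # (4, 16)
```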
Final Thoughts
This is an ingenious idea, and its effectiveness scales with model size: the larger the model, the better the results.