Adaptive LLM Transformer²

I came across an interesting paper titled "TRANSFORMER-SQUARED: SELF-ADAPTIVE LLMS" (link). It was published by SakanaAI, a Japanese AI company. (I have previously written an article about their method for enhancing model capabilities without training; feel free to check it out if you're interested.) SakanaAI has Llion Jones on its team, one of the authors of Attention Is All You Need. This paper continues their tradition of focusing on algorithmic innovation rather than relying on extensive computing power (as of mid-2024, they had reportedly just acquired their first 8x H100 GPUs). Their approach is highly creative.

The paper primarily introduces a novel fine-tuning method called SVF (Singular Value Fine-tuning) to address challenges in traditional Supervised Fine-tuning (SFT), particularly approaches based on LoRA (Low-Rank Adaptation). The main issue with traditional SFT, including LoRA, is that a single set of fine-tuned weights cannot cleanly separate different downstream tasks. Additionally, when new knowledge is injected into the model, the modifications to the original weight matrix can inadvertently degrade performance on other tasks.

To mitigate these issues, the paper proposes using Singular Value Decomposition (SVD). SVD decomposes a matrix into three smaller matrices such that:

W = UΣV^T


where:

  • U ∈ R^{m×r} and V ∈ R^{n×r} are semi-orthogonal matrices (conceptually similar to LoRA's low-rank decomposition),
  • Σ ∈ R^{r×r} is a diagonal matrix with the singular values of W on its diagonal,
  • Each singular value σᵢ represents the contribution of the corresponding singular vector pair (uᵢ, vᵢ) to the output.

The purpose of this decomposition is to facilitate SVF (Singular Value Fine-tuning).
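To make the decomposition concrete, here is a minimal sketch (my own illustration, not from the paper) that decomposes a toy weight matrix with PyTorch's `torch.linalg.svd` and checks that the three factors reconstruct it; the matrix shape is arbitrary:

```python
import torch

# Toy stand-in for one of the model's weight matrices.
m, n = 6, 4
W = torch.randn(m, n)

# Thin SVD: U is (m, r), S holds the r singular values, Vh is (r, n), with r = min(m, n).
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Each singular value S[i] weights the rank-1 contribution of the pair (U[:, i], Vh[i, :]).
W_rebuilt = U @ torch.diag(S) @ Vh
print(torch.allclose(W, W_rebuilt, atol=1e-5))  # True: the factors recover W
```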


What is SVF?

Instead of directly modifying the weight matrix W, SVF learns a vector z ∈ R^r, which is then used to adjust the singular values of W, modifying its behavior in a structured manner.

For each weight matrix W, SVF learns a vector z, which independently adjusts each singular component, producing a new weight matrix:

W' = UΣ'V^T

where:

Σ' = Σ ⊗ diag(z)

Here, diag(z) is a diagonal matrix whose diagonal entries are the elements of z, and ⊗ denotes element-wise multiplication, so each singular value σᵢ is scaled by zᵢ.
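As a rough sketch of that step (my own code, not the authors'), the learned vector z simply rescales each singular value before the factors are multiplied back together:

```python
import torch

def apply_svf(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Return W' = U diag(S * z) Vh, i.e. W with its singular values rescaled by z."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ torch.diag(S * z) @ Vh

W = torch.randn(6, 4)

# z of all ones leaves the matrix unchanged.
assert torch.allclose(apply_svf(W, torch.ones(4)), W, atol=1e-5)

# A task-specific z amplifies some singular components and silences others.
z_task = torch.tensor([1.0, 0.5, 0.0, 2.0])
W_task = apply_svf(W, z_task)   # same shape as W, different behavior
```

In practice the SVD of each frozen weight matrix only needs to be computed once; at adaptation time only the r-dimensional vector z changes.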

Why is this approach effective?

Rather than directly altering W, SVF provides fine-grained control by only scaling its singular values. Because the trainable part is just the small vector z, it can be optimized with reinforcement learning (RL) directly against task performance, without requiring massive datasets with explicit task explanations.
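One consequence worth spelling out with rough numbers of my own (not taken from the paper): per weight matrix, SVF only trains r = min(m, n) scalars, while a LoRA adapter of rank r' trains r'·(m + n) parameters, so each task-specific vector is very cheap to store.

```python
# Back-of-envelope count for a single 4096 x 4096 projection matrix
# (a typical shape in a ~7B model; the numbers are purely illustrative).
m, n = 4096, 4096
lora_rank = 16

svf_params = min(m, n)             # one scaling factor per singular value
lora_params = lora_rank * (m + n)  # LoRA's A (m x r') and B (r' x n) factors

print(svf_params)   # 4096
print(lora_params)  # 131072 -> about 32x more than SVF for this matrix
```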

Intuition Behind SVF:

Simply put, SVF "splits" the weight matrix WWW into finer components. Within W, certain values may control mathematical reasoning, others may handle language understanding, and some may be responsible for historical knowledge.

During training, SVF learns a set of vectors z, where each downstream task corresponds to a specific z vector. Since Σ' is computed from z, z essentially acts as a signal amplifier (or attenuator) for the individual components. For instance:

  • When training on a language task, z might be [0, 1, 0.7],
  • For a math task, z could be [1, 0.5, 0].

SVF utilizes reinforcement learning (RL) to learn these z vectors across a predefined set of downstream tasks.
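Below is a deliberately tiny, self-contained toy of that training idea, assuming a REINFORCE-style policy gradient. It is not the paper's actual recipe (which trains z vectors for a full LLM against task rewards such as answer correctness, with extra regularization), but it shows the key mechanic: U, S, and Vh stay frozen and only z receives gradients.

```python
import torch

torch.manual_seed(0)

# Frozen toy "layer", decomposed once up front.
W = torch.randn(8, 8)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# The only trainable parameters: one scaling factor per singular value.
z = torch.nn.Parameter(torch.ones(S.shape[0]))
optimizer = torch.optim.Adam([z], lr=0.05)

def policy_logits(x: torch.Tensor) -> torch.Tensor:
    # Adapted layer W' = U diag(S * z) Vh, used here as a toy 8-way "policy".
    W_adapted = U @ torch.diag(S * z) @ Vh
    return x @ W_adapted

target = 3  # toy "task": reward 1 when the sampled action hits a fixed class
for step in range(200):
    x = torch.randn(8)
    dist = torch.distributions.Categorical(logits=policy_logits(x))
    action = dist.sample()
    reward = 1.0 if action.item() == target else 0.0

    # REINFORCE: raise the log-probability of rewarded samples by adjusting z alone.
    loss = -reward * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```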

How does it work in inference?

Once trained, the inference process proceeds as follows:

  1. The system analyzes the prompt to determine the nature of the task (e.g., history-related).
  2. It retrieves the corresponding z vector for that task.
  3. Using z together with the base network, it performs inference (a rough sketch of this dispatch follows below).
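Here is a minimal sketch of that dispatch, assuming a bank of per-task z vectors already produced by SVF training; `classify_task` and `generate_with_z` are hypothetical placeholders I made up for illustration, not names from the paper (which also describes more sophisticated ways to pick or mix the vectors, such as a few-shot weighted combination).

```python
import torch

r = 4096  # number of singular values per adapted matrix (illustrative)

# Hypothetical bank of task vectors produced by SVF training (stand-in values here).
z_bank = {
    "math":    torch.ones(r),
    "code":    torch.ones(r),
    "history": torch.ones(r),
}

def classify_task(prompt: str) -> str:
    # First pass: decide which expert vector fits the prompt. The paper lets the
    # model itself do this; a crude keyword check stands in for it here.
    p = prompt.lower()
    if any(w in p for w in ("solve", "integral", "prove")):
        return "math"
    if "def " in p or "class " in p:
        return "code"
    return "history"

def generate_with_z(prompt: str, z: torch.Tensor) -> str:
    # Hypothetical placeholder: rebuild every adapted matrix as U @ diag(S * z) @ Vh
    # inside the frozen base model, then decode normally.
    raise NotImplementedError

def answer(prompt: str) -> str:
    z = z_bank[classify_task(prompt)]   # second pass uses the selected expert vector
    return generate_with_z(prompt, z)
```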

Final Thoughts

This is an ingenious idea, and its effectiveness scales with model size—the larger the model, the better the results.


