Paper Review: RWKV-7 "Goose" with Expressive Dynamic State Evolution
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
RWKV-7 "Goose" is a new sequence modeling architecture that achieves state-of-the-art performance on multilingual tasks at the 3B parameter scale and matches top English models, despite using far fewer training tokens. It requires only constant memory and constant inference time per token. The architecture introduces a generalized delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. RWKV-7 can perform state tracking, recognize all regular languages, and remains parallelizable during training, surpassing the capabilities of Transformers under standard complexity-theoretic assumptions.
Background information on the delta rule
Linear attention is more efficient than softmax attention, offering constant time and memory per token, but it struggles with state accumulation: old information is never fully removed, leading to degraded output over time. Modern architectures like RWKV-6 and Mamba 2 address this using decay, but decay can’t selectively remove specific values.
DeltaNet solves this by using the Delta Rule, which updates memory on a per-key basis—replacing old values with new ones using an error-correcting mechanism similar to online learning. It treats memory updates as a form of stochastic gradient descent, improving retrieval accuracy while avoiding uncontrolled state growth.
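To make this concrete, here is a minimal NumPy sketch of a single delta-rule step in the spirit of DeltaNet; the state layout (value-by-key matrix) and the variable names are my own choices rather than the paper's notation.

```python
import numpy as np

def delta_rule_update(S, k, v, beta):
    """One step of the classic delta rule (DeltaNet-style), as a sketch.

    S    : (d_v, d_k) memory matrix
    k    : (d_k,)     key, assumed unit-norm
    v    : (d_v,)     new value to associate with k
    beta : float      in-context learning rate in [0, 1]
    """
    v_old = S @ k                             # value currently stored under k
    return S + beta * np.outer(v - v_old, k)  # move it a fraction beta toward v
```

For a unit-norm key this is algebraically the same as S(I − β k kᵀ) + β v kᵀ, i.e. one SGD step on the squared retrieval error ‖S k − v‖², which is the online-learning view described above.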
Architecture
RWKV-7 develops a more expressive version of the delta update rule by generalizing it into a diagonal plus rank one update. This new formulation enhances the model’s ability to update its internal state in a flexible and efficient manner. Unlike earlier models that used a fixed scalar decay, RWKV-7 uses a vector-valued, data-dependent decay, which allows each channel in the state to evolve independently. It also separates the concepts of removal and replacement keys, resulting in more precise control over which parts of the state are updated or overwritten.
While previous models with diagonal transition matrices were limited to representing functions in the TC⁰ complexity class, RWKV-7 can represent more complex functions and is proven to recognize all regular languages. A critical improvement enabling this is its ability to perform "copy" state transitions, a feature not supported by earlier delta rule implementations.
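As a toy illustration of why this matters: a diagonal-plus-rank-one (Householder-style) transition can exchange two state channels, which a purely diagonal decay can never do. The 2x2 example below is illustrative only and is not the paper's exact parameterization.

```python
import numpy as np

# A purely diagonal transition can only scale channels independently,
# but a diagonal-plus-rank-one (Householder) matrix can swap them.
k = np.array([1.0, -1.0]) / np.sqrt(2.0)
T = np.eye(2) - 2.0 * np.outer(k, k)   # Householder reflection about k
print(T)                               # [[0. 1.] [1. 0.]] -- a swap ("copy") matrix
print(np.array([3.0, 5.0]) @ T)        # [5. 3.] -- the two state channels are exchanged
```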
RWKV-7 also makes architectural improvements over RWKV-6. It replaces the previous diagonal transition matrix with the extended delta rule and simplifies the token shift and channel mixing modules. By removing the data dependency from token shift and simplifying the gating in channel mixing, RWKV-7 achieves faster training and inference. Additionally, it increases the use of low-rank projections to generate intermediate computations more efficiently, striking a balance between parameter count, speed, and downstream performance. These enhancements collectively contribute to RWKV-7’s improved modeling capability and efficiency.
The approach
RWKV-7 extends the use of low-rank MLPs to efficiently compute several of its per-token parameters, such as the decay and the in-context learning rate.
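A rough sketch of what such a low-rank projection looks like; the tanh activation and the optional bias are assumptions here, not details taken from the paper.

```python
import numpy as np

def low_rank_mlp(x, A, B, bias=None, act=np.tanh):
    """Low-rank ("LoRA-style") projection: act(x @ A) @ B (+ bias).

    A : (d_model, r), B : (r, d_out) with r << d_model, so the pair costs
    r * (d_model + d_out) parameters instead of a full d_model * d_out matrix.
    """
    out = act(x @ A) @ B
    return out if bias is None else out + bias
```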
The Weighted Key-Value (WKV) state is updated across time using a recurrent formula that combines decay, selective forgetting, and new value insertion. This is done per attention head. The transition matrix is a scaled approximation of a Householder matrix, allowing flexible state dynamics with eigenvalues in a stable range. The in-context learning rate and the decay serve as generalized mechanisms for memory writing and forgetting.
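Putting these pieces together, a minimal single-step, single-head sketch of the generalized (diagonal-plus-rank-one) state update could look like the following; the shapes and names are my own conventions, and normalization details are omitted.

```python
import numpy as np

def wkv_update(S, w, k_removal, a, k_replace, v):
    """Single-step, single-head sketch of the generalized delta-rule update.

    S         : (d_v, d_k) WKV state
    w         : (d_k,)     vector-valued, data-dependent decay in (0, 1)
    k_removal : (d_k,)     removal key, assumed unit-norm
    a         : (d_k,)     per-channel in-context learning rate in [0, 1]
    k_replace : (d_k,)     replacement key
    v         : (d_v,)     value to write
    """
    # Diagonal decay plus a rank-one term that erases what is currently
    # stored under the removal key, scaled per channel by `a`.
    transition = np.diag(w) - np.outer(k_removal, a * k_removal)
    return S @ transition + np.outer(v, k_replace)
```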
Once the WKV state is updated, the receptance vector acts similarly to a Transformer query and is used to extract information from the state. It lets the model directly attend to the current token input without requiring it to be stored in the state. The resulting output is normalized, combined across heads, gated, and passed through a linear layer to produce the final model output.
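A correspondingly rough sketch of the read-out path follows; the per-head normalization style and the gating are assumptions for this sketch, and the extra term that lets the model attend directly to the current token is omitted.

```python
import numpy as np

def wkv_readout(states, r, g, W_out, eps=1e-6):
    """Sketch of the read-out path for a stack of per-head states.

    states : (n_heads, d_v, d_k)      updated WKV states
    r      : (n_heads, d_k)           receptance (query-like) vectors
    g      : (n_heads * d_v,)         gate values, e.g. from a sigmoid (assumption)
    W_out  : (n_heads * d_v, d_model) final linear projection
    """
    y = np.einsum('hvk,hk->hv', states, r)                # query each head's state
    mu = y.mean(-1, keepdims=True)
    sigma = y.std(-1, keepdims=True)
    y = (y - mu) / (sigma + eps)                          # per-head normalization (assumption)
    return (y.reshape(-1) * g) @ W_out                    # gate, merge heads, project
```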
The MLP module of RWKV-7 differs from the previous RWKV architecture: the authors remove the gating matrix and use only a two-layer MLP. To compensate for the reduced parameter count, the hidden dimension is set to 4 times the model dimension.
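For reference, a sketch of this simplified two-layer MLP; the squared-ReLU activation follows earlier RWKV channel-mix designs and is an assumption here, and token shift is omitted.

```python
import numpy as np

def channel_mix(x, W_in, W_out):
    """Two-layer MLP without a gating matrix.

    W_in  : (d_model, 4 * d_model)   hidden dim is 4x the model dimension
    W_out : (4 * d_model, d_model)
    """
    h = np.maximum(x @ W_in, 0.0) ** 2   # squared ReLU (assumption)
    return h @ W_out
```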
Results
RWKV-7 models were evaluated using the LM Evaluation Harness on a wide range of English and multilingual benchmarks. Despite being trained on significantly fewer tokens, RWKV-7 matches the English performance of Qwen2.5 and shows major improvements over RWKV-6, especially in MMLU. The multilingual version, RWKV-7-World, outperforms other state-of-the-art models, while using far fewer training FLOPs.
Additionally, RWKV-7 was tested on temporally novel internet data, including recent arXiv papers, GitHub code, Wikipedia entries, fiction, and news—ensuring no training data leakage. Using compression rate as a metric, RWKV-7 Goose demonstrated competitive performance, reinforcing its generalization ability even with less training data.
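For context, compression rate on such data can be understood as bits per byte: the model's total negative log-likelihood converted to bits and divided by the raw size of the text, with lower being better. The definition below is a generic sketch, not necessarily the exact variant used in the paper's evaluation.

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Compression rate as bits per byte: total negative log-likelihood of a
    document (in nats) converted to bits, divided by its size in bytes."""
    return total_nll_nats / math.log(2) / n_bytes
```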
Speed
RWKV-7 kernels scale linearly with sequence length, while Flash Attention v3 scales quadratically. As a result, RWKV-7 becomes significantly faster for long sequences—up to 3x faster than RWKV-6 and much faster than Flash Attention v3 at large sequence lengths (16k tokens).
Multimodal Experiments
VisualRWKV-7 uses SigLIP, DINO, and a high-resolution SAM encoder.
When trained on the same dataset as LLaVA-1.5 (558k alignment samples and 665k SFT samples), VisualRWKV-7 achieves significantly better performance than VisualRWKV-6, despite using fewer parameters.