Paper Review: RWKV-7 "Goose" with Expressive Dynamic State Evolution
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
RWKV-7 "Goose" is a new sequence modeling architecture that achieves state-of-the-art performance on multilingual tasks at the 3B parameter scale and matches top English models, despite using far fewer training tokens. It requires only constant memory and constant inference time per token. The architecture introduces a generalized delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. RWKV-7 can perform state tracking, recognize all regular languages, and remains parallelizable during training, surpassing the capabilities of Transformers under standard complexity-theoretic assumptions.
Background information on the delta rule
Linear attention is more efficient than softmax attention, offering constant time and memory per token, but it struggles with state accumulation: old information is never fully removed, leading to degraded output over time. Modern architectures like RWKV-6 and Mamba 2 address this using decay, but decay can’t selectively remove specific values.
DeltaNet solves this by using the Delta Rule, which updates memory on a per-key basis—replacing old values with new ones using an error-correcting mechanism similar to online learning. It treats memory updates as a form of stochastic gradient descent, improving retrieval accuracy while avoiding uncontrolled state growth.
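To make this concrete, here is a minimal NumPy sketch of a single delta-rule step in the spirit of DeltaNet; the state layout (value-by-key matrix) and the variable names are my own choices rather than the paper's notation.

```python
import numpy as np

def delta_rule_update(S, k, v, beta):
    """One step of the classic delta rule (DeltaNet-style), as a sketch.

    S    : (d_v, d_k) memory matrix
    k    : (d_k,)     key, assumed unit-norm
    v    : (d_v,)     new value to associate with k
    beta : float      in-context learning rate in [0, 1]
    """
    v_old = S @ k                             # value currently stored under k
    return S + beta * np.outer(v - v_old, k)  # move it a fraction beta toward v
```

For a unit-norm key this is algebraically the same as S(I − β k kᵀ) + β v kᵀ, i.e. one SGD step on the squared retrieval error ‖S k − v‖², which is the online-learning view described above.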
Architecture
RWKV-7 develops a more expressive version of the delta update rule by generalizing it into a diagonal plus rank one update. This new formulation enhances the model’s ability to update its internal state in a flexible and efficient manner. Unlike earlier models that used a fixed scalar decay, RWKV-7 uses a vector-valued, data-dependent decay, which allows each channel in the state to evolve independently. It also separates the concepts of removal and replacement keys, resulting in more precise control over which parts of the state are updated or overwritten.
While previous models with diagonal transition matrices were limited to representing functions in the TC⁰ complexity class, RWKV-7 can represent more complex functions and is proven to recognize all regular languages. A critical improvement enabling this is its ability to perform "copy" state transitions, a feature not supported by earlier delta rule implementations.
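As a toy illustration of why this matters: a diagonal-plus-rank-one (Householder-style) transition can exchange two state channels, which a purely diagonal decay can never do. The 2x2 example below is illustrative only and is not the paper's exact parameterization.

```python
import numpy as np

# A purely diagonal transition can only scale channels independently,
# but a diagonal-plus-rank-one (Householder) matrix can swap them.
k = np.array([1.0, -1.0]) / np.sqrt(2.0)
T = np.eye(2) - 2.0 * np.outer(k, k)   # Householder reflection about k
print(T)                               # [[0. 1.] [1. 0.]] -- a swap ("copy") matrix
print(np.array([3.0, 5.0]) @ T)        # [5. 3.] -- the two state channels are exchanged
```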
RWKV-7 also makes architectural improvements over RWKV-6. It replaces the previous diagonal transition matrix with the extended delta rule and simplifies the token shift and channel mixing modules. By removing the data dependency from token shift and simplifying the gating in channel mixing, RWKV-7 achieves faster training and inference. Additionally, it increases the use of low-rank projections to generate intermediate computations more efficiently, striking a balance between parameter count, speed, and downstream performance. These enhancements collectively contribute to RWKV-7’s improved modeling capability and efficiency.
The approach
RWKV-7 extends the use of low-rank MLPs to efficiently compute several of its per-token parameters, such as the decay and the in-context learning rate.
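A rough sketch of what such a low-rank projection looks like; the tanh activation and the optional bias are assumptions here, not details taken from the paper.

```python
import numpy as np

def low_rank_mlp(x, A, B, bias=None, act=np.tanh):
    """Low-rank ("LoRA-style") projection: act(x @ A) @ B (+ bias).

    A : (d_model, r), B : (r, d_out) with r << d_model, so the pair costs
    r * (d_model + d_out) parameters instead of a full d_model * d_out matrix.
    """
    out = act(x @ A) @ B
    return out if bias is None else out + bias
```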
The Weighted Key-Value (WKV) state is updated across time using a recurrent formula that combines decay, selective forgetting, and new value insertion. This is done per attention head. The transition matrix is a scaled approximation of a Householder matrix, allowing flexible state dynamics with eigenvalues in a stable range. The in-context learning rate and the decay serve as generalized mechanisms for memory writing and forgetting.
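Putting these pieces together, a minimal single-step, single-head sketch of the generalized (diagonal-plus-rank-one) state update could look like the following; the shapes and names are my own conventions, and normalization details are omitted.

```python
import numpy as np

def wkv_update(S, w, k_removal, a, k_replace, v):
    """Single-step, single-head sketch of the generalized delta-rule update.

    S         : (d_v, d_k) WKV state
    w         : (d_k,)     vector-valued, data-dependent decay in (0, 1)
    k_removal : (d_k,)     removal key, assumed unit-norm
    a         : (d_k,)     per-channel in-context learning rate in [0, 1]
    k_replace : (d_k,)     replacement key
    v         : (d_v,)     value to write
    """
    # Diagonal decay plus a rank-one term that erases what is currently
    # stored under the removal key, scaled per channel by `a`.
    transition = np.diag(w) - np.outer(k_removal, a * k_removal)
    return S @ transition + np.outer(v, k_replace)
```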
Once the WKV state is updated, the receptance vector acts similarly to a Transformer query and is used to extract information from the state. It lets the model directly attend to the current token input without requiring it to be stored in the state. The resulting output is normalized, combined across heads, gated, and passed through a linear layer to produce the final model output.
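A correspondingly rough sketch of the read-out path follows; the per-head normalization style and the gating are assumptions for this sketch, and the extra term that lets the model attend directly to the current token is omitted.

```python
import numpy as np

def wkv_readout(states, r, g, W_out, eps=1e-6):
    """Sketch of the read-out path for a stack of per-head states.

    states : (n_heads, d_v, d_k)      updated WKV states
    r      : (n_heads, d_k)           receptance (query-like) vectors
    g      : (n_heads * d_v,)         gate values, e.g. from a sigmoid (assumption)
    W_out  : (n_heads * d_v, d_model) final linear projection
    """
    y = np.einsum('hvk,hk->hv', states, r)                # query each head's state
    mu = y.mean(-1, keepdims=True)
    sigma = y.std(-1, keepdims=True)
    y = (y - mu) / (sigma + eps)                          # per-head normalization (assumption)
    return (y.reshape(-1) * g) @ W_out                    # gate, merge heads, project
```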
The MLP module of RWKV-7 differs from the previous RWKV architecture: the authors remove the gating matrix and use only a two-layer MLP. To compensate for the reduced parameter count, the hidden dimension is set to 4 times the model dimension.
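For reference, a sketch of this simplified two-layer MLP; the squared-ReLU activation follows earlier RWKV channel-mix designs and is an assumption here, and token shift is omitted.

```python
import numpy as np

def channel_mix(x, W_in, W_out):
    """Two-layer MLP without a gating matrix.

    W_in  : (d_model, 4 * d_model)   hidden dim is 4x the model dimension
    W_out : (4 * d_model, d_model)
    """
    h = np.maximum(x @ W_in, 0.0) ** 2   # squared ReLU (assumption)
    return h @ W_out
```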
Results
RWKV-7 models were evaluated using the LM Evaluation Harness on a wide range of English and multilingual benchmarks. Despite being trained on significantly fewer tokens, RWKV-7 matches the English performance of Qwen2.5 and shows major improvements over RWKV-6, especially in MMLU. The multilingual version, RWKV-7-World, outperforms other state-of-the-art models, while using far fewer training FLOPs.
Additionally, RWKV-7 was tested on temporally novel internet data, including recent arXiv papers, GitHub code, Wikipedia entries, fiction, and news—ensuring no training data leakage. Using compression rate as a metric, RWKV-7 Goose demonstrated competitive performance, reinforcing its generalization ability even with less training data.
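For context, compression rate on such data can be understood as bits per byte: the model's total negative log-likelihood converted to bits and divided by the raw size of the text, with lower being better. The definition below is a generic sketch, not necessarily the exact variant used in the paper's evaluation.

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Compression rate as bits per byte: total negative log-likelihood of a
    document (in nats) converted to bits, divided by its size in bytes."""
    return total_nll_nats / math.log(2) / n_bytes
```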
Speed
RWKV-7 kernels scale linearly with sequence length, while Flash Attention v3 scales quadratically. As a result, RWKV-7 becomes significantly faster for long sequences—up to 3x faster than RWKV-6 and much faster than Flash Attention v3 at large sequence lengths (16k tokens).
Multimodal Experiments
VisualRWKV-7 uses SigLIP, DINO, and a high-resolution SAM encoder.
When trained on the same dataset as LLaVA-1.5 (558k alignment samples and 665k SFT samples), VisualRWKV-7 achieves significantly better performance than VisualRWKV-6, despite using fewer parameters.