Qwen2.5-Omni Technical Report
Credit: https://arxiv.org/pdf/2503.20215

Today's paper introduces Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities including text, images, audio, and video while simultaneously generating text and natural speech responses in a streaming manner. The model uses a novel architecture called Thinker-Talker, where the Thinker processes multimodal inputs and generates text, while the Talker produces speech based on the Thinker's representations.

Method Overview

Qwen2.5-Omni employs a Thinker-Talker architecture to process multimodal inputs and generate both text and speech outputs. The Thinker functions like a brain: it processes and understands inputs from the text, audio, and video modalities, generating high-level representations and the corresponding text. The Talker operates like a human mouth: it takes the high-level representations and text produced by the Thinker in a streaming manner and fluidly outputs discrete speech tokens.
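
To make this division of labor concrete, below is a minimal, hypothetical sketch of the Thinker-Talker data flow in PyTorch. The class names, layer sizes, and the use of a generic Transformer stack in place of the actual decoder-only LLM and speech decoder are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class Thinker(nn.Module):
    # Stand-in for the decoder-only LLM: consumes fused multimodal embeddings and
    # emits text logits plus the high-level hidden states that condition the Talker.
    def __init__(self, d_model=64, vocab_size=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, multimodal_embeds):           # (batch, seq, d_model)
        hidden = self.backbone(multimodal_embeds)   # high-level representations
        return hidden, self.lm_head(hidden)         # hidden states + text logits

class Talker(nn.Module):
    # Stand-in for the speech decoder: conditions on the Thinker's hidden states
    # and on embeddings of the sampled text tokens, then predicts discrete
    # speech-codec tokens.
    def __init__(self, d_model=64, codec_vocab=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.codec_head = nn.Linear(d_model, codec_vocab)

    def forward(self, thinker_hidden, text_embeds):
        context = torch.cat([thinker_hidden, text_embeds], dim=1)  # simple concatenation
        return self.codec_head(self.backbone(context))             # speech-codec logits

# One (non-streaming) step on dummy inputs, just to show how the tensors flow.
thinker, talker = Thinker(), Talker()
mm_embeds = torch.randn(1, 16, 64)                  # fused text/audio/vision embeddings
hidden, text_logits = thinker(mm_embeds)
text_embeds = torch.randn(1, 4, 64)                 # embeddings of sampled text tokens
codec_logits = talker(hidden, text_embeds)          # later decoded to waveform by the codec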

For processing multimodal inputs, the model uses specialized encoders for different modalities. Text is tokenized using Qwen's tokenizer, audio is transformed into mel-spectrograms, and images/videos are processed using a Vision Transformer (ViT) model. To synchronize the timestamps of video inputs with audio, the paper introduces a new position embedding approach called Time-aligned Multimodal RoPE (TMRoPE), which encodes 3D positional information of multimodal inputs. This allows the model to effectively align audio and visual information in time.
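
A hedged sketch of that time-alignment idea follows: every token carries a (temporal, height, width) position triple, audio frames advance only along the temporal axis, and all patches of one video frame share that frame's temporal ID. The 40 ms granularity per temporal ID and the helper names are assumptions made for illustration, not the exact TMRoPE specification.

MS_PER_TEMPORAL_ID = 40  # assumed time span represented by one temporal position ID

def audio_position_ids(num_frames, frame_ms=40, start_ms=0):
    # Audio frames advance only along the temporal axis; the spatial axes stay at 0.
    return [((start_ms + i * frame_ms) // MS_PER_TEMPORAL_ID, 0, 0)
            for i in range(num_frames)]

def video_frame_position_ids(timestamp_ms, grid_h, grid_w):
    # All patches of one video frame share the frame's temporal ID and are
    # distinguished by their (height, width) indices.
    t = timestamp_ms // MS_PER_TEMPORAL_ID
    return [(t, h, w) for h in range(grid_h) for w in range(grid_w)]

# A frame at 80 ms with a 2x2 patch grid, and three audio frames starting at 0 ms:
print(video_frame_position_ids(80, 2, 2))  # [(2, 0, 0), (2, 0, 1), (2, 1, 0), (2, 1, 1)]
print(audio_position_ids(3))               # [(0, 0, 0), (1, 0, 0), (2, 0, 0)]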

To enable streaming processing of multimodal information, both audio and visual encoders utilize a block-wise processing approach. This strategy decouples the handling of long sequences of multimodal data, assigning perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to the large language model. For video with audio, the model uses a time-interleaving method that segments the representation into chunks and arranges visual representation at the front and audio representation at the back within each chunk.
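
The interleaving itself is straightforward; the sketch below assumes 2-second chunks and represents each encoder output as a (timestamp, token) pair, both of which are illustrative simplifications rather than the paper's exact data layout.

def interleave_av(video_tokens, audio_tokens, chunk_ms=2000):
    # video_tokens / audio_tokens: lists of (timestamp_ms, token) pairs.
    # Within each fixed-length chunk, visual tokens are placed before audio tokens.
    if not video_tokens and not audio_tokens:
        return []
    horizon = max(t for t, _ in video_tokens + audio_tokens)
    sequence = []
    for start in range(0, horizon + 1, chunk_ms):
        end = start + chunk_ms
        sequence += [tok for t, tok in video_tokens if start <= t < end]  # video first
        sequence += [tok for t, tok in audio_tokens if start <= t < end]  # then audio
    return sequence

# Example: two chunks of interleaved tokens.
video = [(0, "v0"), (1000, "v1"), (2500, "v2")]
audio = [(0, "a0"), (1500, "a1"), (2500, "a2")]
print(interleave_av(video, audio))  # ['v0', 'v1', 'a0', 'a1', 'v2', 'a2']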

For speech generation, the Talker receives both high-level representations and embeddings of the text tokens sampled by the Thinker. The model uses an efficient speech codec named qwen-tts-tokenizer that can be decoded to speech in a streaming manner through a causal audio decoder. To facilitate streaming audio generation, the paper implements a sliding window block attention mechanism in the DiT (Diffusion Transformer) model that restricts the current token's access to a limited context, reducing initial latency.
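
As a rough illustration of that attention restriction, the sketch below builds a block-wise sliding-window mask with NumPy; the block size and the two-blocks-back / one-block-ahead window are assumed values chosen for demonstration, not necessarily the model's actual configuration.

import numpy as np

def block_sliding_window_mask(seq_len, block_size=4, lookback_blocks=2, lookahead_blocks=1):
    # Each token may only attend to tokens whose block index lies within a fixed
    # window around its own block, which bounds the context needed before the
    # first audio packet can be produced.
    blocks = np.arange(seq_len) // block_size      # block index of every token
    diff = blocks[None, :] - blocks[:, None]       # key block minus query block
    return (diff >= -lookback_blocks) & (diff <= lookahead_blocks)

mask = block_sliding_window_mask(seq_len=12)
print(mask.astype(int))  # 12x12 matrix; 1 marks permitted attention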

Results

Qwen2.5-Omni shows strong performance across all modalities when benchmarked against similarly sized single-modality models. For text-to-text tasks, it performs between Qwen2-7B and Qwen2.5-7B on most benchmarks, outperforming Qwen2-7B on MMLU-Pro, MMLU-redux, MATH, GSM8K, MBPP, MultiPL-E, and LiveCodeBench.

In audio-to-text tasks, Qwen2.5-Omni delivers performance that is better than or comparable to other state-of-the-art methods on audio understanding, achieving superior ASR (automatic speech recognition) and S2TT (speech-to-text translation) results across various test sets. On VoiceBench, it achieves an impressive average score of 74.12, surpassing other audio language models and omni models of similar size.

For image-to-text tasks, Qwen2.5-Omni demonstrates performance comparable to Qwen2.5-VL-7B and attains better results on MMMU, MathVision, MMBench-V1.1-EN, TextVQA, DocVQA, and ChartQA than any other open-source omni model. It also surpasses GPT-4o-mini on most benchmarks.

In video-to-text tasks, Qwen2.5-Omni outperforms all other state-of-the-art open-source omni models as well as GPT-4o-mini, and attains results that are better than or competitive with those of Qwen2.5-VL-7B.

For multimodal understanding, Qwen2.5-Omni achieves state-of-the-art performance on OmniBench, surpassing other omni models by a large margin with scores of 55.25% for Speech, 60.00% for Sound Event, and 52.83% for Music, for an average of 56.13%.

In speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness, achieving WER (word error rate) of 1.42%, 2.33%, and 6.54% on the seed-tts-eval test-zh, test-en, and test-hard sets, respectively.

Conclusion

Qwen2.5-Omni represents a significant advancement in multimodal AI, offering a unified model capable of processing multiple modalities and generating both text and speech outputs in real time. For more information, please consult the full paper.

Congrats to the authors for their work!

Qwen Team. "Qwen2.5-Omni Technical Report." arXiv preprint arXiv:2503.20215 (2025).
