Qwen2.5-Omni Technical Report
Credit: https://arxiv.org/pdf/2503.20215

Today's paper introduces Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities including text, images, audio, and video while simultaneously generating text and natural speech responses in a streaming manner. The model uses a novel architecture called Thinker-Talker, where the Thinker processes multimodal inputs and generates text, while the Talker produces speech based on the Thinker's representations.

Method Overview

Qwen2.5-Omni employs a Thinker-Talker architecture to process multimodal inputs and generate both text and speech outputs. The Thinker functions like a brain: it processes and understands inputs from the text, audio, and video modalities, generating high-level representations and the corresponding text. The Talker operates like a human mouth: it takes the high-level representations and text produced by the Thinker in a streaming manner and fluidly outputs discrete speech tokens.
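
To make this division of labor concrete, below is a minimal, hypothetical sketch of the Thinker-Talker data flow in PyTorch. The class names, layer sizes, and the use of a generic Transformer stack in place of the actual decoder-only LLM and speech decoder are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class Thinker(nn.Module):
    # Stand-in for the decoder-only LLM: consumes fused multimodal embeddings and
    # emits text logits plus the high-level hidden states that condition the Talker.
    def __init__(self, d_model=64, vocab_size=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, multimodal_embeds):           # (batch, seq, d_model)
        hidden = self.backbone(multimodal_embeds)   # high-level representations
        return hidden, self.lm_head(hidden)         # hidden states + text logits

class Talker(nn.Module):
    # Stand-in for the speech decoder: conditions on the Thinker's hidden states
    # and on embeddings of the sampled text tokens, then predicts discrete
    # speech-codec tokens.
    def __init__(self, d_model=64, codec_vocab=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.codec_head = nn.Linear(d_model, codec_vocab)

    def forward(self, thinker_hidden, text_embeds):
        context = torch.cat([thinker_hidden, text_embeds], dim=1)  # simple concatenation
        return self.codec_head(self.backbone(context))             # speech-codec logits

# One (non-streaming) step on dummy inputs, just to show how the tensors flow.
thinker, talker = Thinker(), Talker()
mm_embeds = torch.randn(1, 16, 64)                  # fused text/audio/vision embeddings
hidden, text_logits = thinker(mm_embeds)
text_embeds = torch.randn(1, 4, 64)                 # embeddings of sampled text tokens
codec_logits = talker(hidden, text_embeds)          # later decoded to waveform by the codec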

For processing multimodal inputs, the model uses specialized encoders for different modalities. Text is tokenized using Qwen's tokenizer, audio is transformed into mel-spectrograms, and images/videos are processed using a Vision Transformer (ViT) model. To synchronize the timestamps of video inputs with audio, the paper introduces a new position embedding approach called Time-aligned Multimodal RoPE (TMRoPE), which encodes 3D positional information of multimodal inputs. This allows the model to effectively align audio and visual information in time.
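
A hedged sketch of that time-alignment idea follows: every token carries a (temporal, height, width) position triple, audio frames advance only along the temporal axis, and all patches of one video frame share that frame's temporal ID. The 40 ms granularity per temporal ID and the helper names are assumptions made for illustration, not the exact TMRoPE specification.

MS_PER_TEMPORAL_ID = 40  # assumed time span represented by one temporal position ID

def audio_position_ids(num_frames, frame_ms=40, start_ms=0):
    # Audio frames advance only along the temporal axis; the spatial axes stay at 0.
    return [((start_ms + i * frame_ms) // MS_PER_TEMPORAL_ID, 0, 0)
            for i in range(num_frames)]

def video_frame_position_ids(timestamp_ms, grid_h, grid_w):
    # All patches of one video frame share the frame's temporal ID and are
    # distinguished by their (height, width) indices.
    t = timestamp_ms // MS_PER_TEMPORAL_ID
    return [(t, h, w) for h in range(grid_h) for w in range(grid_w)]

# A frame at 80 ms with a 2x2 patch grid, and three audio frames starting at 0 ms:
print(video_frame_position_ids(80, 2, 2))  # [(2, 0, 0), (2, 0, 1), (2, 1, 0), (2, 1, 1)]
print(audio_position_ids(3))               # [(0, 0, 0), (1, 0, 0), (2, 0, 0)]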

To enable streaming processing of multimodal information, both audio and visual encoders utilize a block-wise processing approach. This strategy decouples the handling of long sequences of multimodal data, assigning perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to the large language model. For video with audio, the model uses a time-interleaving method that segments the representation into chunks and arranges visual representation at the front and audio representation at the back within each chunk.
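
The interleaving itself is straightforward; the sketch below assumes 2-second chunks and represents each encoder output as a (timestamp, token) pair, both of which are illustrative simplifications rather than the paper's exact data layout.

def interleave_av(video_tokens, audio_tokens, chunk_ms=2000):
    # video_tokens / audio_tokens: lists of (timestamp_ms, token) pairs.
    # Within each fixed-length chunk, visual tokens are placed before audio tokens.
    if not video_tokens and not audio_tokens:
        return []
    horizon = max(t for t, _ in video_tokens + audio_tokens)
    sequence = []
    for start in range(0, horizon + 1, chunk_ms):
        end = start + chunk_ms
        sequence += [tok for t, tok in video_tokens if start <= t < end]  # video first
        sequence += [tok for t, tok in audio_tokens if start <= t < end]  # then audio
    return sequence

# Example: two chunks of interleaved tokens.
video = [(0, "v0"), (1000, "v1"), (2500, "v2")]
audio = [(0, "a0"), (1500, "a1"), (2500, "a2")]
print(interleave_av(video, audio))  # ['v0', 'v1', 'a0', 'a1', 'v2', 'a2']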

For speech generation, the Talker receives both high-level representations and embeddings of the text tokens sampled by the Thinker. The model uses an efficient speech codec named qwen-tts-tokenizer that can be decoded to speech in a streaming manner through a causal audio decoder. To facilitate streaming audio generation, the paper implements a sliding window block attention mechanism in the DiT (Diffusion Transformer) model that restricts the current token's access to a limited context, reducing initial latency.
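
As a rough illustration of that attention restriction, the sketch below builds a block-wise sliding-window mask with NumPy; the block size and the two-blocks-back / one-block-ahead window are assumed values chosen for demonstration, not necessarily the model's actual configuration.

import numpy as np

def block_sliding_window_mask(seq_len, block_size=4, lookback_blocks=2, lookahead_blocks=1):
    # Each token may only attend to tokens whose block index lies within a fixed
    # window around its own block, which bounds the context needed before the
    # first audio packet can be produced.
    blocks = np.arange(seq_len) // block_size      # block index of every token
    diff = blocks[None, :] - blocks[:, None]       # key block minus query block
    return (diff >= -lookback_blocks) & (diff <= lookahead_blocks)

mask = block_sliding_window_mask(seq_len=12)
print(mask.astype(int))  # 12x12 matrix; 1 marks permitted attention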

Results

Qwen2.5-Omni shows strong performance across all modalities when benchmarked against similarly sized single-modality models. For text-to-text tasks, it performs between Qwen2-7B and Qwen2.5-7B on most benchmarks, outperforming Qwen2-7B on MMLU-Pro, MMLU-redux, MATH, GSM8K, MBPP, MultiPL-E, and LiveCodeBench.

In audio-to-text tasks, Qwen2.5-Omni delivers performance that is better than or comparable to other state-of-the-art methods on audio understanding, achieving superior ASR (automatic speech recognition) and S2TT (speech-to-text translation) results across various test sets. On VoiceBench, it achieves an impressive average score of 74.12, surpassing other audio language models and omni models of similar size.

For image-to-text tasks, Qwen2.5-Omni demonstrates performance comparable to Qwen2.5-VL-7B and attains better results on MMMU, MathVision, MMBench-V1.1-EN, TextVQA, DocVQA, and ChartQA than any other open-source omni model. It also surpasses GPT-4o-mini on most benchmarks.

In video-to-text tasks, Qwen2.5-Omni outperforms all other state-of-the-art open-source omni models as well as GPT-4o-mini, and attains results that are better than or competitive with those of Qwen2.5-VL-7B.

For multimodal understanding, Qwen2.5-Omni achieves state-of-the-art performance on OmniBench, surpassing other omni models by a large margin with scores of 55.25% for Speech, 60.00% for Sound Event, and 52.83% for Music, for an average of 56.13%.

In speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness, achieving WER (word error rate) of 1.42%, 2.33%, and 6.54% on the seed-tts-eval test-zh, test-en, and test-hard sets, respectively.

Conclusion

Qwen2.5-Omni represents a significant advancement in multimodal AI, offering a unified model capable of processing multiple modalities and generating both text and speech outputs in real time. For more information, please consult the full paper.

Congrats to the authors for their work!

Qwen Team. "Qwen2.5-Omni Technical Report." arXiv preprint arXiv:2503.20215 (2025).
