LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Today's paper introduces LLaMA-Omni, a new model architecture for low-latency and high-quality speech interaction with large language models (LLMs). The system can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. It eliminates the need for speech transcription and provides better responses in both content and style compared to previous speech-language models.
Method Overview
LLaMA-Omni consists of four main components: a speech encoder, a speech adaptor, a large language model (LLM), and a streaming speech decoder.
The speech encoder, based on the Whisper model, extracts meaningful representations from the input speech. The speech adaptor then maps these representations into the embedding space of the LLM, allowing it to understand the speech input directly.
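To make the adaptor's role concrete, here is a minimal numpy sketch: frame-level encoder features are downsampled by concatenating consecutive frames and then projected into the LLM's embedding space with a small MLP. All dimensions, the downsampling factor, and the weight names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def speech_adaptor(feats, W1, b1, W2, b2, k=5):
    """Map speech-encoder features into the LLM embedding space.

    feats: (T, d_enc) frame-level features from the speech encoder.
    Concatenates every k consecutive frames (k-fold downsampling to
    shorten the sequence), then applies a 2-layer MLP projection.
    Shapes and k are illustrative, not the paper's exact values.
    """
    T, d = feats.shape
    T = T - (T % k)                       # drop trailing frames so T divides by k
    x = feats[:T].reshape(T // k, k * d)  # (T//k, k*d_enc) concatenated frames
    h = np.maximum(x @ W1 + b1, 0.0)      # ReLU hidden layer
    return h @ W2 + b2                    # (T//k, d_llm) LLM-space embeddings
```

The key design point the sketch illustrates: downsampling shortens the speech sequence so it is closer in length to a text token sequence before it enters the LLM.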
The core of the system is the LLaMA-3.1-8B-Instruct model, which processes the adapted speech representations and generates text responses.

Simultaneously, a streaming speech decoder takes the output hidden states from the LLM and generates a sequence of discrete units corresponding to the speech response. This decoder operates in a non-autoregressive manner, allowing for parallel generation of speech output alongside the text.
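A rough sketch of this non-autoregressive step, assuming a greedy CTC-style decoding (upsample each hidden state, score all positions against a unit vocabulary in parallel, then merge repeats and drop blanks): the blank id, vocabulary size, and upsampling factor below are illustrative assumptions.

```python
import numpy as np

BLANK = 0  # assumed CTC blank id

def decode_units(hidden, W, lam=2):
    """Non-autoregressive discrete-unit prediction, CTC-style sketch.

    hidden: (T, d) LLM output hidden states.
    Each state is upsampled lam times, all positions are scored against
    a unit vocabulary in parallel, and a greedy CTC collapse (merge
    repeats, remove blanks) yields the discrete unit sequence.
    """
    up = np.repeat(hidden, lam, axis=0)  # (lam*T, d) upsampled states
    logits = up @ W                      # (lam*T, vocab) scored in parallel
    ids = logits.argmax(axis=1)          # greedy per-position prediction
    units, prev = [], BLANK
    for i in ids:                        # CTC collapse
        if i != prev and i != BLANK:
            units.append(int(i))
        prev = i
    return units
```

Because every position is predicted in one parallel pass rather than token by token, speech units can be emitted alongside the text stream, which is what keeps latency low.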
To train the model, they developed a two-stage process. In the first stage, they train the speech adaptor and fine-tune the LLM to process speech inputs. In the second stage, they train the speech decoder while keeping the other components frozen.
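The two stages can be summarized as a freezing schedule over the four components. A hypothetical sketch (True = receives gradient updates, False = frozen); treating the Whisper encoder as frozen throughout is an assumption drawn from the description above:

```python
# Which components are trainable in each training stage (sketch).
STAGES = {
    1: {"speech_encoder": False,   # encoder assumed frozen throughout
        "speech_adaptor": True,    # stage 1: learn the speech-to-LLM mapping
        "llm": True,               # stage 1: fine-tune the LLM on speech inputs
        "speech_decoder": False},
    2: {"speech_encoder": False,
        "speech_adaptor": False,   # stage 2: everything frozen...
        "llm": False,
        "speech_decoder": True},   # ...except the streaming speech decoder
}

def trainable_components(stage):
    """Return the names of components updated in the given stage."""
    return [name for name, train in STAGES[stage].items() if train]
```

In a framework like PyTorch this schedule would correspond to toggling `requires_grad` per component before each stage.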
To better align the model with speech interaction scenarios, they created a dataset called InstructS2S-200K, which includes 200,000 speech instructions and corresponding speech responses.
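For illustration, one training example in such a dataset might pair a spoken instruction with its text and speech responses. The field names and values below are purely hypothetical, not the paper's actual schema:

```python
# Hypothetical shape of one speech-instruction training example.
# All field names and values are illustrative, not InstructS2S-200K's schema.
example = {
    "speech_instruction": "instruction_audio.wav",        # spoken prompt
    "text_instruction": "What is the capital of France?", # its transcript
    "text_response": "The capital of France is Paris.",   # target text
    "speech_response_units": [412, 87, 963],              # target discrete units
}

# A speech-interaction example supervises both output channels at once:
text_target = example["text_response"]
unit_target = example["speech_response_units"]
```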
Results
LLaMA-Omni achieved strong results: it generates text and speech responses simultaneously with very low latency, and it outperforms previous speech-language models in both the content and the style of its responses.
Conclusion
LLaMA-Omni represents a notable advance in speech interaction with large language models. By enabling direct speech-to-speech generation with extremely low latency, it paves the way for more natural and efficient human-AI interactions. For more information, please consult the full paper.
Congrats to the authors for their work!
Fang, Qingkai, et al. "LLaMA-Omni: Seamless Speech Interaction with Large Language Models." arXiv preprint arXiv:2409.06666 (2024).