Paper Review: Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
A novel approach adapts a pre-trained LLM to spoken question answering and speech continuation by attaching a pre-trained speech encoder, enabling the model to accept speech as input and produce speech as output. The model is trained end-to-end and operates directly on spectrograms. The training objective jointly supervises speech recognition, text continuation, and speech synthesis from paired speech-text data, yielding a “cross-modal” chain of thought within a single decoding pass. The method outperforms existing spoken language models at preserving speaker characteristics and semantic coherence, and it better retains the original LLM’s knowledge in spoken QA tasks.
Approach
Architecture. The encoder converts a speech utterance into continuous linguistic features, which serve as a prefix for the pre-trained language decoder. The model is optimized to minimize both a cross-entropy loss (for speech recognition and transcript continuation) and a novel reconstruction loss (for speech continuation). During inference, a speech prompt is encoded and then decoded to produce both text and speech continuations.
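To make this flow concrete, here is a minimal PyTorch sketch of the pipeline. All module and dimension names are hypothetical stand-ins (simple linear layers in place of the real Conformer encoder and PaLM 2-style decoder); only the wiring mirrors the description above.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
N_MELS, ENC_DIM, LM_DIM, VOCAB = 128, 1024, 1536, 32000

speech_encoder = nn.Sequential(            # stand-in for the Conformer encoder
    nn.Linear(N_MELS, ENC_DIM), nn.ReLU(), nn.Linear(ENC_DIM, ENC_DIM))
to_lm_dim = nn.Linear(ENC_DIM, LM_DIM)     # projection to the LM embedding width
lm_decoder = nn.TransformerEncoder(        # stand-in for the prefix LM decoder
    nn.TransformerEncoderLayer(LM_DIM, nhead=8, batch_first=True), num_layers=2)
text_head = nn.Linear(LM_DIM, VOCAB)       # token logits: recognition + continuation
spec_head = nn.Linear(LM_DIM, N_MELS)      # spectrogram frames for continuation

spec_prompt = torch.randn(1, 200, N_MELS)          # a 200-frame speech prompt
prefix = to_lm_dim(speech_encoder(spec_prompt))    # continuous prefix features
hidden = lm_decoder(prefix)                        # one shared decoding stack
text_logits = text_head(hidden)                    # supervised with cross-entropy
spec_frames = spec_head(hidden)                    # supervised with reconstruction loss
```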
Input pre-processing. The model is trained on supervised speech utterances: paired speech spectrograms and transcripts. Each spectrogram is split into a prompt segment, fed to the speech encoder, and a continuation segment, used as the target of the spectrogram reconstruction loss. SpecAugment provides data augmentation. The transcript is divided at the matching point, with a mapping function aligning speech features to text tokens.
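A toy sketch of how one training example might be split, assuming a hypothetical frame-level `frame_to_token` alignment array (the paper uses a mapping function between speech features and text tokens; its exact form is not reproduced here):

```python
import torch

def split_example(spectrogram, token_ids, frame_to_token, split_frame):
    """Split one paired (spectrogram, transcript) example at `split_frame`.

    `frame_to_token` is a hypothetical alignment giving, for each spectrogram
    frame, the index of the transcript token active at that frame.
    """
    prompt_spec = spectrogram[:split_frame]        # goes to the speech encoder
    cont_spec = spectrogram[split_frame:]          # reconstruction-loss target
    split_token = int(frame_to_token[split_frame]) # aligned text boundary
    prompt_text = token_ids[:split_token]          # transcript of the prompt
    cont_text = token_ids[split_token:]            # transcript of the continuation
    return prompt_spec, cont_spec, prompt_text, cont_text

# Toy usage: 10 frames, 4 tokens, each token spanning 2-3 frames.
spec = torch.randn(10, 128)
tokens = torch.tensor([7, 12, 5, 9])
alignment = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
parts = split_example(spec, tokens, alignment, split_frame=5)
```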
Speech encoder. A 600M-parameter Conformer encoder, pre-trained on 12M hours of speech, processes the input spectrograms and produces representations that carry both linguistic and acoustic detail. The input is first subsampled and then passed through Conformer blocks, each consisting of a feed-forward layer, self-attention, convolution, and a second feed-forward layer. The output is projected to the language model’s embedding dimension.
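For reference, a simplified Conformer block in PyTorch, following the feed-forward / self-attention / convolution / feed-forward layout described above; the dimensions and kernel size are illustrative, and details such as relative positional encoding are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    """Simplified Conformer block: half-step feed-forward, self-attention,
    convolution module, half-step feed-forward."""

    def __init__(self, dim=512, heads=8, kernel=31):
        super().__init__()
        self.ff1 = self._ff(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)     # pointwise conv, feeds the GLU
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)         # pointwise conv
        self.ff2 = self._ff(dim)
        self.out_norm = nn.LayerNorm(dim)

    @staticmethod
    def _ff(dim):
        return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                             nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                          # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)                  # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]              # self-attention
        y = self.conv_norm(x).transpose(1, 2)      # (batch, dim, time) for Conv1d
        y = F.glu(self.pw1(y), dim=1)              # gated linear unit
        y = self.pw2(F.silu(self.bn(self.dw(y))))  # depthwise conv module
        x = x + y.transpose(1, 2)
        x = x + 0.5 * self.ff2(x)                  # second half-step feed-forward
        return self.out_norm(x)

features = ConformerBlock()(torch.randn(2, 100, 512))   # -> (2, 100, 512)
```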
Language model. The model uses prefix-decoder language models with either 350M or 1B parameters, trained in the same way as PaLM 2. The LM receives the encoded features of the speech prompt as a prefix; the speech encoder and LM decoder are connected only at this point, with no cross-attention. This late-integration approach, mirroring recent advances in ASR, has been shown to improve performance, suggesting that additional connecting layers are unnecessary when the representations are already strong. During training, the decoder is teacher-forced to predict the text transcription, the text continuation, and the speech embeddings. Lightweight modules convert between speech embeddings and spectrograms, so speech synthesis benefits from both the LM’s pre-training and its intermediate text reasoning.
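A sketch of the teacher-forced training pass, with a tiny Transformer standing in for the 350M/1B-parameter LM; the prefix-style attention mask (bidirectional over the speech prefix, causal over the text) is my assumption of how such a prefix LM is typically masked:

```python
import torch
import torch.nn as nn

LM_DIM, VOCAB = 1536, 32000        # illustrative sizes only
embed = nn.Embedding(VOCAB, LM_DIM)
layer = nn.TransformerEncoderLayer(LM_DIM, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for the LM stack

def teacher_forced_pass(speech_prefix, target_tokens):
    """speech_prefix: (B, Tp, LM_DIM) projected encoder features (the prefix).
    target_tokens: (B, Tt) ground-truth transcription + continuation ids."""
    x = torch.cat([speech_prefix, embed(target_tokens)], dim=1)
    T, Tp = x.size(1), speech_prefix.size(1)
    # Prefix-LM attention: causal over the text suffix, while every position
    # may attend to the whole speech prefix (True = blocked).
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    mask[:, :Tp] = False
    return decoder(x, mask=mask)

hidden = teacher_forced_pass(torch.randn(1, 200, LM_DIM),
                             torch.randint(0, VOCAB, (1, 50)))
```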
Acoustic projection layers. A multi-layer perceptron adapts the LM decoder to speech features by compressing continuation spectrogram frames into the LM dimension; this bottleneck aids decoding and prevents repetitive predictions. A second multi-layer perceptron maps the decoder outputs back to the spectrogram dimension.
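A minimal sketch of these two projection MLPs; the bottleneck width and layer sizes are assumptions:

```python
import torch.nn as nn

N_MELS, LM_DIM, BOTTLENECK = 128, 1536, 32   # bottleneck width is an assumption

# Pre-net: squeezes continuation spectrogram frames through a narrow
# bottleneck before the LM, so the decoder cannot simply copy its input;
# this is what curbs repetitive frame predictions.
pre_net = nn.Sequential(nn.Linear(N_MELS, BOTTLENECK), nn.ReLU(),
                        nn.Linear(BOTTLENECK, LM_DIM))

# Post-net: maps LM hidden states back out to the spectrogram dimension.
post_net = nn.Sequential(nn.Linear(LM_DIM, LM_DIM), nn.ReLU(),
                         nn.Linear(LM_DIM, N_MELS))
```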
Training objective. The model is trained with two loss functions: a cross-entropy loss (for speech recognition and transcript continuation) and a regression loss (for speech continuation).
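In code, the combined objective might look like the sketch below; the L1 + L2 composition, the derivative term, and the equal weighting of all terms are assumptions, consistent with the “spectrogram derivative loss” ablated in the Experiments section:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target):
    """Regression term for speech continuation: L1 + L2 over the spectrogram
    and over its first time-derivative. Equal weighting is an assumption."""
    def l1_l2(a, b):
        return F.l1_loss(a, b) + F.mse_loss(a, b)
    loss = l1_l2(pred, target)
    loss = loss + l1_l2(pred[:, 1:] - pred[:, :-1],
                        target[:, 1:] - target[:, :-1])
    return loss

def total_loss(text_logits, text_targets, spec_pred, spec_target):
    # Cross-entropy over recognition + transcript-continuation tokens.
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    return ce + reconstruction_loss(spec_pred, spec_target)
```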
Inference. At inference time, the model encodes the speech prompt with the speech encoder and projects it to the language model’s dimension. The LM then decodes autoregressively, first generating the text transcription and continuation, and then a spectrogram for the speech continuation: each spectrogram frame is predicted from past estimates and refined by a post-net. Finally, a vocoder converts the predicted spectrogram into a waveform, so the model outputs both text and speech from a single speech input.
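A greedy-decoding sketch of this two-stage inference, stitching together the hypothetical modules from the earlier sketches (`speech_encoder`, `to_lm_dim`, `decoder`, `embed`, `text_head`, `pre_net`, `post_net`) plus stub `vocoder` and `EOS_ID` placeholders:

```python
import torch

EOS_ID = 1                               # hypothetical end-of-text token id
vocoder = lambda spec: spec.flatten(1)   # stub; a real neural vocoder goes here

@torch.no_grad()
def generate(spec_prompt, max_text=100, max_frames=400):
    """Two-stage greedy decode over one shared decoder stack."""
    seq = to_lm_dim(speech_encoder(spec_prompt))        # encoded speech prefix
    tokens = []
    for _ in range(max_text):                           # 1) transcription + text continuation
        tok = text_head(decoder(seq)[:, -1]).argmax(-1) # greedy next token
        tokens.append(tok)
        seq = torch.cat([seq, embed(tok).unsqueeze(1)], dim=1)
        if (tok == EOS_ID).all():
            break
    frames = []
    for _ in range(max_frames):                         # 2) spectrogram continuation
        frame = post_net(decoder(seq)[:, -1])           # next frame from past estimates
        frames.append(frame)
        seq = torch.cat([seq, pre_net(frame).unsqueeze(1)], dim=1)
    spectrogram = torch.stack(frames, dim=1)            # (B, max_frames, N_MELS)
    return tokens, vocoder(spectrogram)                 # text ids and waveform
```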
Experiments
The ablations removed individual components from the model and measured log-perplexity on the LibriSpeech dataset: the intermediate cross-entropy loss on text, the spectrogram derivative loss, pre-training of the language model, and pre-training of the speech encoder. Each component contributes significantly to performance. Removing the ASR & LM cross-entropy loss or the spectrogram derivative loss had the largest individual impact, markedly increasing log-perplexity, and dropping pre-training of either the speech encoder or the language model also caused a notable decline. The worst degradation occurred when pre-training was removed from both the speech encoder and the language model at once, underscoring how much the model relies on its pre-trained components.
Limitations