Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Today's paper introduces Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) that achieves state-of-the-art performance in audio understanding and reasoning tasks. The model combines a 3B-parameter language model with a 203M-parameter audio encoder, demonstrating exceptional capabilities in processing both short and long audio segments. AF2 is the first model to extend audio understanding to segments up to 5 minutes long.
Method Overview
Audio Flamingo 2 combines a custom CLAP (Contrastive Language-Audio Pre-training) audio encoder with a decoder-only language model through gated cross-attention layers. The architecture consists of four main components: AF-CLAP (the audio encoder), audio representation transformation layers, a decoder-only language model, and gated cross-attention layers for audio conditioning.
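To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of a tanh-gated cross-attention block in the spirit of the Flamingo family. The class and argument names (GatedCrossAttentionBlock, audio_feats, the chosen dimensions) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative tanh-gated cross-attention block (Flamingo-style sketch).

    Text hidden states attend over audio features; gates initialized at zero
    mean tanh(gate) = 0, so the pretrained LM starts unchanged and gradually
    learns to use the audio conditioning.
    """

    def __init__(self, d_model: int = 2048, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Learnable scalar gates, initialized to zero.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model); audio_feats: (batch, audio_len, d_model)
        attn_out, _ = self.cross_attn(text_hidden, audio_feats, audio_feats)
        text_hidden = text_hidden + torch.tanh(self.attn_gate) * attn_out
        text_hidden = text_hidden + torch.tanh(self.ffn_gate) * self.ffn(text_hidden)
        return text_hidden
```

The zero-initialized gates are the design choice that lets the frozen language model behave exactly as before training begins, which is why this style of conditioning pairs well with the staged training described below.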
The AF-CLAP audio encoder is trained on over 8 million audio-caption pairs, a substantially larger corpus than those used by previous CLAP-style encoders. The training incorporates an improved contrastive loss that promotes linguistic invariance (robustness to different phrasings of the same description) and compositional reasoning (understanding relationships between acoustic events). For long audio, AF2 extracts features with a sliding window over segments up to 5 minutes long.
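The sketch below illustrates the sliding-window idea: a long waveform is split into fixed-length windows, each window is encoded separately, and the resulting features are concatenated. The window length, hop size, and the encode_fn callable are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def sliding_window_features(waveform: torch.Tensor,
                            encode_fn,
                            sample_rate: int = 16_000,
                            window_sec: float = 30.0,
                            hop_sec: float = 30.0) -> torch.Tensor:
    """Encode a long waveform window by window and concatenate the features.

    waveform:  (num_samples,) mono audio tensor
    encode_fn: callable mapping a (1, window_samples) tensor to (1, T, D) features
    Returns:   (1, total_T, D) concatenated feature sequence.
    """
    window = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    feats = []
    for start in range(0, max(len(waveform) - window + hop, 1), hop):
        chunk = waveform[start:start + window]
        if len(chunk) < window:
            # Zero-pad the final partial window to the full window length.
            chunk = F.pad(chunk, (0, window - len(chunk)))
        feats.append(encode_fn(chunk.unsqueeze(0)))
    return torch.cat(feats, dim=1)
```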
The training process follows a three-stage curriculum learning strategy. In the pre-training stage, the model focuses on multi-modal alignment using large-scale classification and captioning datasets, with only the audio representation transformation and cross-attention layers being trainable. The fine-tuning stage improves audio understanding and reasoning by training on high-quality short-audio datasets, with the CLAP model becoming trainable as well. The long fine-tuning stage extends the context length to 5 minutes using the newly introduced LongAudio dataset.
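Based only on the description above, the staged freezing schedule could be expressed roughly as follows; the module attribute names (clap_encoder, transform_layers, cross_attn_layers) are hypothetical, chosen to mirror the components listed earlier rather than the authors' implementation.

```python
def set_stage(model, stage: str) -> None:
    """Toggle trainable components per curriculum stage (illustrative sketch)."""
    # Start with everything frozen, including the language model backbone.
    for p in model.parameters():
        p.requires_grad = False

    # Stage 1 (pre-training): only the audio representation transformation
    # and gated cross-attention layers are trained.
    trainable = [model.transform_layers, model.cross_attn_layers]

    # Stages 2-3 (fine-tuning / long fine-tuning): the CLAP encoder is unfrozen too.
    # The context-length extension to 5 minutes in stage 3 is handled by the
    # data pipeline and is not captured in this sketch.
    if stage in ("finetune", "long_finetune"):
        trainable.append(model.clap_encoder)

    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
```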
To support this training approach, the paper introduces two datasets: AudioSkills, which contains 4.2 million question-answer pairs designed to develop seven distinct reasoning skills (temporal reasoning, attribute identification, counting, contextual sound event reasoning, contextual speech event reasoning, information extraction, and general reasoning); and LongAudio, the first large-scale long audio understanding dataset with over 80,000 unique audios and approximately 263,000 question-answer pairs.
Results
Audio Flamingo 2 achieves state-of-the-art performance across more than 20 benchmarks, outperforming larger and proprietary models despite having a smaller footprint (3B parameters compared to competitors' 7B+ parameters). On foundational audio understanding tasks, AF2 shows competitive results against larger models. For expert reasoning tasks, it significantly outperforms all previous models, with particularly impressive gains on challenging benchmarks like MMAU Music (+16.4%), Audio Entailment AudioCaps (+29.1%), and CompA-R-test (+16.4%).
The paper demonstrates that high-quality data often yields larger gains than simply scaling model size or compute. Models trained with AudioSkills show superior reasoning capabilities even at smaller LLM sizes. The results also confirm that the cross-attention architecture outperforms the prefix-tuning approaches used in previous models.
For long audio understanding, AF2 achieves a score of 64.2% on the newly introduced LongAudioBench, significantly outperforming the previous best model (Gemini Flash v2) which scored 45.3%.
Conclusion
Audio Flamingo 2 achieves state-of-the-art performance in both short and long audio understanding and reasoning tasks. The paper demonstrates that improvements in data quality, audio representations, and training strategies can lead to superior performance even with smaller model sizes. For more information, please consult the full paper.
Congrats to the authors for their work!
Ghosh, Sreyan, et al. "Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities." arXiv preprint arXiv:2503.03983 (2025).