Paper Review: Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
Audio Flamingo 2 is an Audio-Language Model with advanced audio understanding and reasoning capabilities. AF2 achieves state-of-the-art performance with a compact 3B parameter model, surpassing larger open-source and proprietary models across 20+ benchmarks. Key components include a custom CLAP model, synthetic Audio QA data, and a multi-stage curriculum learning strategy. AF2 extends audio understanding to long segments (30 secs to 5 mins) and introduces LongAudio, a dataset for long audio captioning and question-answering, as well as LongAudioBench, an expert-annotated benchmark for long-audio understanding.
The architecture
AF-CLAP is an improved version of the CLAP audio encoder, designed to enhance linguistic robustness and compositional reasoning in Audio-Language Models. Standard CLAP models struggle with limited high-quality training data and inconsistencies in linguistic variations and sound relationships. AF-CLAP addresses these issues by scaling up the training data and making the contrastive training objective more robust to linguistic variation and compositional structure.
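For reference, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLAP-style audio-text encoders are trained with; AF-CLAP's specific modifications (larger data, improved robustness) are not shown, and the function names and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)               # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)           # text -> audio direction
    return (loss_a2t + loss_t2a) / 2
```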
AF2 extracts dense audio features from the penultimate layer of AF-CLAP, which yields higher-quality representations than mean pooling. For longer audio, a non-overlapping sliding-window approach is used, with RoPE injecting temporal information across windows.
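A rough sketch of how long audio can be split into non-overlapping windows before encoding; the 30-second window, the 16 kHz sample rate, and the `clap_encoder` callable are illustrative assumptions, and the RoPE-based temporal encoding applied downstream is only noted in a comment.

```python
import torch

def window_audio(waveform, sample_rate=16000, window_sec=30.0):
    """Split a mono waveform into non-overlapping windows (last one zero-padded)."""
    win = int(sample_rate * window_sec)
    n_windows = -(-waveform.numel() // win)          # ceiling division
    padded = torch.nn.functional.pad(waveform, (0, n_windows * win - waveform.numel()))
    return padded.view(n_windows, win)               # (n_windows, win)

def encode_long_audio(waveform, clap_encoder):
    """Encode each window and return dense per-window features.

    `clap_encoder` stands in for the AF-CLAP audio tower and is assumed to
    return penultimate-layer features of shape (tokens, dim) per window.
    Temporal order across windows is injected later with RoPE inside the
    attention layers (not shown here).
    """
    windows = window_audio(waveform)
    feats = [clap_encoder(w) for w in windows]       # list of (tokens, dim)
    return torch.cat(feats, dim=0)                   # (n_windows * tokens, dim)
```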
To enhance representation capacity, AF2 applies three self-attention layers on top of the audio features. Gated cross-attention layers then condition the LLM on the audio representations; compared with appending audio tokens to the LLM's input sequence, this reduces the attention cost over the audio from quadratic to linear.
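A minimal sketch of a Flamingo-style tanh-gated cross-attention block of the kind described above, in PyTorch with illustrative dimensions; AF2's actual layer placement and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Gated cross-attention: text hidden states attend to audio features.

    The tanh gates are initialised at zero, so at the start of training the
    block acts as an identity mapping and the LLM's behaviour is preserved.
    """
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, audio_feats):
        # text_hidden: (batch, text_len, dim); audio_feats: (batch, audio_len, dim)
        attn_out, _ = self.attn(query=text_hidden, key=audio_feats, value=audio_feats)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x
```

Because the audio enters only through cross-attention, compute grows linearly with the number of audio tokens, instead of quadratically as it would if those tokens were part of the LLM's self-attention sequence.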
AF2 uses Qwen2.5-3B as a base model.
Training data
AudioSkills is a synthetic audio reasoning dataset designed to enhance problem-solving and reasoning abilities beyond basic acoustic event classification. It includes ~4.2M QA pairs generated from open-source sound and music datasets, synthetic audio, and GPT-4o-generated metadata.
The dataset covers seven key reasoning skills.
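For illustration, here is a hedged sketch of how a skill-targeted QA pair could be generated from audio metadata with GPT-4o; the prompt wording, the `generate_qa` helper, and the JSON output format are hypothetical stand-ins, not the paper's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QA_PROMPT = """You are given metadata describing an audio clip:
{metadata}
Write one challenging question that requires {skill} about the audio, and its answer.
Return JSON with keys "question" and "answer"."""

def generate_qa(metadata: str, skill: str = "temporal reasoning") -> str:
    """Generate one skill-targeted QA pair from audio metadata (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": QA_PROMPT.format(metadata=metadata, skill=skill)}],
    )
    return response.choices[0].message.content
```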
LongAudio is the first large-scale long-audio understanding dataset, with 80K+ unique audios and 263K audio question-answer (AQA) pairs. To ensure diversity, the source videos are clustered and selectively sampled; captions are generated with Qwen2-VL-2B-Instruct and Qwen2-Audio, while GPT-4o creates reasoning-based questions.
LongAudioBench is a high-quality benchmark subset; it consists of 2429 human-verified instances sampled from LongAudio. GPT-4o is used as a judge for evaluation, scoring responses from 1 to 10 based on correctness.
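A sketch of what the GPT-4o-as-judge scoring step could look like; the prompt text and the `judge_score` helper are hypothetical, and only the 1-to-10 correctness scale is taken from the description above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a question about a long audio recording.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Rate the model answer for correctness on a scale of 1 to 10 and reply with the number only."""

def judge_score(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4o to score a predicted answer against the reference (1-10)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  reference=reference,
                                                  prediction=prediction)}],
        temperature=0.0,
    )
    return int(response.choices[0].message.content.strip())
```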
Training strategy
AF2 is trained with a 3-stage curriculum learning strategy that progressively increases the audio context length and the quality of the training data.
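To make the idea concrete, here is a schematic sketch of such a curriculum; the stage names, context lengths, data mixes, and trainable-module lists are illustrative placeholders rather than the paper's exact settings (only the 5-minute upper bound matches the description above).

```python
# Illustrative placeholders only -- the exact per-stage settings are in the paper.
CURRICULUM = [
    {"stage": 1, "max_audio_sec": 30,  "data": "large-scale pre-training mix",
     "trainable": ["audio projections", "cross-attention"]},
    {"stage": 2, "max_audio_sec": 90,  "data": "higher-quality fine-tuning mix",
     "trainable": ["cross-attention", "LLM"]},
    {"stage": 3, "max_audio_sec": 300, "data": "long-audio fine-tuning mix",
     "trainable": ["cross-attention", "LLM"]},
]

def run_curriculum(train_one_stage):
    """Run the stages in order; each stage resumes from the previous checkpoint."""
    for cfg in CURRICULUM:
        train_one_stage(cfg)
```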
Experiments
The model is trained using 128 NVIDIA A100 80GB GPUs.
AF2 outperforms larger models despite having a smaller 3B LLM, excelling in both foundational audio understanding and expert-level reasoning across standard benchmarks.