Professor Wenwu Wang Makes Waves in Language-Audio AI at International Workshops
Surrey Institute for People-Centred AI (PAI)
Putting people at the heart of AI.
In December 2024, Professor Wenwu Wang gave invited keynote talks at three international workshops and also took part in a panel discussion as an invited panellist.
The three keynote talks were:
1. IEEE International Workshop on Spoken Language Technology (SLT 2024), held in Macau, China, from 2nd to 5th December 2024. The keynote speech was titled “Large Language-Audio Models and Applications”. This event was attended by about 400 people in person.
Abstract of the talk:
Large Language Models (LLMs) are being explored in audio processing to interpret and generate meaningful patterns from complex sound data, such as speech, music, environmental noise, sound effects, and other non-verbal audio. Combined with acoustic models, LLMs offer great potential for addressing a variety of problems in audio processing, such as audio captioning, audio generation, source separation, and audio coding. This talk will cover recent advancements in using LLMs to address audio-related challenges. Topics will include language-audio models for mapping and aligning audio with textual data, their applications across various audio tasks, the creation of language-audio datasets, and potential future directions in language-audio learning. We will demonstrate our recent works in this area, for example, AudioLDM, AudioLDM2 and WavJourney for audio generation and storytelling, AudioSep for audio source separation, ACTUAL for audio captioning, SemantiCodec for audio coding, WavCraft for content creation and editing, and APT-LLMs for audio reasoning, as well as the datasets WavCaps, Sound-VECaps, and AudioSetCaps for training and evaluating large language-audio models.
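The mapping and alignment between audio and text mentioned in the abstract is typically learned contrastively, by pulling paired audio and caption embeddings together in a shared space. The snippet below is a minimal, illustrative sketch of such a contrastive objective, not code from the talk or from the models named above; the embedding dimension, batch size, and temperature are placeholder assumptions.

```python
# Illustrative sketch (not the speaker's code): a contrastive objective that
# aligns audio and text embeddings in a shared space, as used by many
# language-audio models. Dimensions and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)          # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)            # (B, D)
    logits = audio_emb @ text_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0))           # matching pairs lie on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)         # audio-to-text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)     # text-to-audio direction
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage with random stand-in embeddings; a real system would obtain these
# from pretrained audio and text encoders.
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(contrastive_audio_text_loss(audio_emb, text_emb))
```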
2. The Codec-SUPERB workshop, held in conjunction with SLT 2024 on 3rd December 2024. This keynote speech was titled “Neural Audio Codecs: Recent Progress and a Case Study with SemantiCodec”. This event was attended by about 400 people in person.
Abstract of the talk:
The neural audio codec has attracted increasing interest as a highly effective method for audio compression and representation. By transforming continuous audio into discrete tokens, it facilitates the use of large language model (LLM) techniques in audio processing. In this talk, we will report recent progress in neural audio codecs, with a particular focus on SemantiCodec, a new neural audio codec for ultra-low-bit-rate audio compression and tokenization. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are then used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec offers several advantages over previous codecs, which typically operate at high bitrates, are confined to narrow domains such as speech, and lack the semantic information essential for effective language modelling. First, SemantiCodec compresses audio into fewer than 100 tokens per second across various audio types, including speech, general audio, and music, while maintaining high-quality output. Second, it preserves substantially richer semantic information from audio compared to all evaluated codecs. We will illustrate these benefits through benchmarking and conclude by discussing potential directions for future research in this field.
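To make the dual-encoder idea concrete, the schematic sketch below shows how two discrete token streams might be produced by nearest-centroid quantisation of semantic and acoustic features. It is an illustrative toy under stated assumptions, not the SemantiCodec implementation: the codebook sizes, feature dimensions, and frame rates are made up, the codebooks are random stand-ins for ones learned offline (e.g. via k-means), and the diffusion-based decoder is omitted.

```python
# Schematic sketch of the dual-token idea described in the abstract (not the
# SemantiCodec code): semantic tokens from quantised self-supervised features,
# acoustic tokens from a second codebook over residual acoustic features.
import torch

def quantise(features, codebook):
    """Assign each feature frame (T, D) to its nearest codebook entry (K, D)."""
    dists = torch.cdist(features, codebook)   # (T, K) pairwise distances
    return dists.argmin(dim=-1)               # (T,) discrete token ids

# Assumed shapes: 25 frames per second per stream, 768-dim features,
# codebooks of 8192 (semantic) and 1024 (acoustic) entries.
T, D, K_SEM, K_AC = 25, 768, 8192, 1024
semantic_feats = torch.randn(T, D)            # stand-in for AudioMAE-like features
acoustic_feats = torch.randn(T, D)            # stand-in for residual acoustic features
semantic_codebook = torch.randn(K_SEM, D)     # stand-in for a k-means codebook
acoustic_codebook = torch.randn(K_AC, D)

semantic_tokens = quantise(semantic_feats, semantic_codebook)
acoustic_tokens = quantise(acoustic_feats, acoustic_codebook)

# Two streams of 25 tokens each give 50 tokens per second in this toy setup,
# in the spirit of the sub-100 tokens/s figure mentioned in the talk.
print(semantic_tokens.numel() + acoustic_tokens.numel())
```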
3. The China Computer Federation (CCF) Workshop on Speech Processing in the Era of Large Language Models, held on 6th December 2024 in Shenzhen, China. The keynote speech was titled “Language Queried Audio Source Separation”. This event was attended by more than 100 people in person and 6,100 people online.
Abstract of the talk:
Language-queried audio source separation (LASS) is a paradigm that we proposed recently for separating sound sources of interest from an audio mixture using a natural language query. The development of LASS systems offers intuitive and scalable interface tools that are potentially useful for digital audio applications, such as automated audio editing, remixing, and rendering. In this talk, we will first introduce the problem setting and motivation, making a connection with conventional paradigms including speech source separation and universal audio source separation. We will then present our two newly developed LASS algorithms, AudioSep and FlowSep. AudioSep is a foundational model for open-domain audio source separation driven by natural language queries. It employs a query network and a separation network to predict time-frequency masks, enabling the extraction of target sounds based on text prompts. The model was trained on large-scale multimodal datasets and evaluated extensively on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. FlowSep is a new generative model for LASS based on rectified flow matching (RFM), which models linear flow trajectories from noise to target source features within the latent space of a variational autoencoder (VAE). During inference, the RFM-generated latent features are used to reconstruct a mel-spectrogram through the pre-trained VAE decoder, which is then passed to a pre-trained vocoder to synthesize the waveform. After this, we will discuss the datasets and performance metrics we developed for evaluating LASS systems, and the organisation of Task 8 of the DCASE 2024 international challenge, which builds on the AudioSep model. Finally, we will conclude the talk by outlining potential future research directions in this area.
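As a rough illustration of the query-conditioned masking idea described for AudioSep, the toy sketch below predicts a time-frequency mask from a mixture spectrogram and a text-query embedding. It is not the AudioSep or FlowSep code: the tiny architecture, dimensions, and FiLM-style conditioning are assumptions chosen for brevity, and the networks are untrained placeholders.

```python
# Minimal sketch of text-queried, mask-based separation (not the AudioSep code):
# a text query embedding conditions a network that predicts a time-frequency
# mask, which is applied to the mixture magnitude spectrogram.
import torch
import torch.nn as nn

class MaskedSeparator(nn.Module):
    def __init__(self, n_freq=513, text_dim=512, hidden=256):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * hidden)  # FiLM-style text conditioning
        self.enc = nn.Linear(n_freq, hidden)         # per-frame spectrogram encoder
        self.dec = nn.Linear(hidden, n_freq)         # mask predictor

    def forward(self, mix_mag, text_emb):
        # mix_mag: (T, F) magnitude spectrogram; text_emb: (text_dim,) query embedding
        h = torch.relu(self.enc(mix_mag))
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = h * scale + shift                        # condition frames on the text query
        mask = torch.sigmoid(self.dec(h))            # (T, F) mask in [0, 1]
        return mask * mix_mag                        # estimated target magnitude

# Toy usage: a random mixture spectrogram and a stand-in query embedding
# (a real system would obtain the latter from a pretrained text encoder).
model = MaskedSeparator()
mix_mag = torch.rand(100, 513)
text_emb = torch.randn(512)
est_mag = model(mix_mag, text_emb)
print(est_mag.shape)  # torch.Size([100, 513])
```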
The panel discussion, held at SLT 2024, focused on large language models for spoken language processing. It was chaired by Prof Hung-yi Lee (National Taiwan University) and joined by panel members Prof Mark A. Hasegawa-Johnson (University of Illinois Urbana-Champaign), Prof Kai Yu (Shanghai Jiao Tong University & Chief Scientist of AISpeech), Dr Jinyu Li (Partner Applied Science Manager at Microsoft), Dr Junlan Feng (Chief Scientist at China Mobile Research Institute), and Prof Junichi Yamagishi (National Institute of Informatics, Japan).