ML Model Exploration Journey: From Classical Approaches to Hybrid Innovations
Our journey in speech enhancement has been one of continuous exploration and adaptation. As we encountered the limitations of classical signal processing techniques, we moved toward novel architectures and cutting-edge technologies to push the boundaries of what is possible in denoising and speech enhancement. Here's a reflection on how our approach evolved.
Demucs
To address the challenges posed by classical signal processing, we turned to raw waveform-based models. These models eliminated the need for parameter tuning in frequency domain representations (e.g., window size, overlap) and preserved temporal details that were often lost during transformations like the Short-Time Fourier Transform (STFT).
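To make the contrast concrete, the short sketch below (with illustrative values only) shows the kind of frequency-domain parameters a spectrogram front end forces you to choose, and that a waveform-domain model sidesteps entirely.

```python
import tensorflow as tf

# Illustrative only: a frequency-domain front end forces choices such as
# window (frame) length and hop size, and discards phase if only the
# magnitude is kept. The values below are arbitrary examples.
waveform = tf.random.normal([1, 16000])          # 1 second of 16 kHz audio
stft = tf.signal.stft(waveform,
                      frame_length=512,          # window size (tunable)
                      frame_step=128,            # hop / overlap (tunable)
                      fft_length=512)
magnitude = tf.abs(stft)                         # phase information is dropped here

# A waveform-domain model consumes the raw samples directly,
# with no windowing parameters to tune and no phase to discard.
```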
We adopted Demucs [1], an encoder-decoder-based U-Net architecture originally developed for music source separation, and tailored it for speech enhancement. Designed to operate directly in the waveform domain, Demucs processes noisy audio to effectively separate clean speech without relying on transformations like the STFT. This approach not only eliminates the need for frequency-domain parameter tuning but also preserves the fine temporal details critical for speech clarity.
The architecture of Demucs combines the strengths of a multi-layer convolutional encoder-decoder framework with U-Net-style skip connections, which help recover finer details lost during deeper layer processing. This ensures a more accurate reconstruction of the clean speech signal. Additionally, the integration of recurrent neural networks (RNNs) in the form of LSTMs allows the model to capture long-term temporal dependencies, making it well-suited for the dynamic nature of audio signals. Each encoder and decoder layer is equipped with gated linear unit (GLU) activations, which improve both the efficiency and expressiveness of the model by allowing it to focus on the most relevant features during training and inference.
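The sketch below captures these ingredients in simplified Keras code: a strided convolutional encoder with GLU activations, an LSTM bottleneck for long-term context, and a mirrored decoder with U-Net skip connections. Layer counts, channel widths, and kernel sizes are placeholders rather than the actual Demucs configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def glu_conv(x, channels, kernel=8, stride=4, transpose=False):
    """Strided (transposed) 1-D convolution followed by a gated linear unit (GLU)."""
    conv = layers.Conv1DTranspose if transpose else layers.Conv1D
    x = conv(2 * channels, kernel, strides=stride, padding="same")(x)
    a, b = tf.split(x, 2, axis=-1)
    return a * tf.sigmoid(b)          # half the channels gate the other half

# Toy input: one second of 16 kHz audio, single channel.
noisy = tf.random.normal([1, 16384, 1])

# Encoder: progressively downsample and widen the channel dimension,
# keeping each level's output for a U-Net skip connection.
x, skips, channels = noisy, [], 32
for _ in range(4):
    x = glu_conv(x, channels)
    skips.append(x)
    channels *= 2

# LSTM bottleneck captures long-range temporal dependencies.
x = layers.LSTM(channels // 2, return_sequences=True)(x)

# Decoder: mirror the encoder, concatenating the matching skip connection
# at each level to recover fine detail lost during downsampling.
for skip in reversed(skips):
    channels //= 2
    x = glu_conv(tf.concat([x, skip], axis=-1), channels, transpose=True)

clean_estimate = layers.Conv1D(1, 1)(x)   # back to a single-channel waveform
print(clean_estimate.shape)               # (1, 16384, 1)
```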
We initially re-implemented Demucs in TensorFlow for training and deployed it in C++ using Android NNAPI to overcome TensorFlow Lite's operational limitations at the time.
By 2024, advancements in TensorFlow Lite had addressed many of these limitations, allowing the model to be deployed directly as a TensorFlow Lite model and significantly simplifying the pipeline. While Demucs outperformed frequency-domain methods like RNNoise in many scenarios, challenges such as handling extremely low-SNR inputs and occasional voice clipping persisted. Furthermore, because raw waveform models learn patterns directly from the data without explicitly analyzing features like harmonics or spectral characteristics, they struggled to fully capture details critical for aligning with how humans naturally perceive sound.
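As a rough illustration of the simplified deployment path, converting a trained model to a TensorFlow Lite flatbuffer now takes only a few lines; the file names below are hypothetical.

```python
import tensorflow as tf

# Hypothetical paths for illustration; the real export depends on how the
# trained model was saved.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_demucs")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # optional post-training optimization
tflite_model = converter.convert()

with open("demucs_denoiser.tflite", "wb") as f:
    f.write(tflite_model)
```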
Hybrid Demucs
To further improve performance, we explored a hybrid approach that combined raw waveform and frequency domain features within a Hybrid Demucs (HD)[2] architecture. This update introduced a multi-domain analysis framework by incorporating both temporal and spectral branches, allowing the model to leverage the strengths of each representation.
The temporal branch processes raw waveforms directly, preserving fine temporal details crucial for capturing dynamic audio variations. In contrast, the spectral branch operates on spectrograms derived from the input signal, enabling the model to capture frequency-domain features such as harmonics and spectral patterns that are challenging for raw waveform models to extract explicitly. These two branches are carefully aligned and merged into a shared core layer, allowing the architecture to decide which domain is better suited for different parts of the signal and even combine information from both domains dynamically.
This hybrid design retains the core U-Net structure of the original Demucs, with parallel encoder-decoder pathways in both temporal and spectral domains. Skip connections in both branches ensure that fine-grained details are preserved during reconstruction, while shared layers provide a common ground for blending the two representations. The final output combines the processed spectrogram (inverted back to the time domain) with the waveform branch output, producing a coherent and enhanced audio signal.
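A minimal schematic of this dual-branch idea, with the shared core and most architectural detail omitted, might look like the following; spectral_net and temporal_net are stand-ins for the actual encoder-decoder branches rather than the real Hybrid Demucs modules.

```python
import tensorflow as tf

def hybrid_denoise_sketch(noisy, spectral_net, temporal_net,
                          frame_length=1024, frame_step=256):
    """Schematic only: one branch works on the spectrogram, the other on the raw
    waveform, and the final output sums the two time-domain reconstructions."""
    # Spectral branch: STFT -> network predicts a multiplicative mask on magnitudes.
    stft = tf.signal.stft(noisy, frame_length, frame_step)
    mask = spectral_net(tf.abs(stft))
    enhanced_stft = tf.cast(mask, tf.complex64) * stft
    spectral_out = tf.signal.inverse_stft(enhanced_stft, frame_length, frame_step)

    # Temporal branch: waveform in, waveform out (e.g. a U-Net as sketched earlier).
    temporal_out = temporal_net(noisy)

    # Merge: both estimates live in the time domain, so they can simply be summed;
    # during training the two branches learn to specialize.
    length = tf.minimum(tf.shape(spectral_out)[-1], tf.shape(temporal_out)[-1])
    return spectral_out[..., :length] + temporal_out[..., :length]
```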
By combining the two domains, Hybrid Demucs overcomes limitations of purely waveform-based models, such as the inability to explicitly capture harmonic structures, while also addressing the artifacts often introduced by traditional spectrogram-based methods. This approach not only improved the objective evaluations but also enhanced subjective audio quality, reducing artifacts like phase inconsistencies and static noise.
Transition to transformer-based architecture
The emergence of transformer-based models marked a pivotal shift in our approach. By transitioning from RNNs to transformer architectures (Hybrid Transformer Demucs, HTD [3]), we significantly enhanced the model's capacity to handle long-range context dependencies. Unlike RNNs, which process sequences sequentially and face limitations in capturing global context efficiently, transformers leverage self-attention mechanisms to process the entire sequence in parallel. This shift not only increased processing speed but also reduced training time, making the model more scalable for large datasets and longer input sequences.
The self-attention mechanism in transformers provided a robust framework for evaluating relationships between all parts of the sequence simultaneously. This allowed the model to effectively weigh and combine information from different parts of the input, improving its ability to generalize across diverse audio contexts. Furthermore, by employing cross-attention layers in our hybrid architecture, we facilitated information exchange between the temporal and spectral domains. This cross-domain interaction enabled the model to dynamically decide the optimal representation for different parts of the signal, enhancing separation quality.
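Conceptually, this cross-domain exchange can be sketched as two attention passes in which each domain queries the other. The dimensions and the use of Keras MultiHeadAttention layers below are illustrative, not the exact HTD implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative dimensions; the real HTD uses its own depths and widths.
d_model, num_heads = 256, 8
temporal_tokens = tf.random.normal([1, 200, d_model])   # latent sequence, waveform branch
spectral_tokens = tf.random.normal([1, 120, d_model])   # latent sequence, spectrogram branch

attn_t_from_s = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
attn_s_from_t = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)

# Temporal tokens attend over the spectral representation ...
temporal_updated = temporal_tokens + attn_t_from_s(query=temporal_tokens,
                                                   value=spectral_tokens,
                                                   key=spectral_tokens)
# ... and spectral tokens attend over the temporal representation, so each domain
# can pull in information from the other before further processing.
spectral_updated = spectral_tokens + attn_s_from_t(query=spectral_tokens,
                                                   value=temporal_tokens,
                                                   key=temporal_tokens)
```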
This transition to transformer-based architectures not only resolved the limitations of RNNs but also unlocked new possibilities in leveraging long-range dependencies, cross-domain representations, and scalable computation, paving the way for state-of-the-art performance in speech enhancement tasks.
Exploring Band Split RNN
Alongside our hybrid approach, we explored a compelling frequency-domain model, the band-split RNN (BSRNN) [4]. This model employs a unique strategy of prioritizing specific frequency ranges by splitting signals into multiple bands. Notably, the frequency splitting is designed such that lower frequency ranges, where the fundamental frequencies of speech reside, have narrower bands, while higher frequencies are grouped into broader bands. This prioritization mirrors the way humans perceive sound, emphasizing the frequencies most critical for speech intelligibility.
The BSRNN processes each frequency band separately, leveraging band-specific layers for modeling. By isolating and focusing on subband features, it achieves precise noise suppression while improving speech quality. After training the BSRNN, we observed impressive results: the model effectively reduced noise and improved the quality of the original speech at the same time.
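The sketch below illustrates the band-split idea with made-up band edges: low frequencies are cut into narrow bands and high frequencies into wider ones, and each band receives its own projection before the shared sequence model. The actual band layout and module sizes in BSRNN differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def split_into_bands(spec, band_widths):
    """Split the frequency axis of a [batch, time, freq] spectrogram into bands."""
    bands, start = [], 0
    for width in band_widths:
        bands.append(spec[..., start:start + width])
        start += width
    return bands

# Narrow bands at low frequencies (where speech fundamentals live),
# progressively wider bands higher up. Sums to 257 bins for a 512-point FFT.
band_widths = [8] * 8 + [16] * 6 + [32] * 3 + [1]

spec = tf.abs(tf.signal.stft(tf.random.normal([1, 16000]), 512, 128))  # [1, frames, 257]
bands = split_into_bands(spec, band_widths)

# Each band gets its own small projection before the shared sequence model,
# so low-frequency detail is modeled at higher resolution than the top bands.
band_features = [layers.Dense(64)(b) for b in bands]
```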
However, we identified limitations. HTD still denoises better than BSRNN, and BSRNN generates artifacts that slightly compromise the listening experience. A likely cause is BSRNN's reliance on the input mixture's phase without attempting to model or enhance it; this omission can introduce phase inconsistencies that manifest as distortions in the output audio. We also observed that the artifacts were particularly prominent at higher frequencies. A similar issue was reported in NVIDIA's research on speech denoising [5], where artifacts also appeared in the high-frequency range; they addressed it by incorporating a multi-resolution STFT loss for better representation of higher frequencies. Moving forward, we plan to adopt this approach to mitigate the artifacts in our model.
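A common formulation of such a multi-resolution STFT loss, sketched here with illustrative window and hop settings, compares magnitudes of the clean and enhanced signals at several resolutions so that high-frequency detail is penalized explicitly.

```python
import tensorflow as tf

def multi_resolution_stft_loss(clean, enhanced,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Compare spectral magnitudes at several STFT resolutions (sketch)."""
    loss = 0.0
    for frame_length, frame_step in resolutions:
        clean_mag = tf.abs(tf.signal.stft(clean, frame_length, frame_step))
        enhanced_mag = tf.abs(tf.signal.stft(enhanced, frame_length, frame_step))
        # Spectral convergence term plus a log-magnitude term, as commonly used.
        sc = tf.norm(clean_mag - enhanced_mag) / (tf.norm(clean_mag) + 1e-8)
        log_mag = tf.reduce_mean(tf.abs(tf.math.log(clean_mag + 1e-8) -
                                        tf.math.log(enhanced_mag + 1e-8)))
        loss += sc + log_mag
    return loss / len(resolutions)
```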
Results and Evaluations
We have evolved significantly throughout our journey, as reflected in evaluations with DNSMOS P.808 and DNSMOS P.835 on the Deep Noise Suppression (DNS) 2022 test set. Our continuous improvements are evident in the enhanced scores achieved over time.
Band Split-based Hybrid Transformer Demucs
Through our parallel exploration, we uncovered distinct strengths in both the HTD model and the BSRNN model. The HTD model excels by effectively combining raw waveform and frequency-domain information, leveraging the complementary strengths of each domain. In contrast, the BSRNN model demonstrates exceptional efficiency and precision in processing frequency-domain information, thanks to its band-split approach. This approach meticulously divides the frequency spectrum into smaller subbands, focusing on critical frequency ranges for improved processing.
Given these observations, we hypothesize that integrating the BSRNN's band-split mechanism into the frequency-domain processing of the HTD model could synergize their strengths. By adopting this integration, the HTD model could benefit from BSRNN's precision in frequency-domain operations while retaining its capacity to process raw waveforms in parallel. This dual-domain approach not only enhances the model's ability to tackle noise reduction and speech enhancement but also addresses a key limitation of BSRNN: artifact generation caused by its reliance on the input mixture's phase.
Since the HTD model inherently processes raw waveforms alongside the frequency domain, it holds the potential to resolve phase modeling challenges. This could mitigate the artifacts introduced by the BSRNN and lead to more natural and artifact-free outputs. By bridging these strengths, we aim to create a hybrid system that maximizes the benefits of both models, ultimately achieving superior performance in speech enhancement tasks.
Additional future directions for improvement
In the MetricGAN [6] paper, the authors propose an approach for optimizing models directly on one or more evaluation metrics, which enables improvements from multiple perspectives. We aim to leverage this concept by incorporating a MetricGAN discriminator (MGD) to enhance our model further.
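In rough terms, the MetricGAN recipe trains a discriminator to predict a normalized evaluation score (e.g. PESQ) for an (enhanced, clean) pair, and then trains the enhancement model to drive that prediction toward the maximum. The sketch below uses placeholder layer sizes rather than the paper's configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_metric_discriminator():
    """Predict a normalized metric score in [0, 1] from an (enhanced, clean) spectrogram pair."""
    spec_pair = tf.keras.Input(shape=(None, 257, 2))   # stacked magnitude spectrograms
    x = layers.Conv2D(16, 5, activation="relu")(spec_pair)
    x = layers.Conv2D(32, 5, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    score = layers.Dense(1, activation="sigmoid")(x)   # predicted metric score
    return tf.keras.Model(spec_pair, score)

# Discriminator target: the true metric score of (enhanced, clean).
# Generator loss term: (D(enhanced, clean) - 1)^2, rewarding outputs that the
# discriminator believes would receive the maximum metric score.
```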
Additionally, when using full-band mel-spectrograms, the authors of Multi-Resolution Spectrogram Discriminator (MRSD)[7] encountered an over-smoothing issue, resulting in less defined spectrograms. To tackle this, the authors introduced a multi-resolution spectrogram discriminator, which utilizes multiple linear spectrogram magnitudes computed with various parameter sets. We plan to integrate this during the fine-tuning stage to improve the human auditory perceptual quality of the enhanced speech.
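A hedged sketch of such a setup is shown below: one small discriminator per STFT parameter set, each judging linear-magnitude spectrograms at its own resolution. The parameter sets and layer sizes are illustrative, not those of the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_spec_discriminator():
    """Small convolutional discriminator over a single-resolution magnitude spectrogram."""
    spec = tf.keras.Input(shape=(None, None, 1))
    x = layers.Conv2D(32, 3, strides=2)(spec)
    x = layers.LeakyReLU()(x)
    x = layers.Conv2D(64, 3, strides=2)(x)
    x = layers.LeakyReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    return tf.keras.Model(spec, layers.Dense(1)(x))

resolutions = [(512, 128), (1024, 256), (2048, 512)]
discriminators = [build_spec_discriminator() for _ in resolutions]

def discriminate(waveform):
    """Run each discriminator on the magnitude spectrogram at its own resolution."""
    outputs = []
    for (frame_length, frame_step), disc in zip(resolutions, discriminators):
        mag = tf.abs(tf.signal.stft(waveform, frame_length, frame_step))
        outputs.append(disc(mag[..., tf.newaxis]))
    return outputs
```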
Conclusion
Our journey in speech enhancement has been one of constant evolution—from addressing the limitations of classical approaches with raw waveform models like Demucs, to pioneering hybrid architectures with Hybrid Demucs, and ultimately embracing the power of transformers in Hybrid Transformer Demucs. Along the way, insights from models like BSRNN have inspired innovative integrations, bringing us closer to a unified system that excels in precision and adaptability.
Looking forward, we are committed to unifying the strengths of our research efforts with existing research work to build a state-of-the-art speech enhancement system. This journey reflects our dedication to pushing boundaries and exploring new frontiers, driven by a vision to create models that not only meet the challenges of today but also anticipate the needs of tomorrow.
Explore our Solutions
Our model source code is available on GitHub under the MIT license.
References: