Peek into the future
The devil is in the details: there is an often hidden, small detail that we must not miss when interpreting performance figures.
Below is a plot (from https://paperswithcode.com/sota/speech-separation-on-wsj0-2mix) showing how speech separation models have advanced over the years.
It’s a pretty encouraging picture, but before employing any of these models in your application, there is one detail that cannot be missed: whether the model is causal.
“Causal” here means the model does NOT use future information to make its decision at the current time step. Let’s use Conv-TasNet (arXiv:1809.07454v3) as an example, since its paper has a very good comparison table.
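To make the definition concrete, here is a minimal sketch (not taken from the Conv-TasNet code) of how the same filtering operation becomes causal or non-causal purely through where the padding goes: a causal layer pads only on the left, so each output sample depends on present and past inputs, while the non-causal version pads symmetrically and therefore peeks at future samples.

```python
import numpy as np

def conv1d(x, taps, causal):
    """Sliding dot product of `taps` over `x`, same output length as input.
    causal=True  : left-pad only, so y[t] depends on x[t-k+1..t] (no lookahead).
    causal=False : symmetric padding, so y[t] also sees future samples."""
    k = len(taps)
    pad = (k - 1, 0) if causal else ((k - 1) // 2, k // 2)
    xp = np.pad(x, pad)
    return np.array([xp[t:t + k] @ taps for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
taps = np.array([0.5, 0.5])                # a simple two-tap moving average
print(conv1d(x, taps, causal=True))        # [0.5 1.5 2.5 3.5] -> averages x[t-1], x[t]
print(conv1d(x, taps, causal=False))       # [1.5 2.5 3.5 2. ] -> averages x[t], x[t+1]
```

Note that both variants compute the exact same filter; only the alignment differs. That is why a causal variant of an architecture is usually a small structural change, yet it removes the lookahead that the non-causal version silently relies on.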
With the same architecture, simply making the model causal, both LSTM-TasNet and Conv-TasNet show a noticeable performance drop and seem to hit some kind of ceiling.
Non-causal models have access to a wider context, both past and future (relative to the current time step), hence the potential to achieve better results, especially with transformer architectures. This suits non-realtime applications, or ones that can afford significant latency. Realtime processing can only make use of the past, so a lower performance ceiling is expected.
Edge AI applications often have realtime or sharp responsiveness as a requirement, so a causal model is essential. Although the ceiling is lower, the trend is that we reach it with less and less computation. The earlier Conv-TasNet example demonstrates this too: the number of parameters is significantly reduced while reaching the same performance level.
Not all designs are realtime friendly or include a causal variant as part of their evaluation; in those cases extra evaluation is needed to gain further understanding.
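One such extra evaluation can be done empirically, without reading the architecture at all: perturb the input only after some time step t and check whether the output up to t changes. The probe below is a hypothetical sketch of that idea (the function name and toy models are my own, not from any paper); passing it for a few random inputs is evidence of causality, not a proof.

```python
import numpy as np

def is_causal(model, length=64, t=32, trials=4, tol=1e-9):
    """Empirical causality probe: feed two inputs that differ only AFTER
    time step t; a causal model must produce identical output up to t."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        x = rng.standard_normal(length)
        x2 = x.copy()
        x2[t + 1:] += rng.standard_normal(length - t - 1)  # change the future only
        y, y2 = model(x), model(x2)
        if np.max(np.abs(y[:t + 1] - y2[:t + 1])) > tol:
            return False  # output at or before t reacted to a future change
    return True

# Toy stand-ins for a real model: two-tap moving averages.
causal_ma = lambda x: (x + np.concatenate(([0.0], x[:-1]))) / 2    # uses x[t-1], x[t]
lookahead_ma = lambda x: (x + np.concatenate((x[1:], [0.0]))) / 2  # uses x[t], x[t+1]
print(is_causal(causal_ma))     # True
print(is_causal(lookahead_ma))  # False
```

The same probe applies to a trained network by wrapping its inference call as `model`; it is a cheap sanity check before committing to a model whose causality claims are undocumented.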
Worth highlighting here: this blog is Edge AI focused, so all examples are by default realtime-oriented implementations.