SVDF, just give Conv a bit of time
Simply add a dimension of time to the standard Conv layer and it becomes the SVDF layer, the core component powering our favorite keyword recognizer.
Here is how SVDF works; I have to enlist the help of a 3D drawing to highlight the different dimensions.
In one sentence: it's a FIFO (First-In-First-Out) memory sandwiched between two standard Conv layers. Let's add some context to help grasp the idea behind it. Assume the input is speech data and the feature convolution extracts the 1kHz component (like a FIR filter) from the input; the FIFO memory unit then remembers the 1kHz signal level over some past period, depending on the memory length. The time convolution on top can then look for temporal patterns, such as a sharp increase or a shallow decay of 1kHz energy over time. We can imagine this kind of information being quite useful for processing speech.
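To make that one sentence concrete, here is a minimal NumPy sketch of a single SVDF layer running frame by frame. All the sizes here are made up for illustration; they are not the values of the real model in the appendix below.

import numpy as np

# Illustrative sizes (assumptions, not from the real model):
num_features, num_filters, memory_size = 64, 16, 8

rng = np.random.default_rng(0)
feature_filters = rng.standard_normal((num_filters, num_features))  # "feature Conv"
time_filters = rng.standard_normal((num_filters, memory_size))      # "time Conv"

def svdf_step(frame, fifo):
    """Process one input frame; returns (output, updated FIFO)."""
    # 1. Feature convolution: how strongly each learned feature
    #    (e.g. "energy around 1kHz") is present in this frame.
    activation = feature_filters @ frame                       # (num_filters,)
    # 2. FIFO memory: drop the oldest value, append the newest.
    fifo = np.concatenate([fifo[:, 1:], activation[:, None]], axis=1)
    # 3. Time convolution: each filter inspects its own history
    #    for temporal patterns (sharp rise, shallow decay, ...).
    out = np.sum(fifo * time_filters, axis=1)                  # (num_filters,)
    return out, fifo

fifo = np.zeros((num_filters, memory_size))
for _ in range(20):  # feed a stream of random frames
    out, fifo = svdf_step(rng.standard_normal(num_features), fifo)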
This internal structure gives SVDF a natural ability to capture a feature's temporal behavior. Conventional DSP uses similar techniques a lot: decompose a signal into different frequency bands (features), then process each band individually. The big difference now is that which features to extract and what processing to apply are all driven by ML.
Stacking multiple SVDF layers multiplies the time receptive range (a quick sketch of the math below). Such an architecture quickly earned a key position in keyword-recognition solutions, for two good reasons:
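A back-of-the-envelope on that receptive range; the memory sizes and the 10 ms frame step are assumptions for illustration:

# Receptive field of stacked stride-1 SVDF layers: each layer adds
# (memory_size - 1) frames of context, same math as stacked 1-D convs.
def receptive_field(memory_sizes):
    frames = 1
    for m in memory_sizes:
        frames += m - 1
    return frames

# Three layers with 8-frame memories see 22 input frames; at a
# (hypothetical) 10 ms frame step that is roughly 220 ms of audio,
# versus only 80 ms for a single layer.
print(receptive_field([8, 8, 8]))  # -> 22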
In my experience, SVDF is very good at grabbing the low-hanging fruit, delivering better performance in small-footprint models. In fact, the tiny model featured in the earlier tiny-model post is SVDF based: simply three SVDF layers stacked together.
The SVDF layer implementation is based on the incredibly neat example from Google research, which uses DepthwiseConv2D to realize both the memory unit and the time Conv. It is also a very good example of the "from batch to streaming" challenge :-)
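For flavor, here is a minimal Keras sketch in the same spirit as that example; the layer sizes are mine, not the ones from the Google code or from the model summary below:

import tensorflow as tf
from tensorflow.keras import layers

num_features, num_filters, memory_size = 64, 16, 8   # illustrative sizes

inputs = tf.keras.Input(shape=(None, num_features))  # (batch, time, features)
# Feature convolution: a 1x1 projection applied to every frame.
x = layers.Conv1D(num_filters, kernel_size=1, use_bias=False)(inputs)
# Add a dummy spatial axis so DepthwiseConv2D can slide over time:
# shape becomes (batch, time, 1, num_filters).
x = layers.Lambda(lambda t: tf.expand_dims(t, axis=2))(x)
# Causal padding plus one independent length-memory_size filter per
# channel; together these play the role of the FIFO and the time Conv.
x = layers.ZeroPadding2D(padding=((memory_size - 1, 0), (0, 0)))(x)
x = layers.DepthwiseConv2D(kernel_size=(memory_size, 1), use_bias=False)(x)
x = layers.Lambda(lambda t: tf.squeeze(t, axis=2))(x)

model = tf.keras.Model(inputs, x)
model.summary()

In training this runs over the whole padded utterance at once; for streaming inference the same depthwise weights are applied to a small FIFO holding the last memory_size activations, which is exactly the batch-to-streaming switch mentioned above.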
Appendix:
Model build summary
__________________________________________________________________________________
 Layer (type)                    Output Shape       Param #   Connected to
==================================================================================
 input_1 (InputLayer)            [(32, None)]       0         []
 lambda (Lambda)                 (32, None, 64)     0         ['input_1[0][0]']
 conv1d (Conv1D)                 (32, None, 64)     4096      ['lambda[0][0]']
 dense (Dense)                   (32, None, 72)     4608      ['conv1d[0][0]']
 dropout (Dropout)               (32, None, 72)     0         ['dense[0][0]']
 svdf__conv2d (SVDF_Conv2D)      (32, None, 112)    9184      ['dropout[0][0]']
 svdf__conv2d_1 (SVDF_Conv2D)    (32, None, 112)    13664     ['svdf__conv2d[0][0]']
 svdf__conv2d_2 (SVDF_Conv2D)    (32, None, 96)     11712     ['svdf__conv2d_1[0][0]']
 tf.slice (TFOpLambda)           (32, 2473, 72)     0         ['dense[0][0]']
 dense_1 (Dense)                 (32, None, 72)     6912      ['svdf__conv2d_2[0][0]']
 multiply (Multiply)             (32, 2473, 72)     0         ['tf.slice[0][0]',
                                                               'dense_1[0][0]']
 conv1d_transpose                (32, 2473, 64)     4608      ['multiply[0][0]']
   (Conv1DTranspose)
 lambda_1 (Lambda)               (32, 158272)       0         ['conv1d_transpose[0][0]']
==================================================================================
Total params: 54,784
Trainable params: 54,784
Non-trainable params: 0