SVDF, just give Conv a bit of time
Simply add a dimension of time to the standard Conv layer and it becomes the SVDF layer, the core component powering our favorite keyword recognizer.
Here is how SVDF works; I have to enlist the help of a 3D drawing to highlight the different dimensions.
In one sentence: it's a FIFO (First-In-First-Out) memory sandwiched between two standard Conv layers. Let's add some context to help grasp the idea behind it. Assume the input is speech data and the feature convolution extracts the 1kHz component (like a FIR filter) from the input; the FIFO memory unit then remembers the 1kHz signal level over some past period, depending on the memory length. The time convolution on top can then look for temporal patterns, such as a sharp increase or a shallow decay of 1kHz energy over time. We can imagine this kind of information being quite useful for processing speech.
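To make that one sentence concrete, here is a minimal NumPy sketch of a single SVDF layer running frame by frame. All the sizes here are made up for illustration; they are not the values of the real model in the appendix below.

import numpy as np

# Illustrative sizes (assumptions, not from the real model):
num_features, num_filters, memory_size = 64, 16, 8

rng = np.random.default_rng(0)
feature_filters = rng.standard_normal((num_filters, num_features))  # "feature Conv"
time_filters = rng.standard_normal((num_filters, memory_size))      # "time Conv"

def svdf_step(frame, fifo):
    """Process one input frame; returns (output, updated FIFO)."""
    # 1. Feature convolution: how strongly each learned feature
    #    (e.g. "energy around 1kHz") is present in this frame.
    activation = feature_filters @ frame                       # (num_filters,)
    # 2. FIFO memory: drop the oldest value, append the newest.
    fifo = np.concatenate([fifo[:, 1:], activation[:, None]], axis=1)
    # 3. Time convolution: each filter inspects its own history
    #    for temporal patterns (sharp rise, shallow decay, ...).
    out = np.sum(fifo * time_filters, axis=1)                  # (num_filters,)
    return out, fifo

fifo = np.zeros((num_filters, memory_size))
for _ in range(20):  # feed a stream of random frames
    out, fifo = svdf_step(rng.standard_normal(num_features), fifo)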
This internal structure gives SVDF a natural ability to capture a feature's temporal behavior. Conventional DSP uses similar techniques a lot: decompose a signal into different frequency bands (features), then process each band individually. The big difference now is that which features to extract and what processing to apply are all driven by ML.
Stacking multiple SVDF layers multiplies the time receptive range (a quick sketch of the math below). Such an architecture quickly earned a key position in keyword-recognition solutions, for two good reasons:
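A back-of-the-envelope on that receptive range; the memory sizes and the 10 ms frame step are assumptions for illustration:

# Receptive field of stacked stride-1 SVDF layers: each layer adds
# (memory_size - 1) frames of context, same math as stacked 1-D convs.
def receptive_field(memory_sizes):
    frames = 1
    for m in memory_sizes:
        frames += m - 1
    return frames

# Three layers with 8-frame memories see 22 input frames; at a
# (hypothetical) 10 ms frame step that is roughly 220 ms of audio,
# versus only 80 ms for a single layer.
print(receptive_field([8, 8, 8]))  # -> 22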
In my experience, SVDF is very good at grabbing the low-hanging fruit, delivering better performance in small-footprint models. In fact, the tiny model featured in the earlier tiny-model post is SVDF based: simply three SVDF layers stacked together.
The SVDF layer implementation is based on the incredibly neat example from Google research, which uses DepthwiseConv2D to realize both the memory unit and the time Conv. It is also a very good example of the "from batch to streaming" challenge :-)
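For flavor, here is a minimal Keras sketch in the same spirit as that example; the layer sizes are mine, not the ones from the Google code or from the model summary below:

import tensorflow as tf
from tensorflow.keras import layers

num_features, num_filters, memory_size = 64, 16, 8   # illustrative sizes

inputs = tf.keras.Input(shape=(None, num_features))  # (batch, time, features)
# Feature convolution: a 1x1 projection applied to every frame.
x = layers.Conv1D(num_filters, kernel_size=1, use_bias=False)(inputs)
# Add a dummy spatial axis so DepthwiseConv2D can slide over time:
# shape becomes (batch, time, 1, num_filters).
x = layers.Lambda(lambda t: tf.expand_dims(t, axis=2))(x)
# Causal padding plus one independent length-memory_size filter per
# channel; together these play the role of the FIFO and the time Conv.
x = layers.ZeroPadding2D(padding=((memory_size - 1, 0), (0, 0)))(x)
x = layers.DepthwiseConv2D(kernel_size=(memory_size, 1), use_bias=False)(x)
x = layers.Lambda(lambda t: tf.squeeze(t, axis=2))(x)

model = tf.keras.Model(inputs, x)
model.summary()

In training this runs over the whole padded utterance at once; for streaming inference the same depthwise weights are applied to a small FIFO holding the last memory_size activations, which is exactly the batch-to-streaming switch mentioned above.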
Appendix:
Model build summary
__________________________________________________________________________________
 Layer (type)                    Output Shape       Param #   Connected to
==================================================================================
 input_1 (InputLayer)            [(32, None)]       0         []
 lambda (Lambda)                 (32, None, 64)     0         ['input_1[0][0]']
 conv1d (Conv1D)                 (32, None, 64)     4096      ['lambda[0][0]']
 dense (Dense)                   (32, None, 72)     4608      ['conv1d[0][0]']
 dropout (Dropout)               (32, None, 72)     0         ['dense[0][0]']
 svdf__conv2d (SVDF_Conv2D)      (32, None, 112)    9184      ['dropout[0][0]']
 svdf__conv2d_1 (SVDF_Conv2D)    (32, None, 112)    13664     ['svdf__conv2d[0][0]']
 svdf__conv2d_2 (SVDF_Conv2D)    (32, None, 96)     11712     ['svdf__conv2d_1[0][0]']
 tf.slice (TFOpLambda)           (32, 2473, 72)     0         ['dense[0][0]']
 dense_1 (Dense)                 (32, None, 72)     6912      ['svdf__conv2d_2[0][0]']
 multiply (Multiply)             (32, 2473, 72)     0         ['tf.slice[0][0]',
                                                               'dense_1[0][0]']
 conv1d_transpose                (32, 2473, 64)     4608      ['multiply[0][0]']
   (Conv1DTranspose)
 lambda_1 (Lambda)               (32, 158272)       0         ['conv1d_transpose[0][0]']
==================================================================================
Total params: 54,784
Trainable params: 54,784
Non-trainable params: 0