Free trial: integrate NN processing into an MCU with 2 lines of C code
Trying is believing. In this post, I let everyone try my example NN processing in their own embedded application with just 2 lines of C code, without compromising on efficiency, of course.
With the most popular embedded NN deployment framework being C++ only and inherited from a framework meant for much more powerful systems, there is a natural hurdle in front of embedded NN deployment and system integration. This hurdle has proven costly in efficiency, in terms of both engineering effort and CPU cycles. I would like to show a different possibility, one that not only provides superior efficiency but also points in a direction in which embedded ML might be able to catch up.
Trial information
The model: GRU (Gated Recurrent Unit) based general noise reduction for speech, 119.2k parameters; model summary in the appendix
Input/output: a frame of 64 PCM samples, Q15 format, 16 kHz sample rate; input max amplitude should be > -6 dBFS
Processing precision: 16-bit
Demo constraint: processing time limited to 120 seconds
Target platform: ARM Cortex-M7
Test environment: ST STM32H7A3, arm-none-eabi-gcc 10.3
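For readers whose audio pipeline works in floating point, the Q15 frame format listed above can be sketched as follows. This is a minimal illustration, not part of the trial package: the names `FRAME_LEN`, `FRAME_PERIOD_US`, `float_to_q15` and `q15_to_float` are hypothetical, and the rounding and saturation policy is an assumption.

```c
#include <stdint.h>

/* Values taken from the trial information; the macro names are illustrative. */
#define SAMPLE_RATE_HZ  16000
#define FRAME_LEN       64
/* One frame covers 64 / 16000 s = 4 ms of audio. */
#define FRAME_PERIOD_US ((FRAME_LEN * 1000000L) / SAMPLE_RATE_HZ)

/* Convert a float sample in [-1.0, 1.0) to Q15 (1 sign bit, 15 fraction bits).
 * Out-of-range input saturates instead of wrapping around. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    /* Round to nearest, ties away from zero (an assumed policy). */
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert a Q15 sample back to float in [-1.0, 1.0). */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

If the samples come straight from a 16-bit ADC or I2S peripheral, they are typically already in this format and no conversion is needed.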
Preparation
The two lines of code
Include the header
#include "model_frame_proc.h"
Then call the model to process a frame
model_frame_proc( p_data, p_temp_buf );
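To show where the call might sit in an application, here is a minimal sketch of a frame loop. Several things are assumptions rather than facts from the trial package: the argument types (`int16_t *` and `void *`), in-place processing of the frame, the 2 kB scratch size taken from the figure quoted in this post, and the helper name `denoise_stream`. The stub `model_frame_proc` only stands in for the real routine so the sketch compiles on its own; in a real build, remove it and include "model_frame_proc.h" instead.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN     64    /* samples per frame, from the trial information */
#define SCRATCH_BYTES 2048  /* assumed from the 2 kB scratch figure */

/* Stand-in stub with an ASSUMED signature; it leaves the frame untouched.
 * Replace with: #include "model_frame_proc.h" */
void model_frame_proc(int16_t *p_data, void *p_temp_buf)
{
    (void)p_data;
    (void)p_temp_buf;
}

/* Hypothetical helper: run the model over every complete 64-sample frame
 * in a PCM buffer and return the number of frames processed.
 * A trailing partial frame is left untouched. */
size_t denoise_stream(int16_t *pcm, size_t n_samples)
{
    static int16_t scratch[SCRATCH_BYTES / sizeof(int16_t)];
    size_t frames = 0;

    for (size_t off = 0; off + FRAME_LEN <= n_samples; off += FRAME_LEN) {
        model_frame_proc(&pcm[off], scratch);
        frames++;
    }
    return frames;
}
```

In a real-time system the same call would more likely live in a DMA half/full-transfer callback, one frame per callback, rather than in a loop over a prerecorded buffer.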
Efficiency
Memory usage: 248 kB code + data, 2 kB scratch buffer
CPU usage: 86 MHz when processing a 16 kHz stream in real time, equivalent to 344k cycles per inference
Test environment: STM32H7A3, D/I caches enabled, -Ofast compiler flag
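The relation between the two CPU figures above is simple arithmetic: a 16 kHz stream in 64-sample frames means 250 inferences per second, and 250 x 344k cycles gives 86M cycles per second. The sketch below reproduces that calculation and adds a hypothetical helper, `cpu_load_percent`, for estimating the load at a given core clock; the function and macro names are illustrative only.

```c
#include <stdint.h>

/* Figures quoted in this post; the macro names are illustrative. */
#define SAMPLE_RATE_HZ   16000u
#define FRAME_LEN        64u
#define CYCLES_PER_FRAME 344000u

/* Inferences needed per second for real-time operation: 16000 / 64. */
uint32_t frames_per_second(void)
{
    return SAMPLE_RATE_HZ / FRAME_LEN;
}

/* CPU cycles consumed per second of audio. */
uint32_t required_hz(void)
{
    return CYCLES_PER_FRAME * frames_per_second();
}

/* Hypothetical helper: share of a core running at core_hz that the model
 * would occupy, in whole percent (rounded down). core_hz must be non-zero. */
uint32_t cpu_load_percent(uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)required_hz() * 100u) / core_hz);
}
```

Calling `cpu_load_percent` with the actual core clock of your board gives a quick estimate of the headroom left for the rest of the application.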
How
There is no magic:
I hope this shows a different perspective and helps drive embedded ML forward. If you think this could help your project or algorithm, get in touch.
Appendix:
Model summary
_____________________________________________________
Layer (type)                 Output Shape     Param #
=====================================================
multiply (Multiply)          multiple         0
conv1d (Conv1D)              multiple         4096
dense (Dense)                multiple         5200
gru (GRU)                    multiple         51264
conv1d_1 (Conv1D)            multiple         8536
gru_1 (GRU)                  multiple         40800
dense_1 (Dense)              multiple         5184
conv1d_transpose             multiple         4096
=====================================================
Total params: 119,176
Trainable params: 119,176
Non-trainable params: 0