free trial: integrate NN processing in MCU with 2 lines of C code

Trying is believing. In this post, I will let everyone try bringing my example NN processing into your own embedded application with just 2 lines of C code, without compromising efficiency of course.

With the most popular embedded NN deployment framework being C++ only and inherited from a framework meant for much more powerful systems, there is a natural hurdle in front of embedded NN deployment and system integration. This hurdle has proven costly, in both engineering effort and CPU cycles. I would like to show a different possibility, one that not only provides superior efficiency, but also points in a direction where embedded ML might be able to play catch-up.


Trial information

  • The model: GRU (Gated Recurrent Unit) based general speech noise reduction, 119.2k parameters, model summary in appendix
  • Input/output: a frame of 64 PCM samples, Q15 format, 16 kHz sample rate, input max amplitude should be > -6 dBFS
  • Processing precision: 16-bit
  • Demo constraint: processing time limited to 120 sec
  • Target platform: ARM Cortex-M7
  • Test environment: ST STM32H7A3, arm-none-eabi-gcc 10.3


Preparation

  • Download and unzip the library (provided for evaluation purposes only)
  • Place the .h file in the project include folder
  • Place the .a file in the library folder and pass the library name “model_processing” to the linker, using the “-L” & “-l” flags or the STM32CubeIDE project settings. Note the “lib” prefix and “.a” suffix are added automatically.

The two lines of code

Include the header

#include "model_frame_proc.h"        

then call the model to process a frame

model_frame_proc( p_data, p_temp_buf );          


Efficiency

Memory usage: 248 kB code + data, 2 kB scratch buffer

CPU usage: 86 MHz when processing a 16 kHz stream in realtime, equivalent to 344k cycles per inference

Test environment: STM32H7A3, D/I-cache enabled, -Ofast compiler flag

?

How

There is no magic:

  • The silky integration comes from the same concept as Rust: build all dependencies into the static library.
  • The efficiency is achieved by thoughtfully translating NN layers into the most basic operations understood by the compiler.
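To illustrate what "translating NN layers into the most basic operations" could look like, here is a sketch of a 16-bit fixed-point Dense layer lowered to a plain multiply-accumulate loop the compiler can vectorize for Cortex-M7. This is not the library's actual implementation; the function name, Q-format choices, and layout are illustrative assumptions.

```c
#include <stdint.h>

/* One possible lowering of a Dense layer in 16-bit fixed point:
 * Q15 inputs and weights, Q30 accumulation, saturate back to Q15.
 * Illustrative only; names and shapes are made up for this sketch. */
void dense_q15(const int16_t *in, int n_in,
               const int16_t *weights,   /* [n_out][n_in], Q15      */
               const int32_t *bias,      /* Q30 accumulator bias    */
               int16_t *out, int n_out)
{
    for (int o = 0; o < n_out; o++) {
        int32_t acc = bias[o];
        const int16_t *w = &weights[o * n_in];
        for (int i = 0; i < n_in; i++)
            acc += (int32_t)in[i] * w[i];   /* Q15 x Q15 -> Q30 MAC */
        acc >>= 15;                          /* back to Q15          */
        if (acc >  32767) acc =  32767;      /* saturate             */
        if (acc < -32768) acc = -32768;
        out[o] = (int16_t)acc;
    }
}
```

Nothing here requires C++, a runtime, or an interpreter: just arrays and loops, which is why both the binary footprint and the cycle count stay small.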


I hope this shows a different perspective and could help drive embedded ML forward. If you think this could help your project or algorithm, get in touch.



Appendix:

Model summary
___________________________________________
 Layer (type)      Output Shape     Param #
===========================================
 multiply (Multiply)   multiple     0
 conv1d (Conv1D)       multiple     4096
 dense (Dense)         multiple     5200
 gru (GRU)             multiple     51264
 conv1d_1 (Conv1D)     multiple     8536
 gru_1 (GRU)           multiple     40800
 dense_1 (Dense)       multiple     5184
 conv1d_transpose      multiple     4096
===========================================
Total params: 119,176
Trainable params: 119,176
Non-trainable params: 0        
