free trial: integrate NN processing in MCU with 2 lines of C code

Trying is believing. In this post, I will let everyone try bringing my example NN processing into your own embedded application with just 2 lines of C code, without compromising efficiency of course.

With the most popular embedded NN deployment framework being C++ only and inherited from a framework meant for much more powerful systems, there is a natural hurdle in front of embedded NN deployment and system integration. This hurdle has proven costly, in both engineering effort and CPU cycles. I would like to show a different possibility, one that not only provides superior efficiency, but also points in a direction where embedded ML might be able to play catch-up.


Trial information

  • The model: GRU (Gated Recurrent Unit) based general speech noise reduction, 119.2k parameters, model summary in appendix
  • Input/output: a frame of 64 PCM samples, Q15 format, 16 kHz sample rate, input max amplitude should be > -6 dBFS
  • Processing precision: 16-bit
  • Demo constraint: processing time limited to 120 sec
  • Target platform: ARM Cortex-M7
  • Test environment: ST STM32H7A3, arm-none-eabi-gcc 10.3


Preparation

  • Download and unzip the library (provided for evaluation purposes only)
  • Place the .h file in the project include folder
  • Place the .a file in the library folder and pass the library name “model_processing” to the linker, using the “-L” & “-l” flags or the STM32CubeIDE project settings. Note the “lib” prefix and “.a” suffix are added automatically.

The two lines of code

Include the header

#include "model_frame_proc.h"        

then call the model to process a frame

model_frame_proc( p_data, p_temp_buf );          


Efficiency

Memory usage: 248 kB code + data, 2 kB scratch buffer

CPU usage: 86 MHz when processing a 16 kHz stream in realtime, equivalent to 344k cycles per inference

Test environment: STM32H7A3, D/I-cache enabled, -Ofast compiler flag

?

How

There is no magic:

  • The silky integration comes from the same concept as Rust: build all dependencies into the static library.
  • The efficiency is achieved by thoughtfully translating NN layers into the most basic operations understood by the compiler.
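To illustrate what "translating NN layers into the most basic operations" could look like, here is a sketch of a 16-bit fixed-point Dense layer lowered to a plain multiply-accumulate loop the compiler can vectorize for Cortex-M7. This is not the library's actual implementation; the function name, Q-format choices, and layout are illustrative assumptions.

```c
#include <stdint.h>

/* One possible lowering of a Dense layer in 16-bit fixed point:
 * Q15 inputs and weights, Q30 accumulation, saturate back to Q15.
 * Illustrative only; names and shapes are made up for this sketch. */
void dense_q15(const int16_t *in, int n_in,
               const int16_t *weights,   /* [n_out][n_in], Q15      */
               const int32_t *bias,      /* Q30 accumulator bias    */
               int16_t *out, int n_out)
{
    for (int o = 0; o < n_out; o++) {
        int32_t acc = bias[o];
        const int16_t *w = &weights[o * n_in];
        for (int i = 0; i < n_in; i++)
            acc += (int32_t)in[i] * w[i];   /* Q15 x Q15 -> Q30 MAC */
        acc >>= 15;                          /* back to Q15          */
        if (acc >  32767) acc =  32767;      /* saturate             */
        if (acc < -32768) acc = -32768;
        out[o] = (int16_t)acc;
    }
}
```

Nothing here requires C++, a runtime, or an interpreter: just arrays and loops, which is why both the binary footprint and the cycle count stay small.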


I hope this shows a different perspective and could help drive embedded ML forward. If you think this could help your project or algorithm, get in touch.



Appendix:

Model summary
___________________________________________
 Layer (type)      Output Shape     Param #
===========================================
 multiply (Multiply)   multiple     0
 conv1d (Conv1D)       multiple     4096
 dense (Dense)         multiple     5200
 gru (GRU)             multiple     51264
 conv1d_1 (Conv1D)     multiple     8536
 gru_1 (GRU)           multiple     40800
 dense_1 (Dense)       multiple     5184
 conv1d_transpose      multiple     4096
===========================================
Total params: 119,176
Trainable params: 119,176
Non-trainable params: 0        
