Hardware Evaluation and Real-Time Computer Vision + ML on the Edge
Hardware evaluation for fast-FPS video processing: comparing semantic segmentation on the Jetson Nano and Xavier
How to increase the speed of deep learning models for computer vision applications
Machine learning on edge devices requires tailored hardware/software solutions.
Several methods for increasing speed and reducing memory usage are: ensemble methods, distributed machine learning with a load-balancing strategy, low-rank matrix factorization (LRMF), compact convolutional filters (video/CNN), pruning, quantization, knowledge distillation, the Neural Networks Compression Framework (NNCF), Binarized Neural Networks (BNNs), Apache TVM, etc.
Hardware:
RISC-V, FPGA, CORE-V, eFPGA, Raspberry Pi 3, Raspberry Pi 4, Intel® Neural Compute Stick 2, OpenCV AI Kit, Google Coral, NVIDIA Jetson Nano, NVIDIA Jetson AGX Xavier.
Categories of processing AI on Edge:
- Edge sensing with cloud AI: data may be filtered, compressed, or pre-processed at the edge
- Edge AI with cloud data upstream: AI inference runs at the edge for efficiency, latency, cost, or scale, and the results are sent to the cloud
- Edge AI real-time interactive systems: everything runs at the edge
- Edge + AI accelerators
- Edge + embedded AI chips
Benefits of ML + IoT = low latency, low bandwidth, low power consumption, high availability, and data privacy and security.
ML/AI Pipeline for Edge (a model-conversion sketch follows this list):
- Data acquisition
- Data processing
- Feature extraction
- Model training
- Model conversion
- Model deployment
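For example, the model conversion step often targets a lightweight edge runtime such as TensorFlow Lite. A minimal sketch, assuming a trained Keras SavedModel at a hypothetical path:

```python
import tensorflow as tf

# Convert a trained SavedModel (hypothetical "saved_model/" directory)
# into a TensorFlow Lite flatbuffer for deployment on an edge device.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
# Default optimizations also apply post-training quantization (see below).
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```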
Pruning
Compressing models by reducing their number of parameters is important in order to reduce memory, battery, and hardware consumption without sacrificing accuracy, to deploy lightweight models on device, and to guarantee privacy with private on-device computation. Pruning is also used to investigate the differences in learning dynamics between over-parameterized and under-parameterized networks, and to study the role of lucky sparse subnetworks and initializations ("lottery tickets") as a destructive neural architecture search technique. [1]
Tip: go for a bigger network with many layers and then prune it; this works much better and faster than starting small.
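A minimal sketch of magnitude pruning with PyTorch's built-in utilities; the toy model and the 30% unstructured L1 pruning ratio are illustrative assumptions, not a recipe from this article:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of first layer: {sparsity:.0%}")
```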
Quantization
Quantization leverages 8-bit integer (int8) instructions to reduce model size and run inference faster (reduced latency), and it can be the difference between a model achieving quality-of-service goals or even fitting into the resources available on a mobile device. Even when resources are not so constrained, it may enable you to deploy a larger and more accurate model. [2]
Tip: check your hardware/accelerator first, because some hardware does not support int8 inference.
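As a sketch, PyTorch's post-training dynamic quantization converts Linear-layer weights to int8 with a single call; the model below is a stand-in, and dynamic quantization is only one of several quantization modes:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Linear-layer weights are stored as int8; activations are quantized
# dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)  # same interface, smaller int8 weights
```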
Ensemble methods:
Ensembles offer a compact, feature-based representation, which makes model size easy to reduce, including post-training.
Bagging is a type of ensemble method where multiple models are trained in parallel on subsampled datasets (reducing error due to variance), and the outputs of the many models are combined into a single classification. Each model sees only its own subsample of the dataset and may overfit it, but the combination still reaches good accuracy because each individual model's noise is cancelled out.
Boosting is a type of ensemble method where multiple models are trained in sequence, each improving upon the errors of the previous model (reducing error due to bias).
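A small bagging illustration with scikit-learn; the synthetic dataset and the choice of 25 estimators are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base tree is trained on a bootstrap subsample of the training data;
# predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=25, random_state=0)
bagging.fit(X_train, y_train)
print(f"Test accuracy: {bagging.score(X_test, y_test):.3f}")
```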
Distillation Techniques
Knowledge distillation is a method used to reduce the size of a model without losing too much of its predictive power: a small "student" model is trained to mimic a larger, already-trained "teacher" model.
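A minimal sketch of the standard distillation loss; the temperature T and weighting alpha are illustrative hyperparameters, and the teacher's logits are assumed to come from the larger trained model:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```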
Binarized Neural Networks (BNNs)
- XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks: https://arxiv.org/abs/1603.05279
- Binarized Neural Networks: https://papers.nips.cc/paper/6573-binarized-neural-networks
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients: https://arxiv.org/abs/1606.06160
Note that BNNs are not supported by GPU hardware such as the NVIDIA Jetson Nano.
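As a toy illustration (not the exact recipe of any paper above), BNN-style training binarizes weights with sign() in the forward pass and uses a straight-through estimator in the backward pass:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)  # weights constrained to {-1, +1} (0 maps to 0)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass gradients only where |w| <= 1.
        return grad_output * (w.abs() <= 1).float()

w = torch.randn(4, requires_grad=True)
BinarizeSTE.apply(w).sum().backward()
print(w.grad)
```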
Apache TVM: An End-to-End Machine Learning Compiler Framework for CPUs, GPUs, and Accelerators
Apache TVM is an open source machine learning compiler framework for CPUs, GPUs, and machine learning accelerators. It aims to enable machine learning engineers to optimize and run computations efficiently on any hardware backend.
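A hedged sketch of TVM's classic Relay workflow; API details vary between TVM versions, and the ONNX model file and input shape below are assumptions:

```python
import onnx
import tvm
from tvm import relay

# Import a model (hypothetical file) into TVM's Relay IR.
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}
)

# Compile for a CPU target; swap "llvm" for e.g. "cuda" on a GPU backend.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```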
Compact convolutional filters (Video/CNN)
This means designing special structural convolutional filters to save parameters: replacing over-parameterized filters with compact filters achieves an overall speedup while maintaining comparable accuracy.
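One concrete example of a compact filter design is the depthwise-separable convolution used by MobileNet-style networks; the channel counts below are arbitrary:

```python
import torch.nn as nn

# Standard 3x3 convolution: 64*128*3*3 + 128 ≈ 74k parameters.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Depthwise-separable replacement: ≈ 9k parameters for comparable capacity.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                       # pointwise 1x1
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard), n_params(separable))
```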
Distributed machine learning and load balancing strategy
Run models across all available processing units (CPU, GPU, DSP, and AI chips together) to enhance inference performance.
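For instance, TensorFlow Lite can split work between the CPU and an accelerator through delegates; the Edge TPU delegate library named below is platform-specific and shown only as an example:

```python
import tflite_runtime.interpreter as tflite

# Ops supported by the delegate run on the accelerator (here, a Coral
# Edge TPU); unsupported ops fall back to the CPU.
interpreter = tflite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
```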
Frameworks / Libraries for Edge/IoT
- Apache TVM
- Intel® Distribution of OpenVINO Toolkit
- CoreML
- ML Kit
- FRITZ
- MediaPipe
- Docker for embedded
- tinyML
- NVIDIA TensorRT
- TensorFlow Lite
- TensorFlow.js
- PyTorch Lightning
- PyTorch Mobile
- Code size compiler optimizations and techniques for embedded systems
- Useful libraries for programming: FFmpeg, GStreamer, Celery
- GPU libraries for Python: PyCUDA, NumbaPro, PyOpenCL, CuPy
Other techniques
Low rank matrix factorization (LRMF)
Knowledge distillation
Neural Networks Compression Framework (NNCF) [3]