Embedded Development: Improving Embedded Artificial Intelligence Performance with Edge AI Processors
With the rapid growth of artificial intelligence, embedded AI solutions have become practical for applications with relatively modest requirements. But for workloads that must handle video frames and images at resolutions up to 4Kp60, traditional solutions built on a fixed platform can no longer keep up. This article shows how combining Kinara's accelerators with NXP processors delivers the edge AI performance needed to process multiple smart-camera streams in parallel at high speed.
The arrival of artificial intelligence (AI) in embedded computing has led to a proliferation of potential solutions designed to deliver the high performance required to perform neural network inference on high-speed streaming video. While many reference benchmarks (such as ImageNet) use relatively low-resolution images and are thus achievable with a wide range of embedded AI solutions, many real-world applications in retail, healthcare, security and industrial control require video frames and images to be processed at resolutions of 4Kp60 and higher.
Scalability is critical, but it is not always available from system-on-chip (SoC) platforms that offer only a fixed combination of host processor and neural accelerator. While such platforms usually provide a way to evaluate the performance of different neural network architectures during prototyping, these integrated implementations lack the granularity and scalability that real systems typically require. Industrial-grade AI applications therefore benefit from a more balanced architecture in which multiple heterogeneous processors (e.g., CPUs, GPUs) and accelerators work together in an integrated pipeline: one that not only performs inference on raw video frames, but also applies pre- and post-processing to optimize the overall result or transform processing formats, so that multiple types of cameras and sensors can be handled.
The classic deployment scenarios are smart cameras and edge AI devices. For the former, vision processing and neural network inference must be integrated onto the main camera board. The camera may also need to perform a number of other tasks, such as counting the number of people in a room while avoiding double counting a subject who leaves and re-enters the field of view. The smart camera must not only detect a person, but also re-identify that person from data the camera has already processed so that no one is counted twice. This requires a flexible image processing and inference pipeline in which applications can handle basic object recognition as well as more complex inference-based tasks such as re-identification.
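To make the re-identification step concrete, the following minimal Python sketch matches each newly detected person's embedding against a gallery of embeddings already seen, counting only genuinely new people. The model that produces the embeddings is assumed but not shown, and the similarity threshold is an illustrative value that would be tuned per model:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # illustrative value; tuned per model in practice

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class PersonGallery:
    """Keeps embeddings of people already counted so a person
    re-entering the field of view is not counted twice."""

    def __init__(self):
        self.embeddings: list[np.ndarray] = []

    def is_known(self, embedding: np.ndarray) -> bool:
        return any(cosine_similarity(embedding, e) > SIMILARITY_THRESHOLD
                   for e in self.embeddings)

    def count_if_new(self, embedding: np.ndarray) -> bool:
        """Store the embedding and return True only for a new person."""
        if self.is_known(embedding):
            return False
        self.embeddings.append(embedding)
        return True
```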
Building Smart Cameras and Edge AI Devices
Typically, in a smart camera design, the host processor converts the sensor input into a form suitable for inference, including scaling, cropping, and normalizing the data frames to make them suitable for high-throughput inference. A similar but more highly integrated use case is an edge AI device. This device needs to process inputs from multiple networked sensors and cameras, and therefore must be able to process multiple compressed (or encoded) video streams simultaneously. In this multi-camera scenario, the processing power must scale to handle the formatting, color space, and other transformations needed to perform inference, and to handle multiple parallel inferences.
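As an illustration of that host-side preparation, the minimal Python sketch below (using OpenCV and NumPy) crops, scales, and normalizes a frame into an inference-ready tensor. The input size, color order, and layout are illustrative assumptions rather than requirements of any particular model:

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray,
                     input_size: tuple[int, int] = (224, 224)) -> np.ndarray:
    """Typical host-side preprocessing before inference: crop a square
    region of interest, scale to the model's input resolution, and
    normalize pixel values. Sizes and ranges here are illustrative."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    crop = frame[top:top + side, left:left + side]           # center crop
    resized = cv2.resize(crop, input_size)                   # scale
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)           # color space
    normalized = rgb.astype(np.float32) / 255.0              # [0, 1] range
    return np.transpose(normalized, (2, 0, 1))[np.newaxis]   # NCHW, batch of 1
```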
While fixed SoC implementations can handle specific use cases, the need for scalability is turning attention toward platforms that can meet varied requirements and inherently support scaling and upgrades as customer needs change. It is therefore important to focus on platforms whose hardware functionality can be extended easily, so that when a device's requirements change, moving to a different hardware configuration does not force major code changes; few teams can afford the porting overhead that would imply.
Many developers have adopted embedded processing platforms from vendors such as NXP and Qualcomm because of the many options they offer in terms of performance, functionality and price. For example, the NXP i.MX application processor family meets a wide range of performance requirements. Unlike fixed SoC platforms, NXP's processor family benefits from the long-term vendor support and availability guarantees necessary for many embedded computing markets. Devices such as the i.MX 8M provide a good foundation for edge AI device requirements: built-in video decode acceleration allows a single processor to support four compressed 1080p video streams. By pairing the i.MX application processor with Kinara's Ara-1 accelerator, it is possible to run inference on multiple video streams or to handle complex models.
Running multiple models
Working alongside the main processor, each accelerator can run multiple AI models on each frame with zero switching time and zero model-load overhead, providing the ability to perform complex tasks in real time. Unlike some inference pipelines that rely on multi-frame batching for maximum throughput, the Ara-1 is optimized specifically for a batch size of 1 and for maximum responsiveness.
This means a smart camera design does not need to fall back on the main processor for re-identification algorithms while the accelerator is performing inference on another frame or a portion of a frame; both workloads can be offloaded to the Ara-1 to take advantage of its higher speed. Where more performance is needed, such as in edge AI devices where multiple applications may all need to perform inference tasks, multiple accelerators can be used in parallel.
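A sketch of how such a two-model, per-frame pipeline might be structured is shown below. The `Accelerator` class and its `run()` method are hypothetical placeholders standing in for a vendor runtime, not Kinara's actual API:

```python
# Sketch of a per-frame pipeline running two models back to back on one
# accelerator. The Accelerator class and its run() method are hypothetical
# placeholders, not Kinara's actual runtime API.

class Accelerator:
    def run(self, model_name: str, tensor):
        """Placeholder for submitting one inference job to the device."""
        raise NotImplementedError("bind to the vendor runtime here")

def process_frame(accel, frame, gallery):
    # Model 1: detect people; each detection is (top, left, bottom, right).
    for (top, left, bottom, right) in accel.run("person_detector", frame):
        crop = frame[top:bottom, left:right]
        # Model 2: compute a re-ID embedding for the crop from the same frame.
        embedding = accel.run("reid_embedder", crop)
        if gallery.count_if_new(embedding):   # PersonGallery from earlier
            pass  # update the room's occupancy count here
```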
Higher scalability can be achieved not only by supporting chip-down integration on the smart camera or device PCB, but also by supporting plug-in upgrades. For chip-down integration, the Ara-1 supports the industry-standard, high-bandwidth PCIe interface for easy connection to host processors with PCIe Gen 3 support. The second integration path uses a module that plugs directly into an upgradeable motherboard, leveraging the PCIe interface and providing the ability to handle up to 16 camera inputs. For some systems, and for prototypes using off-the-shelf hardware, a third option uses the Ara-1's built-in USB 3.2 support: with a simple cable connection, AI algorithms can be tested on a laptop, production can begin with a hardware evaluation kit, or an existing system can simply be upgraded.
Software infrastructure for seamless transition
Developers can choose from a variety of approaches to simplify the integration of accelerators with processors and their associated software stacks. For model deployment and management, the runtime exposes a C++ API and the increasingly popular Python application programming interface (API), running under Linux on Arm or Windows on x86. Kinara's runtime API supports a variety of commands, including loading and unloading models, passing model input, receiving inference results, and controlling both inference and the hardware devices.
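In generic terms, that lifecycle might look like the sketch below. The `runtime` object and its methods are illustrative stand-ins for a vendor runtime binding, not Kinara's actual API:

```python
import numpy as np

def run_once(runtime, model_path: str, frame: np.ndarray):
    """Generic model-lifecycle flow: load, infer, unload.
    The runtime object's methods are illustrative placeholders."""
    model = runtime.load_model(model_path)   # load a compiled model
    output = model.infer(frame)              # pass input, receive results
    runtime.unload_model(model)              # release device resources
    return output
```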
The GStreamer environment provides an alternative way to access the accelerator's performance. As a library designed for building computation graphs of media-processing components, GStreamer makes it easy to implement filter pipelines that can be embedded in more complex applications capable of reacting to state changes in incoming video and sensor feeds.
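For example, a minimal GStreamer pipeline built in Python can capture a camera feed, convert its color space, and hand raw frames to the application through an `appsink`. The elements below are standard GStreamer plugins; a vendor-specific inference element or application callback would take over at the `appsink` stage:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Capture from a V4L2 camera, convert to RGB, and expose raw frames to
# the application via appsink.
pipeline = Gst.parse_launch(
    "v4l2src ! videoconvert ! video/x-raw,format=RGB ! "
    "appsink name=sink emit-signals=true max-buffers=1 drop=true"
)

def on_new_sample(sink):
    sample = sink.emit("pull-sample")   # one decoded RGB frame
    # map the buffer to memory and submit it for inference here
    return Gst.FlowReturn.OK

pipeline.get_by_name("sink").connect("new-sample", on_new_sample)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()                   # keep the pipeline running
```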
For AI inference, SDKs such as Kinara's can import trained models in many formats, including TensorFlow, PyTorch, ONNX, Caffe2, and MXNet, and directly support hundreds of models such as YOLO, TFPose, EfficientNet, and transformer networks. They thus provide a complete environment for optimizing performance: leveraging quantization, using automatic tuning to ensure model accuracy is maintained, and scheduling execution at runtime. With such a platform it is possible to gain insight into model execution to facilitate performance optimization and parameter tuning, and engineers can use accurate simulators to evaluate performance before running on silicon.
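As an example of the model-import side, a trained PyTorch network can be exported to ONNX, one of the interchange formats listed above, before an SDK's tools quantize and schedule it for the accelerator. The network chosen here is just a stand-in:

```python
import torch
import torchvision

# Export a PyTorch model to ONNX for downstream compilation.
model = torchvision.models.efficientnet_b0(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # batch of 1, as deployed
torch.onnx.export(model, dummy_input, "efficientnet_b0.onnx",
                  input_names=["image"], output_names=["logits"])
```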
In summary, as AI becomes an increasingly integral part of embedded systems, it is important to be able to integrate inference capabilities into a wide range of platforms to meet evolving needs. This means being able to deploy flexible accelerators with associated SDKs that allow customers to combine advanced AI acceleration with existing or new embedded systems.
Editor: Jimmy Zhang