Part 1 - Generalized Architecture for LLM APIs in Client Applications
The diagram below presents a generalized inference architecture for integrating Large Language Model (LLM) APIs into a client application. The primary motivation is to establish a repeatable, component-based design that remains agnostic to the underlying LLM use case.
The architecture is divided into two parts: the Inference lane and the Training lane.
This article focuses on the Inference lane.
Client Application Interaction
The client app communicates with the Python-based backend either in real time or asynchronously via Kafka topics. Asynchronous processing is typically required for batch-based inference, such as computing risk levels for all patients in a nightly job and alerting the APCs when a patient falls into the 'Higher' risk bracket. The client app is responsible for sending a request that contains at least two things.
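As a concrete illustration of the asynchronous path, the sketch below shows how a client app might publish an inference request to a Kafka topic using the kafka-python client. The broker address, topic name, and payload fields are illustrative assumptions, not a prescribed contract.

```python
import json

from kafka import KafkaProducer  # kafka-python client

# Producer used by the client app to publish asynchronous inference requests.
# The broker address and topic name are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One message per patient for the nightly risk-scoring batch job; the
# field names are hypothetical, chosen only to illustrate the shape.
request = {
    "request_id": "risk-batch-2024-07-01-P1001",  # correlates request and response
    "use_case": "patient_risk_level",             # tells the backend which prompt/model applies
    "payload": {"patient_id": "P-1001"},
}
producer.send("llm-inference-requests", value=request)
producer.flush()  # block until the broker has acknowledged the message
```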
Backend Connectivity
The backend connects to various layers as described below:
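For the real-time path, a minimal sketch of what the Python-based backend's entry point could look like is shown below, assuming FastAPI as the web layer and a hypothetical call_llm helper standing in for whichever downstream LLM API layer the backend ultimately targets.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferenceRequest(BaseModel):
    request_id: str  # correlates request and response
    use_case: str    # selects the prompt template and model for this use case
    payload: dict    # use-case-specific inputs, e.g. a patient identifier

def call_llm(use_case: str, payload: dict) -> str:
    # Hypothetical helper: in a real backend this would route the request
    # through the downstream layers (prompt assembly, LLM provider, etc.).
    return f"stub completion for {use_case}"

@app.post("/infer")
def infer(req: InferenceRequest) -> dict:
    completion = call_llm(req.use_case, req.payload)
    return {"request_id": req.request_id, "result": completion}
```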
General Inference Architecture Considerations
Any general inference architecture for LLMs must account for the 3 Rs:
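One recurring operational concern in any such architecture is coping with transient failures and rate limits from the LLM provider. Below is a minimal retry-with-exponential-backoff sketch; the function name, attempt count, and delay values are illustrative assumptions rather than part of the reference architecture.

```python
import random
import time

def call_with_retries(call, max_attempts=4, base_delay=1.0):
    """Retry a flaky LLM API call with exponential backoff and jitter.

    `call` is any zero-argument callable; the defaults here are
    illustrative, not prescribed by the architecture.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # surface the error after the final attempt
            # Exponential backoff plus jitter so retrying clients do not
            # hammer the provider in lockstep.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```

A caller would wrap the provider call, e.g. call_with_retries(lambda: call_llm("patient_risk_level", {"patient_id": "P-1001"})), keeping the retry policy in one place rather than scattered across use cases.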
In the next part, we will look at the Training lane in more detail.