Part 1 - Generalized Architecture for LLM API's in Client Applications

The diagram below presents a generalized inference architecture for integrating Large Language Model (LLM) APIs into a client application. The primary motivation is to establish a repeatable, component-based design that remains agnostic to the underlying LLM use cases.

The architecture is divided into two parts:

  • Inference lane
  • Training lane

This article focuses on the Inference lane.

Generalized Inference LLM architecture


Client Application Interaction

The client app communicates with the Python-based backend either in real time or asynchronously via Kafka topics. Asynchronous processing is typically required for batch-based inference, such as computing risk levels for all patients in a nightly job and alerting the APCs when patients fall into the 'Higher' risk bracket. The client app is responsible for sending a request that contains at least two things (a minimal payload sketch follows the list):

  • The LLM request - comprising the use case name
  • The prompt variables - derived from business objects
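
As a minimal sketch, assuming hypothetical field names, topic name, and the kafka-python client (none of which are prescribed by the architecture), a batch-mode request published by the client app could look like this:

```python
# Minimal sketch of a client request for the async (Kafka) path.
# Field names, topic name, and broker address are illustrative assumptions.
import json
from kafka import KafkaProducer  # kafka-python client, assumed transport

request = {
    "use_case": "patient_risk_level",      # the LLM request: use case name
    "prompt_variables": {                   # derived from business objects
        "patient_id": "P-1042",
        "recent_vitals": {"bp": "150/95", "hr": 92},
        "medications": ["metformin", "lisinopril"],
    },
    "mode": "async",                        # nightly batch vs. real-time call
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("llm-inference-requests", request)
producer.flush()
```

The same payload shape can be posted to the synchronous API when a real-time answer is needed; only the transport changes.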

Backend Connectivity

The backend connects to various layers as described below:

  1. LLM API Layer: This layer abstracts connections to different LLMs, such as GPT-3.5/4, Llama 2, or Mistral models, based on specific requirements.
  2. Rate limiter: Most LLM APIs impose limits on the number of requests per minute and per day, so client apps need to be considerate of how often the LLMs are hit in a given time frame.
  3. Postgres DB: Holds the LLM JSON configuration for a specific use case. The configuration includes the LLM source, prompt parameters, request-to-prompt variable mapping, RAG (Retrieval-Augmented Generation via vector DBs) index, sync/async API details, etc. A sketch of such a configuration appears after this list.
  4. Connections: Most LLMs require a secret key that needs to be stored in a Key Vault with controlled access.
  5. Vector DB: Depending on the use case, a vector DB like Pinecone is required when using RAG. Vector indices are created beforehand from the source data and referenced via the configuration.
  6. Redis Cache Layer: This layer can cache the LLM response when the same or a similar prompt is executed again, avoiding the round trip to the LLM servers. The flow sketch after this list shows how the cache and rate limiter fit into the inference path.
  7. Tools: Different extension toolsets like LangChain and the newly launched OpenAI Agent framework can be utilized based on the use case and configured accordingly.
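
To make item 3 concrete, here is a rough sketch of what the per-use-case JSON configuration held in Postgres might contain. Every field name below is an assumption for illustration, not a fixed schema:

```python
# Illustrative per-use-case configuration, as the backend might load it
# from a JSONB column in Postgres. Field names are assumptions.
USE_CASE_CONFIG = {
    "use_case": "patient_risk_level",
    "llm": {"provider": "openai", "model": "gpt-4", "temperature": 0.0},
    "prompt_template": (
        "Given vitals {recent_vitals} and medications {medications}, "
        "classify the patient's risk level as Low, Medium, or Higher."
    ),
    "request_to_prompt_mapping": {           # request field -> prompt variable
        "recent_vitals": "recent_vitals",
        "medications": "medications",
    },
    "rag": {"enabled": True, "vector_db": "pinecone", "index": "clinical-guidelines"},
    "api": {"mode": "async", "timeout_seconds": 30},
}
```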
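
Tying the layers together, the backend's inference path can be sketched as: check the Redis cache, enforce the rate limit, optionally pull RAG context from the vector DB, then call the configured LLM. The helpers call_llm and query_vector_db are hypothetical placeholders, and the quota and key names are assumptions:

```python
# Simplified inference path: cache -> rate limit -> optional RAG -> LLM call.
# call_llm and query_vector_db are hypothetical helpers; limits and key
# names are illustrative assumptions.
import hashlib
import time

import redis  # redis-py, assumed client for the cache layer

cache = redis.Redis(host="localhost", port=6379)

def run_inference(config: dict, prompt_variables: dict) -> str:
    prompt = config["prompt_template"].format(**prompt_variables)

    # Redis cache layer: skip the LLM round trip for an identical prompt.
    cache_key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    # Rate limiter: crude fixed-window counter per model (requests/minute).
    window = f"rate:{config['llm']['model']}:{int(time.time() // 60)}"
    if cache.incr(window) > 60:              # assumed 60 requests/min quota
        raise RuntimeError("Rate limit reached; queue the request and retry later")
    cache.expire(window, 60)

    # Vector DB / RAG: fetch supporting context from the configured index.
    if config["rag"]["enabled"]:
        context = query_vector_db(config["rag"]["index"], prompt)
        prompt = f"{context}\n\n{prompt}"

    # LLM API layer: dispatch to the provider/model named in the config.
    response = call_llm(config["llm"], prompt)

    cache.set(cache_key, response, ex=3600)  # cache the response for an hour
    return response
```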

General Inference Architecture Considerations

Any general inference architecture for LLMs must account for the 3 Rs:

  • Reliability: How accurate are the responses from the LLM for the use case?
  • Repeatability: Are responses idempotent?
  • Responsiveness: Can we impose upper/lower bounds on response timings? (A timeout sketch follows this list.)
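
For the responsiveness point, one lightweight way to impose an upper bound is to wrap the LLM call in a hard deadline and return a deterministic fallback when it is exceeded. The sketch below reuses the hypothetical call_llm helper from the flow above; the deadline value and fallback policy are assumptions:

```python
# Sketch of bounding response time with a hard deadline around the LLM call.
# call_llm is the same hypothetical helper as above.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool so a slow call does not block the caller

def call_llm_with_deadline(llm_config: dict, prompt: str, deadline_s: float = 10.0) -> str:
    future = _pool.submit(call_llm, llm_config, prompt)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        # Upper bound exceeded: surface a fallback instead of blocking the client.
        return "UNAVAILABLE: model did not respond within the deadline"
```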

In the next part, we will look at the Training lane in more detail.
