Part 1 - Generalized Architecture for LLM API's in Client Applications

The diagram below presents a generalized inference architecture for integrating Large Language Model (LLM) APIs into a client application. The primary motivation is to establish a repeatable, component-based design that remains agnostic to the underlying LLM use cases.

The architecture is divided into two parts:

  • Inference lane
  • Training lane

This article focuses on the Inference lane.

Generalized Inference LLM architecture


Client Application Interaction

The client app communicates with the Python-based backend either in real time or asynchronously via Kafka topics. Asynchronous processing is typically required for batch-based inference, such as computing risk levels for all patients in a nightly job and alerting the APCs when patients fall into the 'Higher' risk bracket. The client app is responsible for sending a request that contains at least two things (a minimal payload sketch follows the list):

  • The LLM request - comprising the use case name
  • The prompt variables - derived from business objects
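
As a minimal sketch, assuming hypothetical field names, topic name, and the kafka-python client (none of which are prescribed by the architecture), a batch-mode request published by the client app could look like this:

```python
# Minimal sketch of a client request for the async (Kafka) path.
# Field names, topic name, and broker address are illustrative assumptions.
import json
from kafka import KafkaProducer  # kafka-python client, assumed transport

request = {
    "use_case": "patient_risk_level",      # the LLM request: use case name
    "prompt_variables": {                   # derived from business objects
        "patient_id": "P-1042",
        "recent_vitals": {"bp": "150/95", "hr": 92},
        "medications": ["metformin", "lisinopril"],
    },
    "mode": "async",                        # nightly batch vs. real-time call
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("llm-inference-requests", request)
producer.flush()
```

The same payload shape can be posted to the synchronous API when a real-time answer is needed; only the transport changes.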

Backend Connectivity

The backend connects to various layers as described below:

  1. LLM API Layer: This layer abstracts connections to different LLMs, such as GPT-3.5/4, Llama 2, or Mistral models, based on specific requirements.
  2. Rate limiter: Most LLM APIs impose limits on the number of requests per minute and per day, so client apps need to be considerate of how often the LLMs are hit in a given time frame.
  3. Postgres DB: Holds the LLM JSON configuration for a specific use case. The configuration includes the LLM source, prompt parameters, request-to-prompt variable mapping, RAG (Retrieval-Augmented Generation via vector DBs) index, sync/async API details, etc. A sketch of such a configuration appears after this list.
  4. Connections: Most LLMs require a secret key that needs to be stored in a Key Vault with controlled access.
  5. Vector DB: Depending on the use case, a vector DB like Pinecone is required when using RAG. Vector indices are created beforehand from the source data and referenced via the configuration.
  6. Redis Cache Layer: This layer can cache the LLM response when the same or a similar prompt is executed again, avoiding the round trip to the LLM servers. The flow sketch after this list shows how the cache and rate limiter fit into the inference path.
  7. Tools: Different extension toolsets like LangChain and the newly launched OpenAI Agent framework can be utilized based on the use case and configured accordingly.
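
To make item 3 concrete, here is a rough sketch of what the per-use-case JSON configuration held in Postgres might contain. Every field name below is an assumption for illustration, not a fixed schema:

```python
# Illustrative per-use-case configuration, as the backend might load it
# from a JSONB column in Postgres. Field names are assumptions.
USE_CASE_CONFIG = {
    "use_case": "patient_risk_level",
    "llm": {"provider": "openai", "model": "gpt-4", "temperature": 0.0},
    "prompt_template": (
        "Given vitals {recent_vitals} and medications {medications}, "
        "classify the patient's risk level as Low, Medium, or Higher."
    ),
    "request_to_prompt_mapping": {           # request field -> prompt variable
        "recent_vitals": "recent_vitals",
        "medications": "medications",
    },
    "rag": {"enabled": True, "vector_db": "pinecone", "index": "clinical-guidelines"},
    "api": {"mode": "async", "timeout_seconds": 30},
}
```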
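
Tying the layers together, the backend's inference path can be sketched as: check the Redis cache, enforce the rate limit, optionally pull RAG context from the vector DB, then call the configured LLM. The helpers call_llm and query_vector_db are hypothetical placeholders, and the quota and key names are assumptions:

```python
# Simplified inference path: cache -> rate limit -> optional RAG -> LLM call.
# call_llm and query_vector_db are hypothetical helpers; limits and key
# names are illustrative assumptions.
import hashlib
import time

import redis  # redis-py, assumed client for the cache layer

cache = redis.Redis(host="localhost", port=6379)

def run_inference(config: dict, prompt_variables: dict) -> str:
    prompt = config["prompt_template"].format(**prompt_variables)

    # Redis cache layer: skip the LLM round trip for an identical prompt.
    cache_key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    # Rate limiter: crude fixed-window counter per model (requests/minute).
    window = f"rate:{config['llm']['model']}:{int(time.time() // 60)}"
    if cache.incr(window) > 60:              # assumed 60 requests/min quota
        raise RuntimeError("Rate limit reached; queue the request and retry later")
    cache.expire(window, 60)

    # Vector DB / RAG: fetch supporting context from the configured index.
    if config["rag"]["enabled"]:
        context = query_vector_db(config["rag"]["index"], prompt)
        prompt = f"{context}\n\n{prompt}"

    # LLM API layer: dispatch to the provider/model named in the config.
    response = call_llm(config["llm"], prompt)

    cache.set(cache_key, response, ex=3600)  # cache the response for an hour
    return response
```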

General Inference Architecture Considerations

Any general inference architecture for LLMs must account for the 3 Rs:

  • Reliability: How accurate are the responses from the LLM for the use case?
  • Repeatability: Are responses idempotent?
  • Responsiveness: Can we impose upper/lower bounds on response timings? (A timeout sketch follows this list.)
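
For the responsiveness point, one lightweight way to impose an upper bound is to wrap the LLM call in a hard deadline and return a deterministic fallback when it is exceeded. The sketch below reuses the hypothetical call_llm helper from the flow above; the deadline value and fallback policy are assumptions:

```python
# Sketch of bounding response time with a hard deadline around the LLM call.
# call_llm is the same hypothetical helper as above.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool so a slow call does not block the caller

def call_llm_with_deadline(llm_config: dict, prompt: str, deadline_s: float = 10.0) -> str:
    future = _pool.submit(call_llm, llm_config, prompt)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        # Upper bound exceeded: surface a fallback instead of blocking the client.
        return "UNAVAILABLE: model did not respond within the deadline"
```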

In the next part, we will look at the Training lane in more detail.
