Stateful and Responsible AI Agents

Introduction to AI Agents

The discussion around ChatGPT has now evolved into AutoGPT. While ChatGPT is primarily a chatbot that can generate text responses, AutoGPT is a more powerful and autonomous AI agent that can execute complex tasks, e.g., making a sale, planning a trip, booking a flight, engaging a contractor for a house job, or ordering a pizza.

Bill Gates recently envisioned a future where we would have an AI agent that is able to process and respond to natural language and accomplish a number of different tasks. Gates used planning a trip as an example.

Ordinarily, this would involve booking your hotel, flights, restaurants, etc. on your own. But an AI agent would be able to use its knowledge of your preferences to book and purchase those things on your behalf.

AI agents [1] follow a long history of research on multi-agent systems (MAS) [2], especially goal-oriented agents [3]. However, designing and deploying AI agents remains challenging in practice. In this article, we focus primarily on two aspects of AI agent platforms:

  • given the complex and long-running nature of AI agents, we discuss approaches to ensure a reliable and stateful AI agent execution.
  • adding the Responsible AI dimension to AI agents. We highlight issues specific to AI agents and propose approaches to establish an integrated AI agent platform governed by Responsible AI practices.

AI Agent Platform Reference Architecture

In this section, we focus on identifying the key components of a reference AI agent platform:

  • Agent marketplace
  • Orchestration layer
  • Integration layer
  • Shared memory layer
  • Governance layer, including explainability, privacy, security, etc.

Fig 1: AI agent platform reference architecture

Given a user task, the goal of an AI agent platform is to identify (or compose) an agent (or group of agents) capable of executing the given task. So the first component we need is an orchestration layer capable of decomposing a task into sub-tasks, with execution of the respective agents coordinated by an orchestration engine.

A high-level approach to solving such complex tasks involves: (a) decomposition of the given complex task into (a hierarchy or workflow of) simple tasks, followed by (b) composition of agents able to execute the simple(r) tasks. This can be achieved in a dynamic or static manner. In the dynamic approach, given a complex user task, the system comes up with a plan to fulfill the request depending on the capabilities of available agents at run-time. In the static approach, given a set of agents, composite agents are defined manually at design-time combining their capabilities.
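To make this concrete, below is a minimal Python sketch of the dynamic approach; the Agent abstraction, capability strings, and planner logic are illustrative assumptions rather than any specific framework's API.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    capabilities: set  # task types this agent can execute

class Orchestrator:
    def __init__(self, registry):
        self.registry = registry  # agents available at run-time

    def plan(self, sub_tasks):
        """Dynamic composition: map each sub-task to a capable agent."""
        plan = []
        for task in sub_tasks:
            agent = next((a for a in self.registry if task in a.capabilities), None)
            if agent is None:
                raise LookupError(f"No registered agent can execute '{task}'")
            plan.append((task, agent.name))
        return plan

# "Plan a trip" decomposed into simpler sub-tasks at run-time
registry = [Agent("FlightBot", {"book_flight"}), Agent("HotelBot", {"book_hotel"})]
print(Orchestrator(registry).plan(["book_flight", "book_hotel"]))
# [('book_flight', 'FlightBot'), ('book_hotel', 'HotelBot')]
```

In the static approach, the `plan` above would instead be authored manually at design-time and only validated against the registry.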

This implies the existence of an agent marketplace / registry of agents, with a well-defined description of each agent's capabilities and constraints. For example, let us consider a house painting agent whose services can be reserved online (via credit card). Here, the fact that the user requires a valid credit card is a constraint, and the fact that the user's house will be painted within a certain timeframe is a capability. In addition, we also need to consider any constraints of the agent during the actual execution phase, e.g., the fact that the agent can only provide the service on weekdays (and not on weekends). In general, constraints refer to the conditions that need to be satisfied to initiate an execution, while capabilities reflect the expected outcome after the execution terminates. Refer to [4] for a detailed discussion of the discovery aspect of AI agents.
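As an illustration, a registry entry for the house painting agent might look like the following sketch; the schema and matchmaking logic are assumptions for illustration, not a standard format.

```python
painting_agent = {
    "name": "HousePaintingAgent",
    # Constraints: conditions that must hold to initiate an execution
    "constraints": {
        "payment": "valid_credit_card",
        "service_days": ["Mon", "Tue", "Wed", "Thu", "Fri"],  # weekdays only
    },
    # Capabilities: expected outcome after the execution terminates
    "capabilities": {
        "outcome": "house_painted",
        "timeframe_days": 14,
    },
}

def matches(agent, request):
    """Naive matchmaking: the requested outcome must be a capability,
    and the request must satisfy the agent's payment constraint."""
    return (agent["capabilities"]["outcome"] == request["outcome"]
            and agent["constraints"]["payment"] == request["payment"])

print(matches(painting_agent,
              {"outcome": "house_painted", "payment": "valid_credit_card"}))  # True
```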

Given the need to orchestrate multiple agents, we also need an integration layer supporting different agent interaction patterns, e.g., agent-to-agent APIs, an agent API providing output for human consumption, a human triggering an AI agent, and AI agent-to-agent interaction with a human in the loop. These integration patterns need to be supported by the underlying AgentOps platform.
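For example, here is a minimal sketch of one such pattern, AI agent-to-agent with a human in the loop; the callback-based interface is a hypothetical simplification.

```python
def agent_to_agent_with_hitl(producer, consumer, approve):
    """Producer agent output is routed to a consumer agent,
    but only after a human approval callback signs off."""
    draft = producer()                 # agent A produces an intermediate result
    if not approve(draft):             # human-in-the-loop checkpoint
        raise RuntimeError("Human reviewer rejected the intermediate output")
    return consumer(draft)             # agent B consumes the approved result

result = agent_to_agent_with_hitl(
    producer=lambda: "Itinerary: GVA -> LIS, 2 nights",
    consumer=lambda draft: f"Booked: {draft}",
    approve=lambda draft: "GVA" in draft,  # stand-in for an actual review UI
)
print(result)
```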

Andrew Ng recently talked about such agentic workflows from a performance perspective:

Today, a lot of LLM output is for human consumption. But in an agentic workflow, an LLM might be prompted repeatedly to reflect on and improve its output, use tools, plan and execute multiple steps, or implement multiple agents that collaborate. So, we might generate hundreds of thousands of tokens or more before showing any output to a user. This makes fast token generation very desirable and makes slower generation a bottleneck to taking better advantage of existing models.

To accommodate multiple long-running agents, we also need a shared memory layer enabling data transfer between agents and storing interaction data so that it can be used to personalize future interactions.
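A minimal sketch of such a shared memory layer, keyed by user so that interaction data written by one agent can personalize another agent's future interactions (the interface is an illustrative assumption):

```python
from collections import defaultdict

class SharedMemory:
    """Toy shared memory: agents write interaction records under a user key,
    and any agent in the composition can read them back later."""
    def __init__(self):
        self._store = defaultdict(list)

    def write(self, user_id, agent_name, record):
        self._store[user_id].append({"agent": agent_name, "record": record})

    def read(self, user_id):
        return list(self._store[user_id])

memory = SharedMemory()
memory.write("alice", "FlightBot", {"preferred_airline": "Swiss"})
print(memory.read("alice"))  # HotelBot could use this to personalize its booking
```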

Finally, the governance layer. We need to ensure that data shared by the user specific to a task, or user profile data that cuts across tasks, is only shared with the relevant agents (authentication and access control). We further consider the different Responsible AI dimensions, in terms of data quality, privacy, reproducibility and explainability, to enable a well-governed AI agent platform.
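As a simple illustration of the access control aspect, the sketch below scopes user data per task and per agent; the policy structure and field names are hypothetical.

```python
# Which agents may see which data fields, per task (illustrative policy)
ACCESS_POLICY = {
    ("plan_trip", "FlightBot"): {"passport_number", "travel_dates"},
    ("plan_trip", "HotelBot"): {"travel_dates"},
}

def share_with_agent(task, agent, user_data):
    """Return only the fields this agent is authorized to see for this task."""
    allowed = ACCESS_POLICY.get((task, agent), set())
    return {k: v for k, v in user_data.items() if k in allowed}

user_data = {"passport_number": "X123", "travel_dates": "2025-06", "salary": "..."}
print(share_with_agent("plan_trip", "HotelBot", user_data))
# {'travel_dates': '2025-06'} -- the salary field is never exposed
```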

Stateful Agent Monitoring

Stateful execution [5] is an inherent characteristic of any distributed systems platform, and can be considered a critical requirement to materialize the orchestration layer of an AI agent platform.

Given this, we envision that agent monitoring, together with failure recovery, will become increasingly critical as AI agent platforms become enterprise-ready and start supporting productionized deployments of AI agents.

However, monitoring AI agents (similar to monitoring large-scale distributed systems) is challenging because of the following reasons:

  • No global observer: Due to their distributed nature, we cannot assume the existence of an entity having visibility over the entire execution. In fact, due to their privacy and autonomy requirements, even the composite agent may not have visibility over the internal processing of its component agents.
  • Non-determinism: AI agents allow parallel composition of processes. Also, AI agents usually depend on external factors for their execution. As such, it may not be possible to predict their behavior before the actual execution. For example, whether a flight booking will succeed or not depends on the number of available seats (at the time of booking) and cannot be predicted in advance.
  • Communication delays: Communication delays make it impossible to record the states of all the involved agents instantaneously. For example, let us assume that agent A initiates an attempt to record the state of the composition. Then, by the time the request (to record its state) reaches agent B and B records its state, agent A’s state might have changed.
  • Dynamic configuration: The agents are selected incrementally as the execution progresses (dynamic binding). Thus, the “components” of the distributed system may not be known in advance.

To summarize, AgentOps monitoring is critical given the complexity and long-running nature of AI agents. We define agent monitoring as the ability to find out where in the process the execution is and whether any unanticipated glitches have appeared. We discuss the capabilities and limitations of acquiring agent execution snapshots with respect to answering the following types of queries (a code sketch follows the list):

  • Local queries: Queries which can be answered based on the local state information of an agent. For example, queries such as “What is the current state of agent A’s execution?” or “Has A reached a specific state?”. Local queries can be answered by directly querying the concerned agent provider.
  • Composite queries: Queries expressed over the states of several agents. We assume that any query related to the status of a composition is expressed as a conjunction of the states of individual agent executions. An example status query: “Have agents A, B and C reached states x, y and z respectively?” Such queries have been referred to as stable predicates in the literature. Stable predicates are predicates that do not become false once they have become true.
  • Historical queries: Queries related to the execution history of the composition. For example, “How many times have agents A and B been suspended?”. If the query is answered using an execution snapshot algorithm, then it needs to be mentioned that the results are with respect to a time in the past.
  • Relationship queries: Queries based on the relationship between states. For example, “What was the state of agent A when agent B was in state y?” Unfortunately, execution snapshot based algorithms do not guarantee answers to such queries; we would not be able to answer this query unless we have a snapshot that captures agent B while it was in state y. Such predicates have been referred to as unstable predicates in the literature. Unstable predicates keep alternating between true and false, so they are difficult to answer based on snapshot algorithms.
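The sketch below shows how local, composite and historical queries could be evaluated against a captured snapshot; the snapshot structure (per-agent current state plus transition history) is an assumption for illustration. As noted above, relationship queries are not guaranteed to be answerable this way.

```python
# Illustrative snapshot: per-agent current state plus logged transition history
snapshot = {
    "A": {"state": "x", "history": ["NE", "E", "S", "E", "x"]},
    "B": {"state": "y", "history": ["NE", "E", "y"]},
}

def local_query(agent, state):
    """Local query: has this agent reached the given state?"""
    return snapshot[agent]["state"] == state

def composite_query(conditions):
    """Composite query (stable predicate): conjunction over agents' states."""
    return all(local_query(agent, state) for agent, state in conditions.items())

def historical_query(agent, state="S"):
    """Historical query: e.g., how many times was the agent suspended?
    Answered with respect to a time in the past (the snapshot time)."""
    return snapshot[agent]["history"].count(state)

print(composite_query({"A": "x", "B": "y"}))  # True
print(historical_query("A"))                  # 1
```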

We outline the AI agent monitoring approach and solution architecture in the next section.

AI Agent Monitoring Architecture and Snapshot Algorithm

We assume the existence of a coordinator and a log manager corresponding to each agent, as shown in the figure below. We also assume that each agent is responsible for executing a single task / operation.

Fig 2: Agent monitoring infrastructure

The coordinator is responsible for all non-functional aspects related to the execution of the agent, such as monitoring, transactions, etc. The log manager logs information about any state transitions, as well as any messages sent or received by the agent. The state transitions and messages considered are outlined in the figure below (a code sketch of the transition table follows the list):

Fig 3: Agent execution lifecycle

  • Not?—?Executing (NE): The agent is waiting for an invocation.
  • Executing (E): On receiving an invocation message (IM), the agent changes its state from NE to E.
  • Suspended (S) and suspended by invoker (IS): An agent, in state E, may change its state to S due to an internal event (suspend) or to IS on the receipt of a suspend message (SM). Conversely, the transition from S to E occurs due to an internal event (resume) and from IS to E on receiving a resume message (RM).
  • Canceling (CI), canceling due to invoker (ICI) and canceled (C): An agent, in state E/S/IS, may change its state to CI due to an internal event (cancel) or to ICI on the receipt of a cancel message (CM). Once it finishes cancellation, it changes its state to C and sends a Canceled message (CedM) to its parent. Note that cancellation may require canceling the effects of some of its component agents.
  • Terminated (T) and compensating (CP): The agent changes its state to T once it has finished executing the operation. On termination, the agent sends a terminated message (TM) to its parent. An agent may be required to cancel an operation even after it has finished executing the operation (compensation). An agent, in state T, changes its state to CP on receiving the CM. Once it finishes compensation, it moves to C and sends a CedM to its parent agent.
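The lifecycle above can be captured as a simple transition table; the sketch below is a direct transcription of the transitions listed, with event handling simplified for illustration.

```python
# (current state, event/message) -> next state. Abbreviations as in the text.
TRANSITIONS = {
    ("NE", "IM"): "E",        # invocation message starts execution
    ("E", "suspend"): "S",    # internal suspend
    ("E", "SM"): "IS",        # suspended by invoker
    ("S", "resume"): "E",
    ("IS", "RM"): "E",
    ("E", "cancel"): "CI",  ("S", "cancel"): "CI",  ("IS", "cancel"): "CI",
    ("E", "CM"): "ICI",     ("S", "CM"): "ICI",     ("IS", "CM"): "ICI",
    ("CI", "done"): "C",      # finished canceling; agent sends CedM to parent
    ("ICI", "done"): "C",
    ("E", "complete"): "T",   # finished the operation; agent sends TM to parent
    ("T", "CM"): "CP",        # compensation after termination
    ("CP", "done"): "C",
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"Illegal transition: {event} in state {state}")

assert step(step("NE", "IM"), "complete") == "T"
```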

We assume that the composition schema (static composition) specifies a partial order for agent operations. We define the happened-before relation between agent operations as follows:

An operation a happened-before operation b (a → b) if and only if one of the following holds (a code sketch of the relation follows):

  1. there exists a control / data dependency between operations a and b such that a needs to terminate before b can start executing;
  2. there exists an operation c such that a → c and c → b.
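A small sketch of the relation, computed as the transitive closure of direct control / data dependencies; the dependency map is a hypothetical example.

```python
# DEPENDS[b] lists operations that must terminate before b can start (assumption).
DEPENDS = {"book_hotel": ["book_flight"], "order_taxi": ["book_hotel"]}

def happened_before(a, b):
    """True iff a -> b under the partial order induced by DEPENDS."""
    direct = DEPENDS.get(b, [])
    return a in direct or any(happened_before(a, c) for c in direct)

print(happened_before("book_flight", "order_taxi"))  # True, via book_hotel
```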

An operation, on failure, is retried with the same or different agents until it completes successfully (terminates). Note that each (retrial) attempt is considered as a new invocation and would be logged accordingly. Finally, to accommodate asynchronous communication, we assume the presence of input/output (I/O) queues. Basically, each agent has an I/O queue with respect to its parent and component agents, as shown in Fig. 2.

Given synchronized clocks and logging (as discussed above), a snapshot of the hierarchical composition at time t would consist of the logs of all the “relevant” agents until time t.

The relevant agents can be determined in a recursive manner (starting from the root agent) by considering the agents of the invoked operations recorded in the parent agent’s log until time t. If message timestamps are used then we need to consider the skew while recording the logs, i.e., if a parent agent’s log was recorded until time t then its component agents’ logs need to be recorded until (t + skew). The states of the I/O queues can be determined from the state transition model.
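A minimal sketch of this recursive log collection, assuming each agent's log is a list of timestamped entries and component agents are reachable from their parent (both are simplifying assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    """Illustrative agent with its own log and invoked component agents."""
    name: str
    log: list                              # (timestamp, entry) pairs
    components: list = field(default_factory=list)

def collect_snapshot(agent, t, skew=0.0):
    """Recursively gather the logs of all relevant agents until time t.
    Component agents' logs are read until (t + skew) to account for
    clock skew relative to the parent's log."""
    logs = {agent.name: [entry for ts, entry in agent.log if ts <= t]}
    for child in agent.components:
        logs.update(collect_snapshot(child, t + skew, skew))
    return logs

flight = AgentNode("FlightBot", [(1.0, "IM"), (2.5, "TM")])
trip = AgentNode("TripAgent", [(0.5, "IM"), (0.9, "invoke FlightBot")], [flight])
print(collect_snapshot(trip, t=2.0, skew=0.2))
# FlightBot's TM at 2.5 falls outside (t + skew) and is excluded
```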

Responsible AgentOps

The growing adoption of Generative AI, especially of Large Language Models (LLMs), has reignited the discussion around Responsible AI, to ensure that AI/ML systems are responsibly trained and deployed.

The list below summarizes the key challenges and solutions in implementing Responsible AI for AI agents, compared to:

  • ChatGPT-style LLM APIs
  • LLM fine-tuning: LLMs are generic in nature. To realize the full potential of LLMs for enterprises, they need to be contextualized with enterprise knowledge captured in documents, wikis, business processes, etc. This is achieved by fine-tuning an LLM with enterprise knowledge / embeddings to develop a context-specific LLM / Small Language Model (SLM).
  • Retrieval-Augmented Generation (RAG): Fine-tuning is a computationally intensive process. RAG provides a viable alternative by supplying additional context with the prompt, grounding the retrieval / responses in the given context. Since prompts can be relatively long, it is possible to embed enterprise context within the prompt (a minimal sketch follows the list).
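A minimal RAG sketch; the keyword retriever stands in for embedding-based search, and the prompt template is an illustrative assumption rather than a specific framework's API.

```python
# Toy enterprise knowledge base; a real system would use vector embeddings.
DOCS = [
    "Expense reports must be filed within 30 days of travel.",
    "House painting services are available on weekdays only.",
]

def retrieve(query, k=1):
    """Naive keyword-overlap retriever standing in for embedding search."""
    words = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def build_rag_prompt(query):
    """Ground the response in retrieved enterprise context via the prompt."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_rag_prompt("When can I book the house painting service?"))
```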


Fig 4: Responsible AI integrated with AgentOps

We expand on the above points in the rest of the article to enable an integrated AgentOps pipeline with responsible AI governance.

Data Consistency: The data used for training (especially fine-tuning) the LLM should be accurate and precise, meaning only data relevant to the specific use case should be used. For example, if the use case is to generate summaries of medical prescriptions, the model should be trained only on medical prescriptions and their corresponding summaries, not on other data such as diagnosis Q&A. Often, data pipelines need to be created to ingest the data and feed it to the LLMs. In such scenarios, extra caution needs to be exercised when consuming free-text fields, as these fields frequently hold inconsistent and incorrect data.

Bias/Fairness: With respect to model performance and reliability, it is difficult to control undesired biases in black-box LLMs, though they can be controlled to some extent by using uniform and unbiased data to fine-tune the LLMs and/or to contextualize the LLMs in a RAG architecture.

Accountability: To make LLMs more reliable, it is recommended to manually validate LLM outputs. Involving humans ensures that if an LLM hallucinates or provides a wrong response, a human can evaluate it and make the necessary corrections.

Hallucination: When using LLM APIs or orchestrating multiple AI agents, the likelihood of hallucination increases with the number of agents involved. The right prompts can help, but only to a limited extent. To further limit hallucinations, LLMs need to be fine-tuned with curated data and/or the search space of responses needs to be restricted to relevant and recent enterprise data.

Explainability: Explainability is an umbrella term for a range of tools, algorithms and methods that accompany AI model inferences with explanations. Chain of Thought (CoT) is a framework that exposes how an LLM is solving a problem. CoT can be implemented using two main approaches (a sketch follows the list):

  • User prompting: Here, the user's prompt provides the reasoning for how to approach a certain problem; the LLM then solves a similar problem using the same logic and returns the output along with its reasoning.
  • Automating CoT prompting: Manually handcrafting CoT examples can be time-consuming and lead to suboptimal solutions. Automatic CoT (Auto-CoT) can be leveraged to generate the reasoning chains automatically, eliminating human intervention. Auto-CoT basically relies on two steps: (1) question clustering: cluster the questions of a given dataset; (2) demonstration sampling: select a representative question from each cluster and generate its reasoning chain using zero-shot CoT. Auto-CoT works well for LLMs with approximately 100B parameters, but is less accurate for SLMs.
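A sketch of the user-prompting variant: the prompt supplies one worked reasoning chain that the model is asked to imitate. The prompt text is illustrative, and the LLM call is stubbed out.

```python
# Few-shot CoT via user prompting: one worked example with explicit reasoning,
# followed by the new problem. A real LLM API call would replace `complete`.
COT_PROMPT = """Q: A trip needs a flight (2h) and a transfer (45min). Total travel time?
A: Think step by step. 2h = 120 min. 120 + 45 = 165 min, i.e. 2h 45min.

Q: A trip needs a flight (3h) and a transfer (30min). Total travel time?
A: Think step by step."""

def complete(prompt):
    """Stand-in for an actual LLM API call (assumption for illustration)."""
    return "3h = 180 min. 180 + 30 = 210 min, i.e. 3h 30min."

print(COT_PROMPT + " " + complete(COT_PROMPT))
```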

Conclusion

Agentic AI is a disruptive technology, and there is currently a lot of interest and focus on making the underlying agent platforms ready for enterprise adoption. Towards this end, we outlined a reference architecture for AI agent platforms. We primarily focused on two aspects critical to enabling scalable and responsible adoption of AI agents: an AgentOps pipeline integrated with monitoring and Responsible AI practices.

From an agent monitoring perspective, we focused on the challenge of capturing the state of a (hierarchical) multi-agent system at any given point in time (snapshot). Snapshots usually reflect a state of a distributed system which “might have occurred”. Towards this end, we discussed the different types of agent execution related queries and showed how we can answer them using the captured snapshots.

To enable responsible deployment of agents, we highlighted the Responsible AI dimensions relevant to AI agents; and showed how they can be integrated in a seamless fashion with the underlying AgentOps pipelines. We believe that these will effectively future-proof Agentic AI investments and ensure that AI agents are able to cope as the AI agent platform and regulatory landscape evolves with time.

References

  1. J.S. Park, et al. Generative Agents: Interactive Simulacra of Human Behavior, 2023. https://arxiv.org/abs/2304.03442
  2. G. Weiss. Multiagent Systems. MIT Press, 2016. https://mitpress.mit.edu/9780262533874/multiagent-systems/
  3. A. Bordes, et al. Learning End-to-End Goal-Oriented Dialog, 2016. https://arxiv.org/abs/1605.07683
  4. D. Biswas. Constraints Enabled Autonomous Agent Marketplace: Discovery and Matchmaking. In Proc. of the 16th International Conference on Agents and Artificial Intelligence (ICAART), 2024. https://www.scitepress.org/Link.aspx?doi=10.5220/0012461700003636
  5. J. Lu, et al. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, 2024. https://machinelearning.apple.com/research/toolsandbox-stateful-conversational-llm-benchmark
