Agentic AI evaluation framework

Disclaimer: This article represents my thoughts alone as an individual and not as a representative of my employer or their business direction. This guidance is provided without any implicit or explicit promises or guarantees. Please exercise caution and do your own due diligence when using the content in this article. This article was envisioned and created by me, with refinements made using Generative AI.


What is Agentic AI?

There is a lot of hype around Agentic AI from foundation model makers and software makers alike. There are many definitions and interpretations of Agentic AI, so let's first define what it means for the purpose of this article: "AI-enabled applications that can take inputs in one or more forms and are built to achieve specific goals using a range of inputs and environmental variables."

In plain speak, Agentic AI refers to apps that are built to achieve their intended goals with minimal or no supervision (their own agency) across a large array of inputs and environmental variables. They do this by refining the inputs, reasoning over them, generating steps to solve the problem and executing the tasks.

Foundation model creators have touted this as the year of Agentic AI, and software makers are gearing up for a wave of customers wanting to build Agentic AI to replace their apps.

Why should businesses build Agentic AI?

With the ability to reason about a given goal or objective, break it into smaller steps, execute those steps and iterate toward a solution, Agentic AI presents an exciting opportunity for companies to tackle areas such as web scraping, knowledge search, monitoring, report generation from data sources and many more. These are typically areas where the type and number of inputs can be large, and where handling every combination of inputs and environmental variables with traditional programming becomes laborious and cumbersome. Such areas are well suited to generative AI based agents.

But should you though?

Not all use cases need to handle a large variety of input and environmental factor combinations, so not all of them require generative AI type processing. Many day-to-day tasks are deterministic, a.k.a. a well-defined set of inputs, processes and outputs. For these tasks, Agentic AI would be either a poor investment of money and resources or simply wasteful.

While it is tempting to rip and replace all the deterministic apps (a.k.a. traditional apps) in use today in favor of Agentic AI, businesses need to stop and take stock of the factors below:

  1. Cost of rebuilding
  2. Time to rebuild
  3. Effort to reintegrate with processes and retrain people
  4. Integration with existing ecosystem of apps and services
  5. Life stage of the app
  6. Return on Investment, and
  7. Upside to rebuilding - Does it bring value to my business and users?

Agentic AI evaluation framework

This article proposes an evaluation framework, based on the factors listed below, to determine whether you should replace existing apps with Agentic AI. The framework can and should be adapted and extended to suit specific business task or app requirements and the level of detail needed in the assessment.

  • Assessment scale - Each assessment dimension in this framework is rated on a scale of 1 to 100, with 1 being the most suited for an Agentic AI rebuild and 100 being the least suited.
  • Suggested weights - How much each assessment dimension contributes to the overall assessment.
  • Rating ranges - For each assessment dimension, I recommend using steps of 20 (1-20, 21-40 and so on) when rating a task or app.
  • Factor and indicator scores - We consider a list of factors that influence the decision to rebuild an app. These factors are scored between 1 and 5.

Core Assessment Dimensions

1. Input Variability (Suggested weight: 20%) (A)

Rate how variable or unpredictable the inputs to the system are:

--> 1-20: Highly variable inputs (unstructured text, complex user requests)

--> 21-40: Moderately variable inputs with some patterns

--> 41-60: Mix of structured and unstructured inputs

--> 61-80: Mostly structured inputs with occasional variations

--> 81-100: Completely structured, predictable inputs

2. Process Complexity (Suggested weight: 25%) (B)

Evaluate the complexity of decision-making and processing required:

--> 1-20: Complex reasoning with multiple decision paths

--> 21-40: Multiple interconnected processes with some ambiguity

--> 41-60: Mix of straightforward and complex processes

--> 61-80: Mostly straightforward processes with few variations

--> 81-100: Simple, linear processes with clear rules

3. Human Interaction Requirements (Suggested weight: 15%) (C)

Assess the level and complexity of human interaction needed:

--> 1-20: Continuous dialogue and complex interactions

--> 21-40: Regular interactions with context understanding

--> 41-60: Periodic interactions with clear objectives

--> 61-80: Minimal interactions with structured inputs

--> 81-100: No human interaction needed

4. External Dependencies (Suggested weight: 15%) (D)

Evaluate the system's reliance on external factors:

--> 1-20: Multiple dynamic external dependencies

--> 21-40: Several semi-stable external dependencies

--> 41-60: Mix of stable and dynamic dependencies

--> 61-80: Few, well-defined external dependencies

--> 81-100: No external dependencies

5. Error Tolerance (Suggested weight: 25%) (E)

Assess the system's tolerance for errors and variations:

--> 1-20: High tolerance, approximate results acceptable

--> 21-40: Moderate tolerance, minor variations acceptable

--> 41-60: Mixed requirements for precision

--> 61-80: Low tolerance, few errors acceptable

--> 81-100: Zero tolerance, exact results required
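
To make the 20-point rating ranges concrete, here is a minimal Python sketch that maps a 1 to 100 rating to the band it falls in. The function name `rating_band` is hypothetical and only illustrates the ranges used in the dimensions above; it is not a required part of the framework.

```python
def rating_band(rating: int) -> tuple:
    """Return the (low, high) 20-point band for a 1-100 rating, e.g. 35 -> (21, 40)."""
    if not 1 <= rating <= 100:
        raise ValueError("rating must be between 1 and 100")
    band_index = (rating - 1) // 20  # 0 for 1-20, 1 for 21-40, ..., 4 for 81-100
    low = band_index * 20 + 1
    return (low, low + 19)


# Hypothetical rating for illustration: 35 falls in the 21-40 band.
band = rating_band(35)
```

Ratings in the lower bands point toward Agentic AI, while ratings in the higher bands point toward keeping a traditional app.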

Implementation Considerations

Cost Factors Assessment

These factors capture the effort involved in converting the app or task, which translates to the time, person-hours, expertise and resources (e.g. GPUs) used in the process.

1. Development Complexity Score (1-5) (F): Scoring based on the rebuilding effort involved

--> 1: Simple conversion

--> 3: Moderate redesign

--> 5: Complete rebuild

2. Training Data Requirements (1-5) (G): Scoring based on the effort involved in training the Agentic AI

--> 1: Minimal data needed

--> 3: Moderate data collection required

--> 5: Extensive data collection needed

3. Integration Complexity (1-5) (H): Scoring based on the effort involved in integrating Agentic AI with other systems

--> 1: Standalone system

--> 3: Moderate integration needs

--> 5: Complex integration requirements

ROI Indicators

These indicators score the return on investment (ROI) of rebuilding the app/task.

1. Automation Potential (1-5) (I):

--> 1: Minimal automation gains

--> 3: Moderate efficiency improvements

--> 5: Significant automation potential


2. Maintenance Requirements (1-5) (J):

--> 1: Low maintenance

--> 3: Moderate maintenance

--> 5: High maintenance needs

Scoring Formula

This section combines the ratings above into scores that indicate the viability of converting the app/task.

Primary Score Calculation:

Prim. Score = (A * 0.20) + (B * 0.25) + (C * 0.15) + (D * 0.15) + (E * 0.25)
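
As an illustration, the weighted sum can be sketched in Python as follows. The names `WEIGHTS` and `primary_score` and the sample ratings are hypothetical, chosen only for demonstration; the weights themselves are the suggested ones from the Core Assessment Dimensions.

```python
# Suggested weights from the Core Assessment Dimensions (they sum to 1.0).
WEIGHTS = {"A": 0.20, "B": 0.25, "C": 0.15, "D": 0.15, "E": 0.25}


def primary_score(ratings: dict) -> float:
    """Weighted sum of the five dimension ratings, each on a 1-100 scale."""
    for key in WEIGHTS:
        if key not in ratings:
            raise ValueError(f"missing rating for dimension {key}")
        if not 1 <= ratings[key] <= 100:
            raise ValueError(f"rating {key} must be between 1 and 100")
    return sum(WEIGHTS[key] * ratings[key] for key in WEIGHTS)


# Hypothetical ratings for an imaginary app, for illustration only.
example = {"A": 15, "B": 30, "C": 50, "D": 40, "E": 25}
score = primary_score(example)  # 3 + 7.5 + 7.5 + 6 + 6.25 = 30.25
```

Because the suggested weights add up to 100%, the Primary Score stays on the same 1 to 100 scale as the individual ratings, with lower values pointing toward Agentic AI.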

Implementation Feasibility Score:

Impl. Score = (F + G + H) / 3
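
The Implementation Feasibility Score is a plain average of the three cost factors. A minimal Python sketch, with the hypothetical function name `implementation_score` and made-up factor scores, could look like this:

```python
def implementation_score(f: int, g: int, h: int) -> float:
    """Average of the three cost factors F, G and H, each scored from 1 to 5."""
    for name, value in (("F", f), ("G", g), ("H", h)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return (f + g + h) / 3


# Hypothetical scores: moderate redesign, minimal data, complex integration.
impl = implementation_score(f=3, g=1, h=5)  # (3 + 1 + 5) / 3 = 3.0
```

Given how F, G and H are scaled, a higher value means more effort to rebuild.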

ROI Score:

ROI Score = (I * 2 - J) / 3
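
The ROI formula counts automation potential twice and subtracts maintenance. The Python sketch below computes it and enumerates every valid pair of indicator scores to show the range the formula can produce. The function name `roi_score` and the variable names are hypothetical.

```python
from itertools import product


def roi_score(i: int, j: int) -> float:
    """ROI Score = (I * 2 - J) / 3, with I and J each scored from 1 to 5."""
    for name, value in (("I", i), ("J", j)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return (i * 2 - j) / 3


# Enumerate every valid (I, J) pair to see the range the formula can produce.
all_scores = [roi_score(i, j) for i, j in product(range(1, 6), repeat=2)]
lowest, highest = min(all_scores), max(all_scores)  # -1.0 and 3.0
```

With both indicators limited to 1 to 5, the formula as written yields values from -1 to 3, which is worth keeping in mind when reading the thresholds in the Interpretation Guide.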

Interpretation Guide

Below is the recommendation guidance based on the scores.

Implementation Recommendations:

1. If Primary Score < 40 and ROI Score > 3:

- Proceed with Agentic AI implementation

- Expected autonomy level: 70-90%

2. If Primary Score is 40-60 and ROI Score > 3:

- Consider a hybrid approach that mixes an agent with a traditional app

- Expected autonomy level: 40-70%

3. If Primary Score > 60 or ROI Score < 3:

- Maintain traditional application

- Expected autonomy level: < 40%
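
The three rules can be sketched as a small Python function. This is an illustration only: the function name `recommend` and the fallback label are hypothetical, and the rules are applied literally as listed. Note that with I and J each scored from 1 to 5, the ROI formula given earlier tops out at exactly 3, so the "greater than 3" branches only trigger if the ROI scale or threshold is adapted to your own assessment.

```python
def recommend(primary: float, roi: float) -> str:
    """Apply the three recommendation rules from the Interpretation Guide as written."""
    if primary < 40 and roi > 3:
        return "Proceed with Agentic AI implementation"
    if 40 <= primary <= 60 and roi > 3:
        return "Consider hybrid approach"
    if primary > 60 or roi < 3:
        return "Maintain traditional application"
    # Assumption: the guide does not cover this case, for example an
    # ROI Score of exactly 3 with a Primary Score of 60 or less.
    return "No rule applies"


# Hypothetical scores for illustration only.
low_primary = recommend(primary=30.25, roi=3.5)
high_primary = recommend(primary=75, roi=3.5)
boundary = recommend(primary=30.25, roi=3.0)
```

Making the uncovered case explicit, rather than silently defaulting to one of the three outcomes, keeps the sketch faithful to the rules and leaves the boundary decision to whoever adapts the framework.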
