
Insights from the frontline: Testing AI systems

Amaresh Tripathy

Transforming enterprises through AI

Published: December 17, 2024

As many of our solutions move from pilot to production, we are learning a lot about how AI systems behave at scale. Because AI is fundamentally probabilistic software, there are nuances that are not obvious if you only follow standard software development best practices.

One of the most problematic areas tends to be testing. For customer service knowledge retrieval or guided sales applications, where some version of chatbot capability is in the mix, here are a few things we have figured out:

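1. Automated LLM testing scores such as ROUGE and BLEU are not very useful: This is especially true when developing a Q&A interface with significant business nuances. These metrics measure word overlap with a reference answer, so a response can score well while still getting a business-critical detail wrong.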

2. Real testing happens during the pilot phase: It is hard to optimize for testing upfront in the development cycle. Testing with 40-50 questions is good enough for the MVP stage, but it is not sufficient to test a “Product” for both accuracy and performance (e.g., responsiveness and the ability to handle multiple simultaneous users). One way to close the gap is to use an LLM to expand the 40-50 carefully curated test scenarios to 400-500 by varying the question parameters (asking the same question for product-1, product-2, product-3, and so on), and then to automate the testing process.


3. Concordance testing is key: Testing against ground truth (i.e., the ideal response) shows whether responses are “Accurate and Complete”, “Accurate but incomplete”, or “Inaccurate”. Much of the testing is a measure of script concordance with experts, a concept that is widely used in medicine for clinical diagnosis and is very applicable to the world of LLMs.

4. Getting the right users in the pilot is as critical as the AI solution being tested: It is important to select pilot users who understand that an AI system gets better with feedback, and to set appropriate expectations with them. Normally these are the more senior experts, who are better at catching issues and have an incentive to build a system that scales their expertise.
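As a rough illustration of points 2 and 3, here is a minimal Python sketch of what expanding curated scenarios and bucketing responses against ground truth could look like. Everything in it is an assumption made for illustration: the templates, the product names, the `expand_scenarios`, `grade`, and `summarize` helpers, and especially the idea of expressing ground truth as a list of required key facts plus known-wrong claims checked by substring match. In practice the expansion would likely be done by an LLM and the grading by experts or a more careful judge.

```python
from collections import Counter
from itertools import product

# Hypothetical curated scenarios; a real set would hold the 40-50 expert questions.
TEMPLATES = [
    "What is the warranty period for {product}?",
    "Does {product} support international shipping?",
]
PRODUCTS = ["product-1", "product-2", "product-3"]

# The three buckets named in the article.
LABELS = ["Accurate and Complete", "Accurate but incomplete", "Inaccurate"]


def expand_scenarios(templates, products):
    """Turn a few curated templates into one test question per product."""
    return [
        {"product": p, "question": t.format(product=p)}
        for t, p in product(templates, products)
    ]


def grade(response, required_facts, wrong_claims):
    """Bucket a response against ground truth with naive substring checks.

    Assumption: ground truth is a list of key facts the ideal response
    contains, plus claims an expert would flag as wrong.
    """
    text = response.lower()
    if any(claim.lower() in text for claim in wrong_claims):
        return "Inaccurate"
    found = [fact for fact in required_facts if fact.lower() in text]
    if len(found) == len(required_facts):
        return "Accurate and Complete"
    if found:
        return "Accurate but incomplete"
    return "Inaccurate"


def summarize(labels):
    """Return the share of responses that fall into each bucket."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


scenarios = expand_scenarios(TEMPLATES, PRODUCTS)
print(len(scenarios))  # 2 templates x 3 products = 6 questions

# Hypothetical chatbot answers to one warranty question, graded against
# one hypothetical ground-truth entry.
required = ["24 months", "parts and labor"]
wrong = ["12 months"]
answers = [
    "The warranty is 24 months and covers parts and labor.",
    "The warranty is 24 months.",
    "The warranty is 12 months and covers parts and labor.",
]
print(summarize([grade(a, required, wrong) for a in answers]))
```

The share of responses in each bucket can then be tracked from one pilot iteration to the next, with experts reviewing anything that does not land in “Accurate and Complete”.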

Credits: insights shared by Satish Tammineni, Kritika B., Varun Sharma, Vikram Raju, AuxoAI, based on their work across clients.

