Insights from the frontline: Testing AI systems
As more of our solutions move from pilot to production, we are learning a lot about how AI systems behave at scale. Because AI is fundamentally probabilistic software, it has nuances that are not obvious if you simply follow standard software development best practices.
One of the most problematic areas tends to be testing. For customer service knowledge retrieval or guided sales applications, where some version of chatbot capability is in the mix, here are a few things we have figured out:
1. Automated LLM test scores such as ROUGE and BLEU are not very useful: These metrics measure word overlap with a reference text, which says little about correctness when developing a Q&A interface with significant business nuances.
2. Real testing happens during the pilot phase: It is hard to optimize for testing upfront in the development cycle. Testing with 40-50 questions is fine for the MVP stage, but it is not sufficient to test a “Product” for both accuracy and performance (e.g., responsiveness and the ability to handle multiple simultaneous users). One way to close the gap is to use an LLM to expand the 40-50 carefully curated test scenarios to 400-500 by varying the question parameters (asking the same question for product-1, product-2, product-3, and so on), and then to automate the testing process.
3. Concordance testing is key: Testing against ground truth (i.e., the ideal response) shows whether responses are “Accurate and Complete”, “Accurate but incomplete”, or “Inaccurate”. Much of this testing is a measure of script concordance with experts, a concept that is widely used in medicine for clinical diagnosis and is very applicable in the world of LLMs.
4. Getting the right users into the pilot is as critical as the AI solution being tested: It is important to select pilot users who understand that an AI system gets better with feedback, and to set appropriate expectations with them. Normally these are the more senior experts, who are better at catching issues and have an incentive to build a system that scales their expertise.
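To make points 2 and 3 concrete, here is a minimal Python sketch, not the team's actual tooling. It assumes the curated questions can be written as templates with a `{product}` placeholder and expands them by plain string substitution, a simpler stand-in for the LLM-based expansion described above. It also assumes experts have already graded each response with one of the three concordance labels; the code only tallies those grades. The function names `expand_scenarios` and `concordance_summary` are hypothetical.

```python
from collections import Counter
from itertools import product

# The three concordance labels named in the article.
LABELS = ("Accurate and Complete", "Accurate but incomplete", "Inaccurate")


def expand_scenarios(templates, parameters):
    """Fill every template with every combination of parameter values.

    Assumption: templates use str.format placeholders such as {product}.
    """
    names = list(parameters)
    cases = []
    for template in templates:
        for values in product(*(parameters[name] for name in names)):
            cases.append(template.format(**dict(zip(names, values))))
    # Drop duplicates (e.g. a template that ignores a parameter), keep order.
    return list(dict.fromkeys(cases))


def concordance_summary(graded):
    """Return the share of responses per label from (question, label) pairs.

    Assumption: the labels were assigned by experts comparing each response
    with the ground-truth answer; this function only counts them.
    """
    counts = Counter(label for _, label in graded)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS}


# Illustrative templates and product names, not real test data.
templates = [
    "What is the warranty period for {product}?",
    "How do I return {product}?",
]
parameters = {"product": ["product-1", "product-2", "product-3"]}
cases = expand_scenarios(templates, parameters)
```

With 40-50 templates and around ten parameter values, the same expansion yields the 400-500 scenarios mentioned in point 2. Tracking the three label shares from `concordance_summary` across pilot iterations gives a simple view of whether expert feedback is improving the system.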
Credits: insights shared by Satish Tammineni, Kritika B., Varun Sharma, Vikram Raju, AuxoAI, based on their work across clients.