QuantUniversity
kicked off its annual Fall AI school yesterday.
Agus Sudjianto
gave the first guest lecture, discussing approaches to validating and testing generative AI models with a focus on Retrieval-Augmented Generation (RAG) systems. He argued that comprehensively validating large language models (LLMs) is difficult due to the vastness of their training data and the uncontrollability of user input, but that validating RAG systems is feasible because they rely on a finite set of documents. He presented a framework for RAG system validation that involves curating question-answer pairs, evaluating retrieval and generation capabilities, and conducting human evaluation to calibrate the model's performance. He emphasized interpretable metrics based on semantic similarity, dimensionality reduction techniques for analysis, and automated testing methods for scalability. He also discussed the importance of feature engineering and prompt engineering in achieving successful validation.
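The interpretable, semantic-similarity-based metrics described above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: it assumes answer and reference texts have already been converted to embedding vectors by some embedding model, and the function names and threshold value are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def score_answer(answer_vec, reference_vec, threshold=0.8):
    """Score a generated answer against a curated reference answer.

    Returns the similarity and a pass/fail flag; the threshold would be
    calibrated via human evaluation, as the lecture describes.
    """
    sim = cosine_similarity(answer_vec, reference_vec)
    return sim, sim >= threshold
```

Because the metric is just a similarity between two inspectable vectors, a validator can trace any pass/fail decision back to the texts being compared, which is the transparency property the lecture stresses.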
The slides and video from the presentation are available here:
- Validating AI for High-Stakes Applications: Traditional AI validation methods are inadequate for generative AI, especially in finance and medicine where errors carry significant risks.
- Conceptual Soundness and Outcome Analysis: Validation must encompass not just performance but also conceptual soundness, input design, explainability, and alignment with business goals.
- RAG Systems as a Validation Target: Comprehensive validation of generic LLMs is currently impossible due to their vast scope. This lecture focuses on validating RAG systems, which offer a more manageable scope for testing.
- Automated Testing and Explainable Metrics: Automated testing with stratified sampling ensures comprehensive coverage. Explainable metrics based on semantic similarity and embedding models are crucial for transparency and regulatory acceptance.
- Human-in-the-Loop Calibration: While automation is key, human evaluation and calibration remain essential for verifying algorithmic results and setting appropriate thresholds for production deployment.
- AI as a "Guessing Machine": Dr. Sudjianto emphasizes that AI, even generative AI, ultimately makes predictions based on patterns. This inherent uncertainty necessitates rigorous validation, especially in high-stakes fields.
- Limitations of Benchmarking: While useful for general comparisons, standardized benchmarks fail to address the specific risks and requirements of real-world business applications.
- Focus on Model Weakness: Identifying and diagnosing model weaknesses, particularly through explainable metrics, is paramount for building trust and mitigating potential harm.
- Importance of Embedding Models: Embedding models play a crucial role in validation, enabling semantic similarity calculations for evaluating retrieval quality, groundedness, and answer relevance.
- Data Science in Validation: Dr. Sudjianto highlights that model validation in the age of generative AI is a data science-heavy task, requiring expertise in areas like clustering, calibration, and statistical analysis.
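The automated testing with stratified sampling mentioned above can be sketched as follows. This is a hypothetical illustration under assumed data shapes (question-answer pairs tagged with a topic cluster); the lecture did not present this code.

```python
import random
from collections import defaultdict

def stratified_sample(qa_pairs, stratum_key, n_per_stratum, seed=0):
    """Sample curated Q&A pairs from each stratum (e.g. topic cluster).

    Sampling per stratum rather than uniformly ensures that rare document
    topics are still covered by the automated test suite.
    """
    rng = random.Random(seed)  # fixed seed keeps test runs reproducible
    strata = defaultdict(list)
    for pair in qa_pairs:
        strata[stratum_key(pair)].append(pair)
    sample = []
    for pairs in strata.values():
        k = min(n_per_stratum, len(pairs))
        sample.extend(rng.sample(pairs, k))
    return sample
```

Each sampled pair would then be run through the RAG system and scored with the semantic-similarity metrics, giving per-stratum coverage statistics rather than a single aggregate number.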
- "AI is a guessing machine... Now we want to use them for cases where mistakes can cause real harm."
- "Comprehensive validation of [generic] LLMs is hard, and I would call it impossible... But comprehensive validation of RAG is possible."
- "We do not use LLMs as judge or jury... We need techniques, we need metrics that is very transparent, explainable, easy to understand."
- "The key that we're going to present today in model validation is really identification of model weakness and diagnostics with explainable diagnostics."
- "Fast thing is data science. Building it is a technology exercise."
1. On-demand courses at www.quantuniversity.com
The Use and Governance of Generative AI at Large US Financial Institutions
Oct 8th with Jacob Kosoff, Bank of America
1/2 day Workshop on AI & Investing
Looking forward to your participation!