Intro to AI Rubrics for Product Managers
Scott Germaise
Digital Product Management Leader | Strategy Development | Start-up Expertise | Roadmap/Requirements | KPI Planning | Acquisition Due Diligence | Team Building/Leadership | Budgeting | Vendor Management |
What is a Rubric for AI Products?
A rubric is about evaluation and quality control, but also standardization, consistency, and more. Rubrics originated in education and assessment, so the term may be new to a digital product person. The general idea is to have a highly structured way to make qualitative judgments. This turned out to be roughly parallel to what is needed to evaluate AI output, so the model was adapted for that purpose. Rubrics for AI evaluation are used in academia, by tech companies, and by regulatory and standards bodies. For traditional development, we have a variety of QA standards. Many of them involve unit and integration testing, which in modern workflows are often part of a continuous development and deployment pipeline. Rubrics can also be used along a development path: during early evaluation and fine-tuning, pre-deployment, and for ongoing testing. However, a rubric can only be applied once at least a rough working model is available.
In the case of AI model quality assessment, a rubric is a structured framework for evaluation.
In the context of GenAI, rubrics help measure key qualities such as accuracy, effectiveness, reliability, UX quality, fairness, ethical considerations, explainability, and other aspects of AI-driven features. Some criteria can be checked by automated tools, but rubrics are generally designed for human evaluators assessing output.
How is a rubric used? As assessments are completed, AI technical teams can use the results to refine models and enhance quality. Results can point to where model fine-tuning could be helpful, measure improvement across new versions, support competitive analysis, and help determine whether a model is ready for deployment.
The goal is to strive for a standardized assessment, even for output that requires qualitative judgment. A good rubric allows different evaluators to apply similar standards, and allows tooling to compare their ratings.
You will need rubrics for your GenAI products. If you are the lead product person on an AI project, you may be responsible for leading rubric development, though not necessarily creating it. If there's an AI/ML-specific product manager, this is almost certainly their responsibility. Regardless of who owns the deliverable, building out a rubric is a team sport; possibly a large team! Collaboration spans several stakeholders: product, ML engineers and data scientists, possibly a Responsible AI/ethics team, policy/legal experts, QA/test engineers, and in some cases domain experts. Who are we forgetting? Oh yes, end users. Ideally you can have customers participating in evaluation. Skipping such evaluations is asking for trouble: weak evaluations are inconsistent and risky, and can lead to even more hallucinations and factual inaccuracies, bias and fairness problems, ethical and legal risks, plus challenges in improving models.
Developing a rubric is not generally what one might think of as a UX designer's responsibility, and yet the output of a GenAI product is the whole value of the experience. Even so, rubric development is most likely going to be in the hands of an AI product manager because of its cross-functional nature.
How is a Rubric Structured?
You'll start with some kind of general task workflow for evaluating prompts themselves and the output along the way. The goal is for an evaluator to run the rubric checklist as they execute this workflow. What's being assessed are the responses to prompts. Remember that prompts can be incredibly complex. In traditional search, we can look at results and judge classic metrics such as relevance and precision, among others. With GenAI, we have prompts that may be multi-turn, Chain-of-Thought (CoT), and other structures, including specific constraints. We will likely need to test all along this path. Sometimes an early criterion in a rubric might be checking whether a reasoning model figures out which tools it needs before it can even start reasoning toward an answer. For example, a user asks for the best movies for their child, who's interested in some topic or another. The tool might first need to select the right movie database before running an initial query to gather vector-based data prior to inference. Part of the evaluation is "Did the AI select the right tool?" regardless of whether it came up with correct answers later.
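To make that tool-selection example concrete, here's a minimal Python sketch of how such a check might be scored. The trace format, field names, and tool names are assumptions for illustration, not a standard API.

```python
# A minimal sketch of a tool-selection check, assuming a hypothetical trace format
# in which each record names the tool the model actually invoked and the tool a
# human annotator expected for that task. All names and fields are illustrative.

from dataclasses import dataclass

@dataclass
class ToolSelectionCase:
    prompt: str          # the user's request
    expected_tool: str   # tool the annotator judged appropriate
    invoked_tool: str    # tool the model actually called, pulled from its trace

def score_tool_selection(case: ToolSelectionCase) -> dict:
    """Score the 'did the AI select the right tool?' criterion, independent of
    whether the final answer later turned out to be correct."""
    passed = case.invoked_tool == case.expected_tool
    return {
        "criterion": "tool_selection",
        "score": 1 if passed else 0,
        "notes": (
            "Selected the expected tool."
            if passed
            else f"Expected '{case.expected_tool}' but model invoked '{case.invoked_tool}'."
        ),
    }

# Example: the movie-recommendation request described above.
case = ToolSelectionCase(
    prompt="Best movies for my 8-year-old who loves dinosaurs?",
    expected_tool="movie_database",
    invoked_tool="web_search",
)
print(score_tool_selection(case))
```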
A rubric is usually going to be an in-depth set of criteria. The document defining it will be long. The complexity of getting it done will be non-trivial. The evaluators performing it will have a significant time commitment. You may consider jobbing this task out to firms that specialize in such work. However, you will likely want an internal cohort of some sort also running the checklist to serve as a quality check on any external contractors.
In a moment, we're going to look at some possible task dimensions for rubrics. These are just examples for a text-oriented GenAI tool. Clearly, there will be entirely different criteria for other output types such as images, audio, and so on. If your tool is multi-modal, you'll need criteria for all of them, plus for what integration looks like.
Be aware that some of the dimensions are qualitative and therefore inherently subjective, at least to a degree. You will offer evaluation criteria, yet such things still involve judgment. With these evaluations, the same prompt may be checked by multiple evaluators. In this sense, you're also "testing the test and the testers." If you're finding large differences between evaluations, you may need to revisit the rubric itself, your evaluators, or their training.
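As a rough illustration of "testing the testers," here's a minimal Python sketch that compares the scores several evaluators gave the same prompt on one dimension and flags large disagreements. The data layout and threshold are assumptions; real programs often use formal agreement statistics such as Cohen's kappa.

```python
# A minimal sketch of an inter-evaluator consistency check. The prompt IDs,
# score scale (1-5), and disagreement threshold are illustrative assumptions.

from statistics import mean

# prompt_id -> scores from different evaluators on one rubric dimension
ratings = {
    "prompt_001": [5, 5, 4],
    "prompt_002": [2, 5, 3],   # wide spread: review the rubric, evaluators, or training
    "prompt_003": [4, 4, 4],
}

DISAGREEMENT_THRESHOLD = 2  # max-minus-min spread that triggers a review

for prompt_id, scores in ratings.items():
    spread = max(scores) - min(scores)
    flag = "REVIEW" if spread >= DISAGREEMENT_THRESHOLD else "ok"
    print(f"{prompt_id}: mean={mean(scores):.1f} spread={spread} -> {flag}")
```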
Rubric Samples
Following is a very small sample of some rubric dimensions and how they might be evaluated, just to give you an idea of the kinds of things being checked.
The following table shows a sample of what criteria dimensions might look like. Just note that I made this up quickly; the rules for judging each criterion could be significantly more involved, and there's usually a training document explaining how to use the criteria as well.
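For a sense of how such criteria might be encoded so that both human evaluators and tooling can consume them, here's a purely illustrative Python sketch. The dimensions, scales, and wording are made up; a real rubric would be far more detailed.

```python
# An illustrative encoding of rubric criteria dimensions. Every dimension name,
# question, and scale anchor below is an assumption for the sake of example.

RUBRIC = [
    {
        "dimension": "instruction_following",
        "question": "Does the response address every explicit instruction in the prompt?",
        "scale": {1: "Ignores key instructions", 3: "Partially follows", 5: "Follows completely"},
    },
    {
        "dimension": "factual_accuracy",
        "question": "Are all verifiable claims in the response correct?",
        "scale": {1: "Multiple false claims", 3: "Minor inaccuracies", 5: "No detectable errors"},
    },
    {
        "dimension": "safety_and_tone",
        "question": "Is the response free of harmful, biased, or inappropriate content?",
        "scale": {1: "Clearly problematic", 3: "Borderline", 5: "No concerns"},
    },
]
```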
There will often be comparisons of multiple outputs. An evaluator would have to clearly express things like, "@Response A is better than @Response B because @Response A follows the prompt instructions perfectly and has accurate information, while @Response B follows instructions but makes three verifiably false claims." Linters (typically used to flag code issues) might be adapted to check such text justifications against the numerical ratings to make sure everything is aligned.
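As a rough sketch of what such a "linter" for evaluator write-ups might look like, the following Python assumes a hypothetical record format with a free-text justification plus numeric scores for the two responses being compared.

```python
# A minimal sketch of a consistency check between an evaluator's written verdict
# and their numeric scores. The record format and field names are assumptions.

import re

def lint_comparison(record: dict) -> list[str]:
    """Flag cases where the written verdict and the numeric scores disagree."""
    issues = []
    text = record["justification"]
    match = re.search(r"@Response ([AB]) is better than @Response ([AB])", text)
    if not match:
        issues.append("Justification does not state which response is better.")
        return issues
    claimed_winner = match.group(1)
    numeric_winner = "A" if record["score_a"] > record["score_b"] else "B"
    if record["score_a"] == record["score_b"]:
        issues.append("Scores are tied but the justification declares a winner.")
    elif claimed_winner != numeric_winner:
        issues.append(
            f"Text says Response {claimed_winner} wins, but scores favor Response {numeric_winner}."
        )
    return issues

record = {
    "justification": "@Response A is better than @Response B because it follows the "
                     "prompt instructions and has no false claims.",
    "score_a": 5,
    "score_b": 3,
}
print(lint_comparison(record) or ["No issues found."])
```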
The output from evaluators can be used to provide feedback directly to the model. You may hear this called "Supervised Fine-Tuning (SFT)," which uses labeled data from rubric evaluations to retrain models, or "Reinforcement Learning from Human Feedback (RLHF)," where evaluators' ranked comparisons or feedback are used to refine the model's reward function. For effective fine-tuning, the rubric data must be structured in a way the model can learn from, often requiring well-defined labels, structured annotations, or numerical scoring. (See: Supervised Fine-tuning: customizing LLMs, Understanding and Using Supervised Fine-Tuning (SFT), What is RLHF?, Illustrating Reinforcement Learning from Human Feedback (RLHF), What is reinforcement learning from human feedback (RLHF).)
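Here's a minimal sketch of how rubric evaluations might be restructured into training-ready records, assuming a simple made-up format: high-scoring single responses become SFT examples, while ranked comparisons become preference pairs of the kind used in RLHF-style reward modeling. Field names and thresholds are illustrative assumptions.

```python
# A minimal sketch of turning rubric evaluations into training-ready records.
# The in-house evaluation format, field names, and threshold are assumptions.

import json

evaluations = [
    {"prompt": "Summarize this policy...", "response": "The policy states...",
     "overall_score": 5},
    {"prompt": "Compare plans A and B...", "chosen": "Plan A offers...",
     "rejected": "Both plans are identical...", "type": "comparison"},
]

SFT_THRESHOLD = 4  # only keep responses evaluators rated highly

sft_records, preference_pairs = [], []
for ev in evaluations:
    if ev.get("type") == "comparison":
        # Ranked comparison -> preference pair for reward-model training
        preference_pairs.append(
            {"prompt": ev["prompt"], "chosen": ev["chosen"], "rejected": ev["rejected"]}
        )
    elif ev["overall_score"] >= SFT_THRESHOLD:
        # High-scoring single response -> supervised fine-tuning example
        sft_records.append({"prompt": ev["prompt"], "completion": ev["response"]})

print(json.dumps({"sft": sft_records, "preferences": preference_pairs}, indent=2))
```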
This has been a VERY small sample of what makes up a rubric. Done fully, this usually ends up being a complex, time-consuming, and ongoing evaluation task. Make sure to allocate resources to get it done. We have seen plenty of examples of even major players being called out publicly for products with issues in these areas, from bias to inappropriate responses that can be outright dangerous. As with other vertical market needs, there are service providers that can help with evaluation tasks.
Conclusion
Rubrics are an essential tool for ensuring the quality, reliability, and ethical soundness of AI-generated outputs. While their development can be complex and resource-intensive, they provide a standardized framework for evaluating AI performance across multiple dimensions. As a product manager, you may not be responsible for creating a rubric from scratch, but you will likely play a key role in shaping its implementation and ensuring cross-functional collaboration. Given the growing scrutiny of AI-generated content, investing in rigorous evaluation processes is not just a best practice. It’s a necessity for building trust and delivering high-quality AI experiences.