(From the discussion with LLM expert and leader @Meta: Bing Liu)
When assessing the quality of generative AI products, it can be challenging to determine which factors to consider. While user retention is typically the key metric for most user-facing products, it is a lagging metric that can only be measured online. Therefore, how can we evaluate the quality of the #LLM (the core of the #genai product) prior to shipping it?
There are three main dimensions that we can examine: helpfulness, harmlessness, and latency.
- Helpfulness:
  - Language understanding and generation: the fundamental quality of the model's output. We expect the model to understand our requests and respond in fluent, coherent, and natural language.
  - Relevance: how relevant the generated text is to the input or intended context. This can be the accuracy of the answer in question-answering tasks, or how well generated content or media fits the request.
  - Diversity and creativity: when the product is meant not only for information synthesis but also to help us create content, users will expect some novelty and creativity from the output.
- Harmlessness:
  - Bias and fairness: from a social responsibility and PR perspective, we want the LLM output to be fair across gender and race and free from harmful stereotypes.
  - User trust, safety, and privacy: this covers ethical concerns such as privacy violations, misinformation, and potential harm.
  - Handling of ambiguity and edge cases: we also need to check how the LLM handles ambiguous input or unusual scenarios, and ensure that it doesn't produce incorrect or misleading responses in such cases.
- Latency: the model's response time is a key element in meeting users' expectations. In the LLM context, it is often measured as a) time to the first token (word), and b) average time to generate each subsequent token.
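As a rough illustration of the two latency measurements, here is a minimal Python sketch. It assumes the model's output arrives as an iterable that yields one token at a time. The names `measure_latency` and `fake_stream` are hypothetical and do not belong to any particular library. The `clock` parameter is an assumption added so the timing logic can be checked with a fake clock.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_latency(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, Optional[float], int]:
    """Return (time_to_first_token, avg_time_per_subsequent_token, token_count).

    The average is None when the stream yields only one token, since there
    are no subsequent tokens to average over.
    """
    start = clock()  # taken just before we begin consuming the stream
    first_token_at: Optional[float] = None
    last_token_at: Optional[float] = None
    count = 0

    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1

    if first_token_at is None or last_token_at is None:
        raise ValueError("stream produced no tokens")

    time_to_first_token = first_token_at - start
    avg_subsequent = (
        (last_token_at - first_token_at) / (count - 1) if count > 1 else None
    )
    return time_to_first_token, avg_subsequent, count


def fake_stream(tokens: Iterable[str], first_delay: float, step_delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: sleeps before each token."""
    for i, tok in enumerate(tokens):
        time.sleep(first_delay if i == 0 else step_delay)
        yield tok


if __name__ == "__main__":
    ttft, avg, n = measure_latency(fake_stream(["Hello", ",", " world"], 0.05, 0.01))
    print(f"tokens={n} ttft={ttft:.3f}s avg_subsequent={avg:.3f}s")
```

In practice `token_stream` would be whatever streaming interface the serving stack exposes. The measured numbers then include network and client overhead, which is usually what matters from the user's point of view.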
By breaking down product quality into these tangible and measurable dimensions, we can better understand the strengths and weaknesses of the generative AI product. This information is essential in helping us optimize the product in the next cycle.