The platform priority


Business leaders are buzzing about generative AI. To help you keep up with this fast-moving, transformative topic, our regular column, “The Prompt,” brings you observations from the field, where Google Cloud leaders are working closely with customers and partners to define the future of AI. In this edition, Warren Barkley, Vertex AI product leader, discusses how to choose the right AI platform to maximize the impact of your gen AI investments and deliver long-lasting business value. This article first appeared on Transform with Google Cloud.


Driving measurable AI value goes beyond having access to amazing models; it also means being able to monitor models, ensure they produce the results you need, and make adjustments to enhance the quality of responses.

One notable difference with gen AI compared to other types of AI technologies is the ability to customize and augment models. This includes using techniques like retrieval-augmented generation (RAG) to ground responses in enterprise truth, or prompt engineering, where users provide the model with additional information and instructions to improve and optimize its responses.
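
To make that concrete, here is a minimal, self-contained sketch of RAG-style prompt augmentation in Python. The toy keyword-overlap retriever and the sample corpus are stand-ins for a real vector or enterprise search backend, and the actual model call is omitted, so every name here is illustrative rather than part of any particular SDK.

```python
# Minimal sketch of RAG-style prompt augmentation: ground the prompt in
# retrieved enterprise content before it is sent to a model. The toy
# keyword-overlap retriever stands in for a real retrieval backend.

CORPUS = [
    "Policy P-100 covers water damage up to 10,000 EUR per claim.",
    "Premium statements are issued on the first business day of each quarter.",
    "Claims must be filed within 30 days of the incident.",
]

def retrieve_documents(question: str, top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    terms = set(question.lower().split())
    scored = sorted(CORPUS, key=lambda d: len(terms & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def build_grounded_prompt(question: str) -> str:
    """Combine retrieved context with instructions so the model answers from sources."""
    docs = retrieve_documents(question)
    context = "\n".join(f"[Source {i + 1}] {d}" for i, d in enumerate(docs))
    return (
        "Answer using only the sources below. If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("When are premium statements issued?"))
```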

Foundation models are non-deterministic, meaning they might produce different responses on different occasions even when given the same prompt, and they can also hallucinate. In addition, many benefits of gen AI may be more intangible and difficult to quantify, such as increasing customer and employee satisfaction, boosting creativity, or enabling brand differentiation. In other words, it’s significantly more complex to track and measure gen AI compared to more traditional AI systems.

As a result, there’s a growing need for tools and processes that not only provide a clear understanding of how well gen AI models and agents align with use cases but also detect when they are no longer performing as expected — without relying solely on human verification.
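
As a rough illustration of what such a process can look like (not a specific product feature), the sketch below compares a current evaluation score against a stored baseline and raises an alert when quality drifts beyond an agreed tolerance. The baseline and tolerance values are made up, and the scoring step itself is assumed to happen elsewhere.

```python
# Sketch of an automated regression check: compare the current evaluation
# score against a stored baseline and alert when quality drifts beyond a
# tolerance, instead of relying on humans to spot-check every response.

BASELINE_SCORE = 0.82   # illustrative score recorded when the use case was approved
TOLERANCE = 0.05        # illustrative acceptable drop before a human needs to look

def meets_quality_bar(current_score: float) -> bool:
    """Return True if the model still meets the agreed quality bar."""
    return current_score >= BASELINE_SCORE - TOLERANCE

if __name__ == "__main__":
    # In practice this value would come from a scheduled evaluation run
    # (e.g. a hypothetical score_responses(test_set) job).
    current = 0.74
    if not meets_quality_bar(current):
        print(f"ALERT: score {current:.2f} fell below baseline {BASELINE_SCORE:.2f}; escalate for review")
```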

Evaluating and optimizing gen AI

Before gen AI models can be deployed into production, you’ll need to be able to validate and test a model’s performance, resilience, and efficiency through metrics-based benchmarking and human evaluation. However, manual evaluation typically becomes unsustainable as projects mature and the number of use cases increases. Many teams also find it difficult to implement a practical evaluation framework due to limited resources, a lack of technical knowledge and high-quality data, and the rapid pace of innovation in the market.

For example, Vertex AI’s Gen AI Evaluation Service lets you evaluate any gen AI model or application and benchmark the evaluation results using your own evaluation criteria or pre-built metrics to assess the quality of summarization, Q&A, text generation, and safety. The service also allows you to do automatic side-by-side comparisons of different model variations — whether Google, third-party, or open-source models — to see which model works best for getting the desired output, along with confidence scores and explanations for each selection.
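
As a rough sketch of how this looks in code, the snippet below runs a small evaluation dataset through the Vertex AI Python SDK’s EvalTask. Module paths, metric names, and parameters vary across SDK versions, and the project ID, dataset, and experiment name are placeholders, so treat it as the shape of the workflow rather than a drop-in example.

```python
# Sketch of a metrics-based evaluation run (assumed vertexai SDK usage;
# exact module paths and metric names may differ across SDK versions).
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project

# Small evaluation dataset: prompts plus ground-truth references.
eval_dataset = pd.DataFrame({
    "prompt": ["Summarize policy P-100 in one sentence."],
    "reference": ["Policy P-100 covers water damage up to 10,000 EUR per claim."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_l_sum",  # computation-based metric scored against the reference
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,  # model-based metric
    ],
    experiment="summarization-eval",  # placeholder experiment name
)

# Evaluate one candidate model; rerun with a different model to compare results side by side.
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))
print(result.summary_metrics)
```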

Already, we’ve seen customers enhance and accelerate their ability to move gen AI applications to production by reducing the need for manual evaluation. Generali Italia, a leading insurance provider in Italy, used the Gen AI Evaluation Service to evaluate the retrieval and generative functions of a new gen AI application that lets employees interact conversationally with documents, such as policy and premium statements. Using the service, Generali Italia reduced the need for manual evaluation while making it easier for teams to objectively understand how models work and the different factors that impact performance.

Overall, a gen AI evaluation service is not only a valuable tool for model selection but also for assessing various aspects of a model’s behavior over time, helping to identify areas for improvement and even recommending changes to enhance performance for your use cases. Evaluation feedback can also be combined with other optimization tools to continuously refine and improve models.

For instance, we recently introduced the Vertex AI Prompt Optimizer, now in Public Preview, which helps you find the best prompt to get the optimal output from your target model based on multiple sample prompts and the specific evaluation metrics you want to optimize against. Additionally, taking other steps to help guarantee reliability and consistency, such as ensuring that a model adheres to a specific response schema, can help you better align model outputs with quality standards, leading to more accurate, reliable, and trustworthy responses over time.
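
For the response-schema piece specifically, here is a hedged sketch of constraining a Gemini model’s output to a JSON structure through the Vertex AI SDK. The schema and project ID are invented for illustration, and parameter names or schema field casing may differ between SDK versions.

```python
# Sketch of constraining a model to a response schema so outputs stay
# machine-parseable (assumed Vertex AI SDK usage; parameter names and
# schema casing may differ across SDK versions).
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="my-project", location="us-central1")  # placeholder project

# Illustrative OpenAPI-style schema describing the shape every response must follow.
claim_schema = {
    "type": "OBJECT",
    "properties": {
        "policy_id": {"type": "STRING"},
        "covered": {"type": "BOOLEAN"},
        "summary": {"type": "STRING"},
    },
    "required": ["policy_id", "covered", "summary"],
}

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Is water damage covered under policy P-100? Answer using the schema.",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=claim_schema,
    ),
)
print(response.text)  # JSON that conforms to claim_schema
```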

Putting evaluation at the heart of your gen AI system

Overall, a well-structured generative AI evaluation framework on a trusted, one-stop-shop platform can help organizations move faster and more responsibly, enabling them to deploy more gen AI use cases. The biggest shift that comes with AI is adopting a metrics-driven approach to development rather than a test-driven one. With each stage of the AI journey, it’s important to have a clear understanding of any changes you want to introduce, the desired outcome, and the key performance indicators you want to measure .

With that in mind, I wanted to end with some best practices we have found helpful when working with customers on their approach to evaluation:

  • Make your evaluation criteria task-specific. The metrics you use to evaluate Q&A won’t be the same ones you use for summarization. Evaluation criteria should not only take into account the type of task but also the individual use cases and your organization’s specific business requirements. This is always the best place to start.
  • Create a strong “test set.” A “test set” is a collection of input prompts or questions used to assess the performance of a gen AI model, providing a standardized way to measure how well a model performs on different tasks and types of inputs.
  • Use more than one type of evaluation. Evaluating models effectively may mean using multiple methods depending on what you’re assessing. Computation-based evaluation, for instance, compares generated outputs to your ground-truth responses. Autoraters — AI models designed to perform evaluation — can help you automatically assess outputs, such as text, code, images, or even music, against another model. Finally, evaluations may need to be escalated to humans when there is low confidence in the results (a small sketch combining these methods follows this list).
  • Start simple. The unpredictability of gen AI models is amplified when you chain models together to build gen AI agents. You’ll need to evaluate both the individual models and the overall combined system, so it’s crucial to start simple and add complexity as you go.
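
To tie several of these practices together, here is a small, self-contained sketch that runs a toy test set through a cheap computation check and a stand-in autorater, escalating low-confidence cases to a human review queue. In practice the autorater would be a judge model prompted with a rubric; the test set, metric, and thresholds below are all invented for illustration.

```python
# Sketch of mixing evaluation methods over a small test set: a computation
# check against ground truth, a toy autorater, and escalation to a human
# review queue when confidence is low.
from difflib import SequenceMatcher

TEST_SET = [
    {"prompt": "When are premium statements issued?",
     "response": "On the first business day of each quarter.",
     "reference": "Premium statements are issued on the first business day of each quarter."},
    {"prompt": "What is the claim filing deadline?",
     "response": "You have about a month.",
     "reference": "Claims must be filed within 30 days of the incident."},
]

def computation_score(response: str, reference: str) -> float:
    """Ground-truth comparison: string similarity as a cheap proxy metric."""
    return SequenceMatcher(None, response.lower(), reference.lower()).ratio()

def autorate(prompt: str, response: str) -> tuple[float, float]:
    """Toy autorater returning (quality score, confidence); a real one would
    prompt a judge model with a rubric and parse its verdict."""
    score = 1.0 if response else 0.0
    confidence = 0.9 if len(response.split()) > 5 else 0.4
    return score, confidence

human_review_queue = []
for case in TEST_SET:
    similarity = computation_score(case["response"], case["reference"])
    score, confidence = autorate(case["prompt"], case["response"])
    if confidence < 0.5 or similarity < 0.4:
        human_review_queue.append(case["prompt"])  # low confidence: escalate to a person

print(f"{len(human_review_queue)} of {len(TEST_SET)} cases escalated to human review")
```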

If you're interested in learning more about Google Cloud's AI product strategy, recent Gemini advancements, product updates, and ongoing areas of investment, check out this webinar.


