Seeing is Referee-ing: Using Vision Models for Content Evaluation
While 2024 may be remembered as the year that generative AI hype peaked and crashed, we’ll also look back on 2024 as the year another breakthrough technology took a step forward from hype to reality. The Metaverse, and specifically hardware to enable AR/VR/ MR, picked up momentum with the public release of the Apple Vision Pro (to mixed reviews), and developer prototypes made available for Snap’s new Spectacles, and Meta’s Orion glasses. These devices are all early in their lifecycle but make it easier to imagine a future of widespread adoption of virtual worlds filled with AI-generated content.
Of course, as adoption of AI-generative content grew this year, there were also plenty of “fail” moments, one of the most high profile being Google Gemini’s image generation feature getting pulled from public availability because of its inability to create historically accurate images.? AI image generation tools are bringing people to question the age-old mantra that ‘seeing is believing’. However, I would argue that with a responsible AI-first mindset, and the right technical workflows, we might be able to create a new paradigm where ‘seeing is referee-ing’.?????
To explore that hypothesis, in this article I'll lay out a framework that uses vision models for content evaluation, specifically with the goal of augmenting the ability of human moderators to evaluate synthetic content effectively at scale. This workflow was inspired by the paper eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models.
Approach
We’re going to explore a simple, illustrative workflow built in python with the following steps:
Tech Stack?
Here's a brief overview, with the full repo available here:
Inputs
Components
Tooling
Step by Step
1) Setup
API Keys: You’ll need an API key for:
2) Evals Prep
Let’s dive into the evals functions as that’s the key feature of this tutorial, within the helper_functions.py file.
2.1 Define content guardrails?
I created the content guardrail categories and descriptions within the code below:
2.2 Evaluation Functions?
get_vision_completion():
evaluate_image_for_content_with_examples():
领英推荐
There is also a utility function (not shown above), process_image(), to load images and prepare them to be converted to a base64 string. It’s important to ensure consistency of file formats between image loading, base64 string conversion, and the media_type argument passed to the vision model within the API call above.?
3) Executing the Evals
Now, we’re ready to dive into the jupyter notebook to run the core evals loop and pass that to Arize, our monitoring platform. Let’s walk through the code in evalution_notebook.ipynb:
First, you’re going to see the basic steps of importing libraries and data files. We’re loading pre-generated images from a training run of the Scenes From Tomorrow project for climate change education, which you can read more details about here if you are interested. We’ll load a dataframe with metadata, a unique ID, and a timestamp of generation for each image. We’ll revisit these results again at the end of the workflow to see how to identify images that failed the content guardrails checks.?
3.1 Core Evals Workflow
In this cell we call the functions defined above within an execution block to generate a dataframe containing the evaluation results for each of the 100 images:
4) Stage to our Evaluation Platform
The next few cells send prepare the data schema and push the evaluations to Arize for storage and visualization. These cells are based on the Arize documentation here for connection set-up and here for column schema set-up. Note that the schema in this notebook for vision model content evaluation case is adapted from one designed for binary classification. The column names have been force-fitted a bit, which you’d want to clean up and standardize before putting this into an ongoing process.
5) Evaluate & Visualize
Finally, we’ll look at the results. We can do this locally to start as a gut-check on the workflow and then check out dashboards in Arize.
Next let’s take a look at Arize. Using the built-in Arize dashboard features I published a very rough POC of what a monitoring dashboard would look like for a production application where we have a visual record of when content violations occurred. In an actual application, we’d also want to orchestrate additional alerts and notifications to be triggered from within the application to ensure the right escalation and resolution is put in place.?
This dashboard filters for 'YES' values, to identify a violation of a content check, giving us a glanceable way to determine when the event occurred.
Final Thoughts
Technical Performance and Augmenting Human Moderation
It takes around 10 seconds to evaluate each image. Assuming that a human moderator needs a minimum of around 1-2 minutes to evaluate a piece of content (especially across multiple dimensions), this workflow could significantly scale up human moderators ability to evaluate content. Considering the exponential scale at which synthetic content is likely to grow over the next few years, such an improvement seems critical.
Implications for Business and Content Safety
Looking forward, it’s easy to imagine a future where industry-required third party verification of AI generated images becomes an essential pillar to maintain consumer trust. The digital media industry offers an example here with industry associations that adopt common standards (such as the IAB, Internet Advertising Bureau), along with 3rd party partners such as DoubleVerify who provide verification of campaign delivery within agreed-upon benchmarks.
Hopefully this is helpful and please let me know what you think in the comments!