Seeing is Referee-ing: Using Vision Models for Content Evaluation
Imaged generated using ideogram.ai

Seeing is Referee-ing: Using Vision Models for Content Evaluation

While 2024 may be remembered as the year that generative AI hype peaked and crashed, we’ll also look back on 2024 as the year another breakthrough technology took a step forward from hype to reality. The Metaverse, and specifically hardware to enable AR/VR/ MR, picked up momentum with the public release of the Apple Vision Pro (to mixed reviews), and developer prototypes made available for Snap’s new Spectacles, and Meta’s Orion glasses. These devices are all early in their lifecycle but make it easier to imagine a future of widespread adoption of virtual worlds filled with AI-generated content.

Of course, as adoption of AI-generative content grew this year, there were also plenty of “fail” moments, one of the most high profile being Google Gemini’s image generation feature getting pulled from public availability because of its inability to create historically accurate images.? AI image generation tools are bringing people to question the age-old mantra that ‘seeing is believing’. However, I would argue that with a responsible AI-first mindset, and the right technical workflows, we might be able to create a new paradigm where ‘seeing is referee-ing’.?????

To explore that hypothesis, in this article I'll lay out a framework that uses vision models for content evaluation, specifically with the goal of augmenting the ability of human moderators to evaluate synthetic content effectively at scale. This workflow was inspired by the paper eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models.

Approach

We’re going to explore a simple, illustrative workflow built in python with the following steps:

  1. Load and iterate through a series of synthetic images
  2. For each image, use a vision model to create both basic descriptive summaries, as well as classification-style, pass/fail evaluation based on pre-defined content guardrails
  3. Deploy evaluation results to a dedicated platform (here, we are usingArize) to show how we can operationalize content monitoring for apps in production
  4. Finally, we'll evaluate the workflow, exploring how to optimize the synthetic content creation prompts based on evaluation results, as well as insights into the performance (speed) of the evaluation ?

Tech Stack?

Here's a brief overview, with the full repo available here:

Inputs

Components

  • Jupyter notebooks: Most of the workflow occurs in the jupyter notebook for flexibility and customization based on user needs
  • Claude for vision evaluation with user-defined functions: The heavy lifting for moderation happens in the ‘helper_functions.py' file which we’ll walk through below. To adapt this project, you’ll almost certainly want to customize those functions as they contain specific instructions for how the Vision model (Claude) will evaluate your image.
  • Arize platform for logging and visualization, a leader in open-source ML and LLM monitoring, with its own API

Tooling

  • For this project I used Cursor, an IDE with built-in LLM integration that provides a phenomenal productivity boost!

Step by Step

1) Setup

API Keys: You’ll need an API key for:

  • The Vision model (e.g. Claude or similar)
  • Arize / observability platform: For Arize you need a space_id for your project as well as an API key. Go here to set up your account and then go to space settings to get your space id and API key
  • Environment setup: Load the requirements.txt file within your virtual environment
  • Load images: the image folder is dated '4.4.24', and the juptyer notebook will open images from this folder

2) Evals Prep

Let’s dive into the evals functions as that’s the key feature of this tutorial, within the helper_functions.py file.

2.1 Define content guardrails?

I created the content guardrail categories and descriptions within the code below:

  • A dictionary names the 8 risk areas to evaluate the content; grouped into 2 categories Safety and Bias.
  • There is another dictionary used to instruct the LLM in the criteria they are using to evaluate the image, by providing a few-shot learning approach
  • This data structure aims to balance human readability in interpretation of the data with effective prompting for the LLM, as we will see in the example below
  • These dictionaries are defined globally within the helper_functions.py file so they can be called from the juptyer notebook within the evaluation flow.


2.2 Evaluation Functions?

get_vision_completion():

  • This function serves as a base template for the vision evals. It takes an image as a base64 string, along with a user prompt, and passes that to the Claude vision model and returns the response from Claude??
  • We’ll call this from within the jupyter notebook to assign the image / base64 string, as well as the prompt

evaluate_image_for_content_with_examples():

  • This function expands the previous one to work through the content checks dictionary.?
  • It includes a few-shot style prompting approach where the instructions from the content_check_examples dictionary are passed into Claude before asking it to evaluate the image for those examples.
  • From a data hygiene standpoint, we also instruct Claude to respond only with a YES or NO to ensure consistency in all responses and be able to filter the results later based on binary classifications
  • We then return Claude’s response as the output of the function


There is also a utility function (not shown above), process_image(), to load images and prepare them to be converted to a base64 string. It’s important to ensure consistency of file formats between image loading, base64 string conversion, and the media_type argument passed to the vision model within the API call above.?

3) Executing the Evals

Now, we’re ready to dive into the jupyter notebook to run the core evals loop and pass that to Arize, our monitoring platform. Let’s walk through the code in evalution_notebook.ipynb:

First, you’re going to see the basic steps of importing libraries and data files. We’re loading pre-generated images from a training run of the Scenes From Tomorrow project for climate change education, which you can read more details about here if you are interested. We’ll load a dataframe with metadata, a unique ID, and a timestamp of generation for each image. We’ll revisit these results again at the end of the workflow to see how to identify images that failed the content guardrails checks.?

3.1 Core Evals Workflow

In this cell we call the functions defined above within an execution block to generate a dataframe containing the evaluation results for each of the 100 images:

  • Load the image and convert to base64 string
  • Use the image filename to retrieve the metadata
  • Call the get_vision_completion() and evaluate_image_for_content_with_examples() functions to?
  • We initially store all the results in a dictionary with nested lists for the content checks and then in the next cell convert to a pandas dataframe and unpack the nested columns
  • You will notice some sleep() commands included to prevent API rate limiting. Depending on your plan you may not need these. The loop will take around ~15 mins to run for the 100 images included in the repo.



4) Stage to our Evaluation Platform

The next few cells send prepare the data schema and push the evaluations to Arize for storage and visualization. These cells are based on the Arize documentation here for connection set-up and here for column schema set-up. Note that the schema in this notebook for vision model content evaluation case is adapted from one designed for binary classification. The column names have been force-fitted a bit, which you’d want to clean up and standardize before putting this into an ongoing process.

5) Evaluate & Visualize

Finally, we’ll look at the results. We can do this locally to start as a gut-check on the workflow and then check out dashboards in Arize.

  • First, locally using a panda operation we can filter the dataframe. The approach we defined earlier uses a YES result to indicate a ‘fail’ of the content checks and a NO result to indicate a pass. Then, we include a quick code snippet to iterate through that filtered dataframe and display those images along?
  • You will see the image in question showed an urban environment (Busan, South Korea) but with a prompt to generate images of wildfires as part of showing the negative impacts of climate change in a 3.0°C warming scenario. The contradiction of these two things led the image model (Stable Diffusion XL Turbo) to create an image showing an urban landscape suffering from fire and devastation.?

  • The fact this image was flagged as a ‘fail’ is encouraging as it validates that this workflow could be useful!
  • The corrective action I took was to change the image generation prompt to focus more on landscape images, and reduce the risk of creating images that seemed to represent scenes of violence in urban settings.?



Next let’s take a look at Arize. Using the built-in Arize dashboard features I published a very rough POC of what a monitoring dashboard would look like for a production application where we have a visual record of when content violations occurred. In an actual application, we’d also want to orchestrate additional alerts and notifications to be triggered from within the application to ensure the right escalation and resolution is put in place.?

This dashboard filters for 'YES' values, to identify a violation of a content check, giving us a glanceable way to determine when the event occurred.

Final Thoughts

Technical Performance and Augmenting Human Moderation

It takes around 10 seconds to evaluate each image. Assuming that a human moderator needs a minimum of around 1-2 minutes to evaluate a piece of content (especially across multiple dimensions), this workflow could significantly scale up human moderators ability to evaluate content. Considering the exponential scale at which synthetic content is likely to grow over the next few years, such an improvement seems critical.


Implications for Business and Content Safety

Looking forward, it’s easy to imagine a future where industry-required third party verification of AI generated images becomes an essential pillar to maintain consumer trust. The digital media industry offers an example here with industry associations that adopt common standards (such as the IAB, Internet Advertising Bureau), along with 3rd party partners such as DoubleVerify who provide verification of campaign delivery within agreed-upon benchmarks.


Hopefully this is helpful and please let me know what you think in the comments!

要查看或添加评论,请登录

Alexander Liss的更多文章

社区洞察

其他会员也浏览了