Unveiling AI's Potential: Building a Visual Question Answering App with Gradio and Transformers
Venugopal Adep
AI Leader | General Manager at Reliance Jio | LLM & GenAI Pioneer | AI Evangelist
In my latest project, I've ventured into the fascinating realm of AI, exploring how vision and language models can work together to answer questions about images. I'll take you through my journey of building a Visual Question Answering application using Gradio and Hugging Face's Transformers. This article details the process, from installing Gradio to deploying a user-friendly web interface, demonstrating the ease and efficiency of creating powerful AI tools. Join me in discovering how these cutting-edge technologies are making AI more accessible and interactive than ever before.
Install and import libraries
This code snippet quietly installs a specific version (4.5.0) of Gradio, a Python library for building machine learning web apps (the -q flag suppresses pip's output). It then imports Gradio, the ViLT (Vision-and-Language Transformer) processor and model for question answering from Hugging Face's transformers, and the Python Imaging Library (PIL) for image handling.
!pip install gradio==4.5.0 -q
import gradio as gr
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
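As a quick sanity check that PIL is working, you can build a small in-memory image to use as a stand-in input while experimenting (a minimal sketch; the image size and color here are arbitrary, not required by the model):

```python
from PIL import Image

# Create a small solid-gray RGB image entirely in memory,
# so no image file is needed while testing the pipeline.
img = Image.new("RGB", (224, 224), color=(128, 128, 128))
print(img.size, img.mode)  # -> (224, 224) RGB
```

In the deployed app, Gradio hands your function a PIL image directly (because the input component uses type="pil"), so anything you test with an image like this carries over unchanged.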
Load the model
This code initializes the ViLT processor and model fine-tuned for visual question answering, loading them once from the pre-trained checkpoint. Loading is done outside the function so the weights are not reloaded on every call, which keeps inference fast.
# Load the processor and model outside of the function to avoid reloading them each time the function is called
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
Answer Generation
The answer_question function takes an image and a question, encodes them with the pre-trained ViLT processor, and runs the ViLT model to predict an answer. It selects the answer ID with the highest logit, maps that ID to a human-readable label via the model's id2label configuration, and returns the label.
def answer_question(image, question):
    # Process the image and question into model-ready tensors
    inputs = processor(images=image, text=question, return_tensors="pt", padding=True)
    # Perform the inference
    outputs = model(**inputs)
    # Extract the predicted answer
    logits = outputs.logits
    answer_id = logits.argmax(-1).item()
    answer = model.config.id2label[answer_id]
    return answer
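The argmax-then-lookup step at the end is easy to see in isolation. Here is a toy illustration with a made-up score vector and a hypothetical three-answer label map (the real model has thousands of candidate answers); it mimics logits.argmax(-1) and model.config.id2label in plain Python:

```python
# Hypothetical scores for three candidate answers (stand-ins for model logits)
logits = [0.1, 2.7, 0.3]
# Hypothetical analogue of model.config.id2label
id2label = {0: "no", 1: "yes", 2: "maybe"}

# Pick the index of the highest score, then map it to its label
answer_id = max(range(len(logits)), key=lambda i: logits[i])
print(id2label[answer_id])  # -> yes
```

This is why the function returns a short phrase rather than free-form text: ViT-style VQA models like this one classify over a fixed answer vocabulary.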
Create Gradio Interface
The code defines a Gradio web interface, iface, for the visual question answering application: answer_question is the function to call, an image and a question are the inputs, and a textbox is the output. Launching the interface serves a user-friendly web app where users can upload an image, ask a question about it, and receive the model's answer.
# Define the Gradio interface
iface = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Visual Question Answering",
    description="Upload an image and ask a question related to the image. The AI will try to answer it."
)
# Launch the interface
iface.launch()