Using LLMs to describe images
I worked on face recognition many years ago, so the natural next step is to use a Large Language Model (LLM) to describe what's in the photo.
With both local and cloud-based LLMs this is actually very, VERY easy!
First, here's the description of the above image using a local LLM:
"I see an image of a person sitting at a table with a sign that reads "WE'RE HIRING." The individual appears to be resting their head on one hand, and there are various items on the table including what looks like paperwork, possibly related to job applications or hiring. The background shows a banner with additional text that includes "highlights," "PROBLEM SOLVERS," and some contact information, suggesting this might be at a recruitment event or fair where companies are showcasing their open positions."
The LLM I used is the LLaVA multimodal model:
"LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6."
New in LLaVA 1.6:
* Increased input image resolution to up to 4x more pixels, supporting 672x672, 336x1344 and 1344x336 resolutions.
* Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
* Better visual conversation for more scenarios, covering different applications.
* Better world knowledge and logical reasoning.
On my MacBook Pro it takes approximately 4 seconds to analyse an image of 2048 by 1366 px using Java 21 and LangChain4J 0.31.0.
How is it done?
Three simple Java statements, and that's it:
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;

// Connect to the locally running Ollama server and use the LLaVA model
private OllamaChatModel ollamaChatModel = OllamaChatModel
        .builder()
        .modelName("llava:latest")
        .baseUrl("http://localhost:11434") // Ollama serves plain HTTP by default
        .maxRetries(3)
        .build();

// Combine a text prompt and the image (imageUrl points to the photo) in one user message
UserMessage userMessage = UserMessage.from(
        TextContent.from("What do you see?"),
        ImageContent.from(imageUrl)
);

Response<AiMessage> generate = ollamaChatModel.generate(userMessage);
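The Response wraps the generated AiMessage; getting the actual description out is one more line, using LangChain4J's content() and text() accessors:

// Extract the generated description from the response and print it
String description = generate.content().text();
System.out.println(description);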
GPT-4o
This is what OpenAI GPT-4o gives using LangChain4J for the same photo:
The image shows a woman sitting at a table in front of a blue backdrop with the words "We're hiring" written in large letters. She is wearing a dark-colored shirt and holding a phone. On the table, there are several white water bottles and some informational pamphlets or cards. The backdrop also has additional text that includes words like "Highflyers," "Problem Solvers," and possibly more. The setting appears to be a job fair or recruitment event.
You do need an API key for this cloud-based model.
import dev.langchain4j.model.openai.OpenAiChatModel;

// Connect to the OpenAI API and use the GPT-4o model
private OpenAiChatModel openAIChatModel = OpenAiChatModel
        .builder()
        .modelName("gpt-4o")
        .apiKey("xxxx") // your OpenAI API key
        .maxRetries(3)
        .build();

// The same text + image user message as before
UserMessage userMessage = UserMessage.from(
        TextContent.from("What do you see?"),
        ImageContent.from(imageUrl)
);

Response<AiMessage> generate = openAIChatModel.generate(userMessage);
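If the photo is a local file rather than a URL, you can send it inline as Base64 data instead. A minimal sketch, assuming a hypothetical file path, the image/jpeg MIME type, and that your LangChain4J version has the ImageContent.from(base64, mimeType) overload (error handling omitted):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Read a local photo (hypothetical path) and encode it as Base64
byte[] imageBytes = Files.readAllBytes(Path.of("photos/devoxx-001.jpg"));
String base64Image = Base64.getEncoder().encodeToString(imageBytes);

UserMessage localImageMessage = UserMessage.from(
        TextContent.from("What do you see?"),
        ImageContent.from(base64Image, "image/jpeg")
);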
What's interesting is that when I ask ChatGPT (also running GPT-4o) the same question, I get a much more detailed description, probably because OpenAI adds some extra prompt magic in the system message:
"This photo captures a scene at a hiring or recruitment event. Here's a breakdown of the details:
Overall, the image conveys a professional and organized recruitment effort by Auto Trader, aimed at attracting talented individuals to join their team."
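You can approximate that behaviour through the API by passing your own system message alongside the user message. A minimal sketch, where the system prompt text is my own guess at what helps, not OpenAI's actual hidden prompt:

import dev.langchain4j.data.message.SystemMessage;

// A custom system message asking for a structured, detailed breakdown
SystemMessage systemMessage = SystemMessage.from(
        "Describe the image in detail. List the people, objects and any text you can read, "
        + "and finish with a one-sentence summary of the scene.");

Response<AiMessage> detailed = openAIChatModel.generate(systemMessage, userMessage);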
Here are some more results using the local LLaVA model:
The next step is combining semantic search with the speaker recognition, so I can type: "Give me all photos where the Devoxx letters are visible with Venkat speaking".
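A minimal sketch of how that could work with LangChain4J, assuming an Ollama embedding model such as nomic-embed-text is pulled locally and that the generated descriptions are kept per photo; the model name, file names, example descriptions and the findRelevant call are my assumptions, not something from this post:

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.ollama.OllamaEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import java.util.List;
import java.util.Map;

// Embed every photo description once and keep it in an in-memory store
OllamaEmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
        .baseUrl("http://localhost:11434")
        .modelName("nomic-embed-text") // assumed local embedding model
        .build();

InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

// Illustrative data: file name -> description generated by LLaVA
Map<String, String> descriptionsByPhoto = Map.of(
        "photo-001.jpg", "A woman at a 'We're hiring' booth with pamphlets on the table",
        "photo-002.jpg", "Venkat speaking on stage with the Devoxx letters behind him");

descriptionsByPhoto.forEach((file, description) -> {
    TextSegment segment = TextSegment.from(description, Metadata.from("file", file));
    Embedding embedding = embeddingModel.embed(segment.text()).content();
    store.add(embedding, segment);
});

// Semantic search: embed the query and find the closest photo descriptions
Embedding query = embeddingModel
        .embed("Devoxx letters visible with Venkat speaking").content();
List<EmbeddingMatch<TextSegment>> matches = store.findRelevant(query, 5);
matches.forEach(m -> System.out.println(
        m.segment().metadata().get("file") + " -> score " + m.score()));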
Good stuff!
PS: Please let me know which other local multimodal models I should try out.