Using LLMs to describe images
Devoxx UK 2024 photo from Flickr

I worked on face recognition many years ago, so the natural next step is to use a Large Language Model (LLM) to describe what's in a photo.

With both local and cloud-based LLMs this is actually very, VERY EASY!

First, here's the description of the above image using a local LLM:

"I see an image of a person sitting at a table with a sign that reads "WE'RE HIRING." The individual appears to be resting their head on one hand, and there are various items on the table including what looks like paperwork, possibly related to job applications or hiring. The background shows a banner with additional text that includes "highlights," "PROBLEM SOLVERS," and some contact information, suggesting this might be at a recruitment event or fair where companies are showcasing their open positions."

The LLM I used is the LLaVA multimodal model:

"LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6."

New in LLaVA 1.6:
* Increasing the input image resolution to up to 4x more pixels, supporting 672x672, 336x1344, 1344x336 resolutions.
* Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
* Better visual conversation for more scenarios, covering different applications.
* Better world knowledge and logical reasoning.        

On my MacBook Pro it takes approximately 4 seconds to analyse an image of 2048 by 1366 px using Java 21 and LangChain4J 0.31.0.
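
For anyone who wants to reproduce a timing like this, here is a minimal sketch of how a single call could be measured. The class name TimedCall and the record Timed are hypothetical, and the lambda in main is only a placeholder for the real model call, so the number it prints says nothing about LLaVA itself.

```java
import java.time.Duration;
import java.util.function.Supplier;

public class TimedCall {

    /** Result of a call together with how long it took. */
    record Timed<T>(T value, Duration elapsed) {}

    /** Runs the given call once and measures wall-clock time with System.nanoTime(). */
    static <T> Timed<T> time(Supplier<T> call) {
        long start = System.nanoTime();
        T value = call.get();
        return new Timed<>(value, Duration.ofNanos(System.nanoTime() - start));
    }

    public static void main(String[] args) {
        // Placeholder for the real model call, e.g. () -> model.generate(userMessage)
        Timed<String> timed = time(() -> "a description of the photo");
        System.out.println(timed.value() + " (" + timed.elapsed().toMillis() + " ms)");
    }
}
```

A single measurement includes model loading and JVM warm-up, so averaging a few runs gives a fairer figure.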

How is it done?

Three simple Java statements:

  1. Create a ChatModel using a local Ollama instance and the LLaVA model
  2. Create your user message with the prompt and image URL
  3. Execute the user prompt using the generate method

That's it:

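import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;

// 1. Chat model backed by a local Ollama instance (Ollama serves plain HTTP on port 11434 by default)
private OllamaChatModel ollamaChatModel = OllamaChatModel
    .builder()
    .modelName("llava:latest")
    .baseUrl("http://localhost:11434")
    .maxRetries(3)
    .build();

// 2. User message combining the text prompt and the image URL
UserMessage userMessage = UserMessage.from(
    TextContent.from("What do you see?"),
    ImageContent.from(imageUrl)
);

// 3. Send the message; the description is in generate.content().text()
Response<AiMessage> generate = ollamaChatModel.generate(userMessage);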
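
If you are curious what such a request looks like without the library, here is a rough sketch that talks to Ollama's REST endpoint /api/generate directly, which accepts base64-encoded images in an "images" array. This is not LangChain4J's implementation, only an illustration using the JDK's own HTTP client. The class name OllamaImageRequest and its helper methods are hypothetical, and the hand-rolled JSON escaping is kept deliberately small.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class OllamaImageRequest {

    /** Escapes the characters that must be escaped inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body for /api/generate with one base64-encoded image attached. */
    static String buildBody(String model, String prompt, byte[] imageBytes) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "{\"model\":\"" + jsonEscape(model) + "\","
             + "\"prompt\":\"" + jsonEscape(prompt) + "\","
             + "\"images\":[\"" + base64 + "\"],"
             + "\"stream\":false}";
    }

    /** Posts the body to an Ollama instance and returns the raw JSON response. */
    static String send(String baseUrl, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java OllamaImageRequest <path-to-image>");
            return;
        }
        byte[] image = Files.readAllBytes(Path.of(args[0]));
        String body = buildBody("llava:latest", "What do you see?", image);
        // Assumes Ollama is running locally on its default port.
        System.out.println(send("http://localhost:11434", body));
    }
}
```

With "stream" set to false, Ollama returns one JSON object instead of a stream of partial responses, which keeps the sketch short.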

GPT-4o

This is what OpenAI GPT-4o gives using LangChain4J for the same photo:

The image shows a woman sitting at a table in front of a blue backdrop with the words "We're hiring" written in large letters. She is wearing a dark-colored shirt and holding a phone. On the table, there are several white water bottles and some informational pamphlets or cards. The backdrop also has additional text that includes words like "Highflyers," "Problem Solvers," and possibly more. The setting appears to be a job fair or recruitment event.

You do need an API key for this cloud-based model.

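import dev.langchain4j.model.openai.OpenAiChatModel;

// Cloud model: same builder pattern, plus an API key ("xxxx" is a placeholder)
private OpenAiChatModel openAIChatModel = OpenAiChatModel
    .builder()
    .modelName("gpt-4o")
    .apiKey("xxxx")
    .maxRetries(3)
    .build();

// Same prompt and image as for the local model
UserMessage userMessage = UserMessage.from(
    TextContent.from("What do you see?"),
    ImageContent.from(imageUrl)
);

Response<AiMessage> generate = openAIChatModel.generate(userMessage);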
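
Rather than hard-coding the key, one option is to read it from an environment variable. The sketch below assumes a variable named OPENAI_API_KEY, which is a common convention but still an assumption about your setup. The class ApiKeys and its require method are hypothetical helpers.

```java
import java.util.Map;

public class ApiKeys {

    /** Returns the trimmed value of the given variable, or fails fast with a clear message. */
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // OPENAI_API_KEY is an assumed variable name; use whatever your setup defines.
        try {
            String apiKey = require(System.getenv(), "OPENAI_API_KEY");
            System.out.println("API key loaded (" + apiKey.length() + " characters)");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The returned value could then be passed to the builder's apiKey(...) call instead of the "xxxx" placeholder.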

Interestingly, when I ask ChatGPT itself (using GPT-4o) about the same photo, I get a much more detailed description, probably because OpenAI adds some extra prompt magic in the system message:


"This photo captures a scene at a hiring or recruitment event. Here's a breakdown of the details:

  1. Setting: The backdrop prominently displays the words "We're hiring," indicating that the event is focused on recruitment. The background is blue with white text, making the message stand out clearly.
  2. Person: A woman with long blonde hair is seated, holding a smartphone. She is wearing a dark blue T-shirt with the logo and name "Auto Trader" along with a badge or lanyard that might identify her as a representative of the company.
  3. Table Setup: On the table in front of her, there are several white water bottles, flyers, and business cards. This setup suggests that the table is likely a booth where the company is distributing promotional materials and information about job opportunities.
  4. Additional Text: Below the primary "We're hiring" message, other encouraging terms like "Highflyers," "Innovators," and "Problem Solvers" are visible. These terms are likely used to attract potential candidates who identify with these qualities.

Overall, the image conveys a professional and organized recruitment effort by Auto Trader, aimed at attracting talented individuals to join their team."
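
One way to test the system-message guess would be to send your own system message along with the image. The sketch below only builds the JSON body in the shape used by OpenAI's chat completions API (a "system" message followed by a "user" message with text and image_url parts). The system prompt wording is made up for illustration, not what ChatGPT actually uses, and the class name SystemPromptRequest and the example URL are hypothetical.

```java
public class SystemPromptRequest {

    /** Minimal JSON string escaping for backslashes, quotes and newlines. */
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds a chat request with a system message, a text prompt and an image URL. */
    static String buildBody(String model, String systemPrompt, String userPrompt, String imageUrl) {
        return "{\"model\":\"" + esc(model) + "\",\"messages\":["
             + "{\"role\":\"system\",\"content\":\"" + esc(systemPrompt) + "\"},"
             + "{\"role\":\"user\",\"content\":["
             + "{\"type\":\"text\",\"text\":\"" + esc(userPrompt) + "\"},"
             + "{\"type\":\"image_url\",\"image_url\":{\"url\":\"" + esc(imageUrl) + "\"}}"
             + "]}]}";
    }

    public static void main(String[] args) {
        // The system prompt below is an invented example, not OpenAI's.
        String body = buildBody(
                "gpt-4o",
                "Describe photos in detail: setting, people, objects and any visible text.",
                "What do you see?",
                "https://example.com/photo.jpg");
        System.out.println(body);
    }
}
```

Comparing the answers with and without the system message would show how much of the extra detail comes from the prompt alone.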


Here are some more results using the local LLaVA model:

Devoxx UK 2024
DevoxxUK 2024

The next step is semantic search combined with speaker recognition, so I can type: "Give me all photos where the Devoxx letters are visible with Venkat speaking".
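
To give an idea of how that could fit together, here is a toy sketch: each photo carries its LLM-generated description plus the set of recognised speakers, and a query is matched by filtering on the speaker and ranking on description similarity. Word-count cosine similarity is used here purely as a stand-in for real embeddings, which a proper semantic search would need. All names (PhotoSearch, Photo, the sample data) are hypothetical.

```java
import java.util.*;
import java.util.stream.Collectors;

public class PhotoSearch {

    /** A photo with its LLM-generated description and the speakers recognised in it. */
    record Photo(String id, String description, Set<String> speakers) {}

    /** Lower-cases the text and counts each word (a crude stand-in for an embedding). */
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two word-count vectors; 0.0 when either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (var e : a.entrySet()) dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    /** Photos showing the speaker, ranked by how similar the description is to the query. */
    static List<Photo> search(List<Photo> photos, String query, String speaker) {
        Map<String, Integer> q = wordCounts(query);
        return photos.stream()
                .filter(p -> p.speakers().contains(speaker))
                .filter(p -> cosine(q, wordCounts(p.description())) > 0)
                .sorted(Comparator.comparingDouble(
                        (Photo p) -> -cosine(q, wordCounts(p.description()))))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented sample data; real descriptions would come from the model.
        List<Photo> photos = List.of(
                new Photo("p1", "A speaker on stage in front of large Devoxx letters", Set.of("Venkat")),
                new Photo("p2", "Large Devoxx letters in an empty hall", Set.of()),
                new Photo("p3", "A speaker holding a microphone", Set.of("Venkat")));
        for (Photo p : search(photos, "Devoxx letters visible", "Venkat")) {
            System.out.println(p.id() + ": " + p.description());
        }
    }
}
```

Swapping wordCounts and cosine for an embedding model and a vector store would keep the same overall shape: filter on the recognised speaker, then rank on the description.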

Good stuff!

PS: Please let me know which other local multimodal models I should try out.
