Using LLMs to describe images
I worked on face recognition many years ago, so the natural next step is to use a Large Language Model (LLM) to describe what's in the photo.
With both local and cloud-based LLMs this is actually very, VERY easy!
First, here's the description of the above image using a local LLM:
"I see an image of a person sitting at a table with a sign that reads "WE'RE HIRING." The individual appears to be resting their head on one hand, and there are various items on the table including what looks like paperwork, possibly related to job applications or hiring. The background shows a banner with additional text that includes "highlights," "PROBLEM SOLVERS," and some contact information, suggesting this might be at a recruitment event or fair where companies are showcasing their open positions."
The LLM I used is the LLaVA multimodal model:
"LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6."
New in LLaVA 1.6:
* Increased input image resolution to up to 4x more pixels, supporting 672x672, 336x1344 and 1344x336 resolutions.
* Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
* Better visual conversation for more scenarios, covering different applications.
* Better world knowledge and logical reasoning.
On my MacBook Pro it takes approximately 4 seconds to analyse an image of 2048 by 1366 px using Java 21 and LangChain4J 0.31.0.
How is it done?
Three simple Java statements, and that's it:
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;

// Connect to the locally running Ollama server and use the LLaVA model
private OllamaChatModel ollamaChatModel = OllamaChatModel
        .builder()
        .modelName("llava:latest")
        .baseUrl("http://localhost:11434") // Ollama serves plain HTTP by default
        .maxRetries(3)
        .build();

// Combine a text prompt and the image (imageUrl points to the photo) in one user message
UserMessage userMessage = UserMessage.from(
        TextContent.from("What do you see?"),
        ImageContent.from(imageUrl)
);

Response<AiMessage> generate = ollamaChatModel.generate(userMessage);
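The Response wraps the generated AiMessage; getting the actual description out is one more line, using LangChain4J's content() and text() accessors:

// Extract the generated description from the response and print it
String description = generate.content().text();
System.out.println(description);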
GPT-4o
This is what OpenAI GPT-4o gives using LangChain4J for the same photo:
The image shows a woman sitting at a table in front of a blue backdrop with the words "We're hiring" written in large letters. She is wearing a dark-colored shirt and holding a phone. On the table, there are several white water bottles and some informational pamphlets or cards. The backdrop also has additional text that includes words like "Highflyers," "Problem Solvers," and possibly more. The setting appears to be a job fair or recruitment event.
You do need an API key for this cloud-based model.
import dev.langchain4j.model.openai.OpenAiChatModel;

// Connect to the OpenAI API and use the GPT-4o model
private OpenAiChatModel openAIChatModel = OpenAiChatModel
        .builder()
        .modelName("gpt-4o")
        .apiKey("xxxx") // your OpenAI API key
        .maxRetries(3)
        .build();

// The same text + image user message as before
UserMessage userMessage = UserMessage.from(
        TextContent.from("What do you see?"),
        ImageContent.from(imageUrl)
);

Response<AiMessage> generate = openAIChatModel.generate(userMessage);
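If the photo is a local file rather than a URL, you can send it inline as Base64 data instead. A minimal sketch, assuming a hypothetical file path, the image/jpeg MIME type, and that your LangChain4J version has the ImageContent.from(base64, mimeType) overload (error handling omitted):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Read a local photo (hypothetical path) and encode it as Base64
byte[] imageBytes = Files.readAllBytes(Path.of("photos/devoxx-001.jpg"));
String base64Image = Base64.getEncoder().encodeToString(imageBytes);

UserMessage localImageMessage = UserMessage.from(
        TextContent.from("What do you see?"),
        ImageContent.from(base64Image, "image/jpeg")
);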
What's interesting is that when I ask ChatGPT (also running GPT-4o) the same question, I get a much more detailed description, probably because OpenAI adds some extra prompt magic in the system message:
"This photo captures a scene at a hiring or recruitment event. Here's a breakdown of the details:
Overall, the image conveys a professional and organized recruitment effort by Auto Trader, aimed at attracting talented individuals to join their team."
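You can approximate that behaviour through the API by passing your own system message alongside the user message. A minimal sketch, where the system prompt text is my own guess at what helps, not OpenAI's actual hidden prompt:

import dev.langchain4j.data.message.SystemMessage;

// A custom system message asking for a structured, detailed breakdown
SystemMessage systemMessage = SystemMessage.from(
        "Describe the image in detail. List the people, objects and any text you can read, "
        + "and finish with a one-sentence summary of the scene.");

Response<AiMessage> detailed = openAIChatModel.generate(systemMessage, userMessage);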
Here are some more results using the local LLaVA model:
The next step is combining semantic search with the speaker recognition, so I can type: "Give me all photos where the Devoxx letters are visible with Venkat speaking".
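A minimal sketch of how that could work with LangChain4J, assuming an Ollama embedding model such as nomic-embed-text is pulled locally and that the generated descriptions are kept per photo; the model name, file names, example descriptions and the findRelevant call are my assumptions, not something from this post:

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.ollama.OllamaEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import java.util.List;
import java.util.Map;

// Embed every photo description once and keep it in an in-memory store
OllamaEmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
        .baseUrl("http://localhost:11434")
        .modelName("nomic-embed-text") // assumed local embedding model
        .build();

InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

// Illustrative data: file name -> description generated by LLaVA
Map<String, String> descriptionsByPhoto = Map.of(
        "photo-001.jpg", "A woman at a 'We're hiring' booth with pamphlets on the table",
        "photo-002.jpg", "Venkat speaking on stage with the Devoxx letters behind him");

descriptionsByPhoto.forEach((file, description) -> {
    TextSegment segment = TextSegment.from(description, Metadata.from("file", file));
    Embedding embedding = embeddingModel.embed(segment.text()).content();
    store.add(embedding, segment);
});

// Semantic search: embed the query and find the closest photo descriptions
Embedding query = embeddingModel
        .embed("Devoxx letters visible with Venkat speaking").content();
List<EmbeddingMatch<TextSegment>> matches = store.findRelevant(query, 5);
matches.forEach(m -> System.out.println(
        m.segment().metadata().get("file") + " -> score " + m.score()));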
Good stuff!
PS: Please let me know which other local multimodal models I should try out.