AI Series Part VI: Using Images with GPT-4 Vision for RAG (NextJS)

Hey there! In the previous posts, we covered how to create a RAG chat app using JavaScript and Python servers. So far, those apps only accepted text file uploads. But as you can imagine, other kinds of documents can serve as data sources for the vector store. In this post, I’ll show how you can use images as content for your RAG app.

Before we start, let’s create a new project using the NextJS RAG chat app from a previous post as a base, so we don’t have to redo the OpenAI and LangChain setup from scratch. Name this new project nextjs-rag-image and make sure it’s up and running before moving forward. If you need to, revisit the post where we built that project, or download it from this GitHub repository: https://github.com/soutot/ai-series/tree/main/nextjs-chat-rag

Okay, now that we’re on the same page, let’s move forward.

To the Code

For RAG, we need a textual representation of the data so we can search the store for semantically similar content, pass it to the LLM, and get responses. To achieve that with images, we first need to convert the image into a textual representation, and for that we’ll use the GPT-4 Vision API. Let’s take a look at the code.

First, let’s update the FileUpload.tsx file to allow image uploads

const ALLOWED_FILE_TYPES = ['.jpg', '.png', '.jpeg']        
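
The rest of the component stays as it was in the base project. In case it helps, here’s a rough sketch of how that constant might be wired into the file input; the actual markup and props in your FileUpload.tsx will likely differ, so treat the names below as illustrative:

// Hypothetical wiring inside FileUpload.tsx; adapt to your component's real props and markup
export function FileUpload({onFileSelected}: {onFileSelected: (file: File) => void}) {
  return (
    <input
      type="file"
      // restrict the native file picker to the allowed image extensions
      accept={ALLOWED_FILE_TYPES.join(',')}
      onChange={(event) => {
        const file = event.target.files?.[0]
        if (!file) return
        // double-check the extension in case the browser ignores the accept attribute
        if (ALLOWED_FILE_TYPES.some((ext) => file.name.toLowerCase().endsWith(ext))) {
          onFileSelected(file)
        }
      }}
    />
  )
}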

Now, inside the api/embed/route.ts file, we’ll replace the existing code with the following.

Make sure you have the correct imports

import {HNSWLib} from '@langchain/community/vectorstores/hnswlib'
import {HumanMessage, MessageContent} from '@langchain/core/messages'
import {ChatPromptTemplate, SystemMessagePromptTemplate} from '@langchain/core/prompts'
import {ChatOpenAI, OpenAIEmbeddings} from '@langchain/openai'
import {LLMChain} from 'langchain/chains'
import {RecursiveCharacterTextSplitter} from 'langchain/text_splitter'
import {NextResponse} from 'next/server'        

Now create a system prompt to instruct the LLM

const SYSTEM_PROMPT = SystemMessagePromptTemplate.fromTemplate(
  `Your task is to generate a detailed and accurate description of the given image. 
  Write a comprehensive description of the image. Include as many relevant details as possible to provide a vivid and clear understanding of what the image portrays.
  Utilize descriptive language to convey the scene depicted in the image. Use adjectives, nouns, verbs, and adverbs effectively to paint a rich picture in the reader's mind.
  Ensure that your description is accurate and faithful to the content of the image. Avoid making assumptions or adding information that is not evident in the image.
  After writing the initial description, review and revise your text to ensure clarity, coherence, and accuracy. Make any necessary adjustments to enhance the quality of your description.
  `
)        

Then create the POST endpoint, which reads the uploaded image file content from the request

export async function POST(request: Request) {
  // Read the uploaded file from the multipart form data
  const data = await request.formData()

  const file: File | null = data.get('file') as unknown as File
  if (!file) {
    return NextResponse.json({message: 'Missing file input', success: false})
  }

  // Convert the file to a base64 string so it can be sent to the Vision API
  const buffer = await file.arrayBuffer()
  const base64Image = Buffer.from(buffer).toString('base64')

Create a message content object that’ll hold the base64 image

const content: Exclude<MessageContent, string> = []

content.push({
  type: 'image_url',
  image_url: {
    url: `data:image/jpeg;base64,${base64Image}`,
  },
})        

Create the prompt that’ll be sent to GPT-4 Vision API along with the image content

const visionPrompt = ChatPromptTemplate.fromMessages([
  SYSTEM_PROMPT,
  new HumanMessage({
    content,
  }),
])        

Initialize the LLM object that will make the API call. Note the model name: we’re specifying the GPT-4 Vision model

const visionLLM = new ChatOpenAI({
  temperature: 0,
  openAIApiKey: process.env.OPENAI_API_KEY,
  modelName: 'gpt-4-vision-preview',
  maxTokens: 4096,
})        

Create the Chain with the LLM and Prompt

const visionChain = new LLMChain({
  llm: visionLLM,
  prompt: visionPrompt,
})        

Now make the request to GPT-4 Vision and get the response

const response = await visionChain.invoke({})
const generatedText = response.text        

Split the response into document chunks to prepare it for embedding

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
})
const splitDocs = await textSplitter.createDocuments(generatedText.split('\n'))

Finally, embed the documents, store them in the vector store, and return an HTTP response

const embeddings = new OpenAIEmbeddings({
  openAIApiKey: process.env.OPENAI_API_KEY,
})

const vectorStore = await HNSWLib.fromDocuments(splitDocs, embeddings)
await vectorStore.save('vectorstore/rag-store.index')
return new NextResponse(JSON.stringify({success: true}), {
  status: 200,
  headers: {'content-type': 'application/json'},
})        

Cool! At this point we’re already getting the image description from GPT-4 Vision, embedding it, and storing it in the vector store. We’ve got everything ready to use RAG now.

But before we run the app, let’s improve the system prompt so we can tell the LLM it can now receive image descriptions as context. Open the api/route.ts file and edit the QA_PROMPT_TEMPLATE as follows:

const QA_PROMPT_TEMPLATE = `You are an assistant with limited knowledge, only capable of answering questions based on the provided context, which can be either an image description or a text content.
  Your responses must strictly adhere to the information given in the context; refrain from fabricating answers.
  If a question cannot be answered using the context, respond with "I don't know."
  Politely inform users that you're limited to answering questions related to the provided context.
  Context: """"{context}"""
  Question: """{question}"""
  Helpful answer in markdown:`        

Okay, now we’re good to go.

Run the app using pnpm run dev and access it at http://localhost:3000

Upload an image and wait for the process to complete. You can then ask questions about it, and the LLM should be able to provide answers.

In the example below, I uploaded the image I used for the first post I wrote about AI:

Cool, isn’t it?

If you’re curious, you can console.log the GPT-4 response or write it to a file so you can check how the model describes the image you’ve uploaded.
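
For instance, here’s a minimal sketch, assuming you add it right after generatedText is assigned inside the POST handler (the output filename is arbitrary):

import {writeFile} from 'fs/promises' // add to the top of api/embed/route.ts

// Log the generated description and keep a copy on disk for inspection
console.log(generatedText)
await writeFile('vision-description.txt', generatedText, 'utf-8')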

One thing to keep in mind is that the GPT-4 Vision API is not as cheap as GPT-3.5, so be careful not to burn through your credits while playing with this feature.

Issues

Even though LLMs are constantly improving, they still can't handle every use case. I ran into one that no LMM/LLM I've tried could solve: getting a precise description of diagrams and flowcharts. I tried GPT-4, Claude 3, and LLaVA, as well as other methods like Tesseract OCR, but had no success.

While the models are good at giving an overall description of an image, asking them for precise data or to make sense of how elements are linked is still pretty hard. They can describe a simple workflow with fairly high accuracy, but once the diagram gets more complex, with forks, loops, or wraps, the models struggle.

Maybe we'd need a specialist model fine-tuned for understanding diagrams. But right now, as far as I know, this limitation may block you from using this feature for certain kinds of work, so keep it in mind and either look for a better model or find another way of dealing with it.

An alternative for now is converting flowcharts to Mermaid or PlantUML and storing those text definitions instead. The results I got with that approach were satisfying. However, the downside is that you may need manual work to perform the conversion, and those tools still have their own limitations.
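
If you want to experiment with automating part of that, one idea is to reuse the same vision setup from the embed route but ask the model for a Mermaid definition instead of a prose description. The sketch below is just an illustration of that prompt swap, not something from the tutorial's code, and its accuracy will still degrade with diagram complexity:

// Hypothetical variation of the embed route: ask GPT-4 Vision for Mermaid syntax instead of prose
const DIAGRAM_SYSTEM_PROMPT = SystemMessagePromptTemplate.fromTemplate(
  `You will receive an image of a flowchart or diagram.
  Transcribe it as a valid Mermaid flowchart definition, preserving node labels and the direction of every arrow.
  Output only the Mermaid code, with no extra commentary.`
)

const diagramPrompt = ChatPromptTemplate.fromMessages([
  DIAGRAM_SYSTEM_PROMPT,
  new HumanMessage({content}), // same base64 image content object as before
])

const diagramChain = new LLMChain({llm: visionLLM, prompt: diagramPrompt})
const mermaidText = (await diagramChain.invoke({})).text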

Conclusion

In this post, we learned how to use the GPT-4 Vision API to get an image description, then embed that description into the vector store and use it to interact with the LLM, all orchestrated by LangChain. We also learned about the limitations of working with certain kinds of images.

This is just the beginning: you can do a lot more with this feature. There are also many other sources you can turn into vector store content, like videos, audio, and web pages. I'll try to cover those in the future.

Hope this post was helpful.

See you in the next one.

GitHub code repository: https://github.com/soutot/ai-series/tree/main/nextjs-rag-image
