The Next Generation of Document Classification: Exploring Vision Language Models
Bharani Srinivas
Senior Manager - AIML Delivery Lead | Machine Learning | R&D | Mentor at Great Lakes
Document classification is a fundamental task in the field of information retrieval and natural language processing. It involves categorizing documents into predefined classes based on their content. This process is crucial for managing, sorting, and retrieving the vast amounts of data generated daily in various sectors.
In the banking sector, document classification plays a pivotal role in intelligent document processing. Banks deal with a plethora of documents daily, such as loan applications, customer KYC forms, transaction records, and more. Manual processing of these documents is time-consuming and prone to errors. Automated document classification can help in sorting these documents efficiently, speeding up processing times, reducing errors, and improving customer service.
Similarly, in the healthcare sector, document classification is of paramount importance. Hospitals and healthcare providers handle a wide variety of documents, including patient records, medical reports, insurance forms, and prescription notes. Efficient document classification can aid in the quick retrieval of patient information, streamline insurance claim processing, and ensure the proper tracking of prescriptions and medical reports.
By leveraging advanced techniques like vision language models, document classification can be made more accurate and efficient, paving the way for more intelligent document processing systems. This not only improves operational efficiency but also enhances decision-making and strategic planning in these sectors. Thus, document classification serves as a cornerstone of digital transformation in industries like banking and healthcare.
How Document Classification Works in Machine Learning:
In a typical machine learning pipeline, text is first extracted from the document (via OCR for scanned images), converted into numerical features such as bag-of-words or TF-IDF vectors, and then passed to a classifier such as Naive Bayes, a support vector machine, or a neural network trained on labeled examples. New documents are assigned whichever class the trained classifier predicts, as sketched below.
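To make this concrete, here is a minimal sketch of such a traditional pipeline using scikit-learn; the training texts and labels are hypothetical placeholders, not real data.

# A minimal sketch of a classical document-classification pipeline.
# The training texts and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Loan application: applicant requests a personal loan of 50,000",
    "KYC form: customer name, address, and identity proof details",
    "Transaction record: debit of 120.00 posted on 2023-05-01",
]
train_labels = ["loan_application", "kyc_form", "transaction_record"]

# TF-IDF features feeding a linear classifier is the classic baseline.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_texts, train_labels)

# Predict the class of an unseen document's extracted text.
print(classifier.predict(["Applicant seeks a home loan of 200,000"]))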
Limitations of Current Methods:
- Limited context understanding: keyword- and frequency-based features often miss the meaning behind the words.
- Poor handling of unstructured data: scanned forms, tables, and mixed layouts are hard to capture with plain text features.
- Weak multilingual support: models trained on one language rarely transfer well to others.
- Scalability constraints: feature engineering and retraining become costly as document volume and variety grow.
- Heavy dependence on labeled data: traditional classifiers need large, manually annotated training sets.
These limitations highlight the need for more advanced techniques like vision language models that can understand context, handle unstructured data, work with multiple languages, scale efficiently, and work with less labeled data. These models hold great promise for the future of document classification.
Vision Language Models (VLMs) are a new frontier in the field of artificial intelligence that combines the strengths of computer vision and natural language processing. These models are designed to understand and generate information from both visual (images, videos) and textual data.
Here’s a high-level overview of how they work:
- A vision encoder (typically a Vision Transformer) converts the input image into a sequence of patch embeddings.
- A projection or fusion module maps those visual embeddings into the same representation space as the language model’s token embeddings.
- A language model then attends jointly over the projected image tokens and the text tokens, allowing it to answer questions about the image or generate image-grounded text.
- The components are trained (or fine-tuned) on large collections of paired image-text data so that the visual and textual representations align.
A toy sketch of this architecture appears below.
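The following toy PyTorch module illustrates these stages end to end. All dimensions, module choices, and the use of a plain encoder (rather than a causal decoder) are illustrative stand-ins, not the design of any real VLM.

# Toy sketch of the typical VLM pipeline: vision encoder -> projector ->
# language model over concatenated image and text tokens.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_dim=256, text_dim=512, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(16 * 16 * 3, vision_dim)  # stand-in for a ViT
        self.projector = nn.Linear(vision_dim, text_dim)          # aligns the modalities
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)      # simplified "language model"
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, patches, token_ids):
        img_tokens = self.projector(self.vision_encoder(patches))  # (B, P, text_dim)
        txt_tokens = self.text_embed(token_ids)                    # (B, T, text_dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)           # image tokens first
        return self.lm_head(self.lm(seq))                          # per-token vocabulary logits

toy = ToyVLM()
patches = torch.randn(1, 9, 16 * 16 * 3)    # 9 flattened 16x16 RGB patches
token_ids = torch.randint(0, 1000, (1, 5))  # 5 prompt tokens
print(toy(patches, token_ids).shape)        # torch.Size([1, 14, 1000])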
The power of VLMs lies in their ability to understand the intricate interplay between visual and textual data, which allows them to outperform models that only use text or vision. They can handle unstructured data, understand context, work with multiple languages, and scale efficiently, making them a promising tool for tasks like document classification. However, like all models, they are not without their limitations and challenges, which include computational cost, data privacy, and the need for diverse and representative training data. Despite these challenges, the potential of VLMs in transforming industries like banking and healthcare is immense.
Using Vision Language Models with transformers
You can run inference with LLaVA-NeXT using the LlavaNext classes in transformers, as shown below.
Let’s initialize the model and the processor first.
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch

# Run on GPU if one is available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# The processor handles both image preprocessing and text tokenization.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

# Load the model in half precision to roughly halve its memory footprint.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
model.to(device)
We now pass the image and the text prompt to the processor, and then pass the processed inputs to generate. Note that each model uses its own prompt template; be careful to use the right one, as a mismatched template can degrade performance.
from PIL import Image
import requests

url = "Image path"  # placeholder: replace with the URL of your document image
image = Image.open(requests.get(url, stream=True).raw)

# llava-v1.6-mistral expects the [INST] ... [/INST] template with an <image> placeholder.
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=100)
Call processor.decode to turn the output tokens back into text.
print(processor.decode(output[0], skip_special_tokens=True))
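As a side note on the prompt-template caveat above: recent versions of transformers can build the model-specific prompt for you from a structured conversation via the processor's chat template. A minimal sketch, assuming your installed transformers version and this model's processor support apply_chat_template:

# Build the model-specific prompt from a structured conversation instead of
# hard-coding the template string (assumes apply_chat_template support).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)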
Case Studies/Examples
We tried different document types, ranging from a news article and a hotel receipt to a bank form, and the model recognized all of them correctly without any additional retraining.
With further prompt refinement or few-shot learning, we may achieve even better results.
For example, instead of prompting:
"What is the document type?"
we can ask:
"Which category does the document belong to? Choose from the following list: ["Income Asset Form", "Account Opening Form"], and answer "Other" if the document type is not in the list."
Vision Language Models (VLMs) have shown great promise in document classification. However, there’s always room for improvement and further research. Here are some potential areas to explore:
- Multimodal learning: richer joint training over text, layout, and imagery.
- Transfer learning: adapting pretrained VLMs to new document domains with little labeled data.
- Interpretability: explaining why a document was assigned to a given class.
- Model architecture optimization: reducing the computational cost of inference.
- Noise robustness: handling scans that are skewed, blurred, or low resolution.
- Real-time processing: meeting the latency requirements of production document pipelines.
Conclusion
In conclusion, Vision Language Models (VLMs) have ushered in a new era in the field of document classification. Their ability to jointly interpret visual and textual cues in documents has significantly improved the accuracy and efficiency of document classification tasks. However, as with any technology, there is always room for improvement and innovation. Future research could focus on areas such as multimodal learning, transfer learning, interpretability, optimization of model architecture, noise robustness, and real-time processing. As we continue to explore and innovate, the potential of VLMs in document classification is bound to reach new heights. The journey of illuminating the future of document classification with VLMs is just beginning, and it promises to be an exciting and transformative one.