GPT-4 Accepts Image Inputs, Here’s What That Means for IDP

OpenAI, the artificial intelligence (AI) research laboratory behind the popular generative AI tools DALL-E and ChatGPT, just announced GPT-4. According to the company, the newest iteration of its generative pre-trained transformer (GPT) language model is only subtly different from last year's GPT-3.5 in casual conversation. As the tasks thrown at GPT-4 become increasingly complex, however, the contrast with the older model becomes starker.

To demonstrate this, OpenAI researchers use a variety of benchmarks, including exams originally designed to test human knowledge across various subjects (e.g., AP Calculus BC, the Uniform Bar Exam, GRE Writing, and the LSAT). Unsurprisingly, GPT-4 outperforms GPT-3.5 in most instances. What is even more intriguing is that another new feature of the model, its ability to accept image inputs, leads to even greater performance gains on some exams.

GPT-4 is multimodal, which means it accepts different modalities of data. Specifically, the model can generate text outputs, including natural language and code, from inputs that combine text and images. According to OpenAI, GPT-4 displays similar capabilities on documents containing text alongside photographs, diagrams, or screenshots as it does on text-only inputs. By contrast, GPT-3.5 was limited to a single modality: text.
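To make this concrete, here is a minimal sketch of what a combined text-and-image request could look like with the OpenAI Python SDK. Note the assumptions: image inputs were not yet publicly available at the time of the announcement, and the model name and file path below are purely illustrative.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a scanned document so it can be sent inline with the prompt.
with open("invoice.png", "rb") as f:  # illustrative file path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative; image support was not public at launch
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the invoice number, date, and total from this document."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Because the image travels in the same request as the instruction, the model can ground its answer in the document's visual layout as well as its text.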

[Image: Exam results from the GPT-4 Technical Report.]

For some who were anticipating a multimodal GPT-4 that would support audio and video inputs (and potentially diverse forms of output), this development might be disappointing. But as a company primarily focused on Intelligent Document Processing (IDP), we're incredibly excited. We've previously written about ChatGPT and the future of IDP, where we speculated that Large Language Models (LLMs) had the potential to improve data extraction accuracy, answer natural language queries about critical business information, and simplify the creation of AI applications. The new information revealed about GPT-4 reinforces the potential of those ideas.

In the GPT-4 developer livestream, OpenAI demonstrated the model's potential when it comes to documents. For example, Greg Brockman, President and Co-Founder of OpenAI, showed how GPT-4 could not only extract information from a rough, hand-drawn website mockup, but also take it a step further by converting it into working HTML. This showcases an impressive ability to recognize handwritten characters, and it also demonstrates an understanding of context and intent that has broad applicability in document processing and analysis.

[Image: A demonstration of GPT-4 converting a handwritten note into working HTML.]

Riding the exponential wave

Moore's law is the observation that the number of transistors on a microchip doubles roughly every two years, meaning the speed and capability of our computers increase on that schedule while costs decrease. By comparison, the compute used to train the largest AI models has been doubling roughly every 3.5 months. This is why we built a platform to harness the best available models rather than attempt to build and maintain proprietary ones (and compete with industry giants like Microsoft and Amazon). Customers of super.AI get rapid access to new models like GPT-4, optimized for processing complex documents.
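To see just how different those growth rates are, here is a quick back-of-the-envelope comparison, assuming clean exponential doubling at the stated intervals:

```python
# Growth over a two-year (24-month) window under each doubling interval.
months = 24
moore_growth = 2 ** (months / 24)   # transistors double every 24 months: ~2x
ai_growth = 2 ** (months / 3.5)     # AI compute doubles every 3.5 months: ~116x

print(f"Moore's law over two years: {moore_growth:.0f}x")
print(f"AI compute over two years:  {ai_growth:.0f}x")
```

At that pace, any single in-house model risks obsolescence within months, which is exactly why a platform that can swap in the current best model makes sense.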

Expect to see:

  • Improved zero- and few-shot learning. Limited resources and data shouldn’t stall automation initiatives, and testing indicates generative pre-trained models like GPT-4 will make it possible to do more with less data and training.
  • Increased automation rates. Thanks to better extraction accuracy, GPT-4 can augment, or even entirely replace, leading optical character recognition (OCR) models, helping you achieve higher automation rates in less time.
  • Complex question answering. GPT-4 is capable of processing long-form text (up to 25k words), building upon ChatGPT’s question answering abilities. Scour massive document datasets and answer complex questions about them posed in natural language.
  • Document summarization. Save time reading through long, complex documents in search of key information; instead, simply ask GPT-4 to provide a summary for you (see the sketch after this list).
  • Document classification. Organizing massive document stores is a complex undertaking even with advanced machine learning techniques. GPT-4's massive training dataset makes it capable of accurately classifying varied documents, vastly improving information retrieval.
  • Faster AI app development. OpenAI showcased GPT-4’s ability to write and troubleshoot code during its developer livestream. This functionality can be used to write custom AI data programs tailored to the unique needs of your business. We are also developing a future “prompt builder” that will allow users to build new data programs using natural language.
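To give a taste of the question answering and summarization items above, here is a minimal sketch using the OpenAI Python SDK. The model name, prompts, and helper function are illustrative assumptions, not super.AI platform code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_document(document_text: str) -> str:
    """Ask GPT-4 for a concise summary of a long document (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You summarize business documents concisely and accurately."},
            {"role": "user",
             "content": f"Summarize the key points of this document:\n\n{document_text}"},
        ],
    )
    return response.choices[0].message.content

# Usage: summary = summarize_document(open("contract.txt").read())
```

The same pattern extends to classification: swap the system prompt for one that asks the model to assign each document to a fixed set of categories.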

These are just some of our initial ideas for the enhanced functionality GPT-4 can bring to the super.AI platform. In the coming weeks and months, we will continue to discover new use cases for the latest and greatest large language models. Stay tuned for additional updates, and don't hesitate to reach out if you're interested in learning more about how our platform can benefit your document automation use case.
