GPT-4 Accepts Image Inputs, Here’s What That Means for IDP
OpenAI, the artificial intelligence (AI) research laboratory behind the popular generative AI tools DALL-E and ChatGPT, just announced GPT-4. In casual conversation, the company says, the newest iteration of its generative pre-trained transformer (GPT) language model is only subtly different from last year's GPT-3.5. As the tasks thrown at GPT-4 become more complex, however, the contrast with the older model becomes starker.
To demonstrate this, OpenAI's researchers used a variety of benchmarks, including exams originally designed to test human knowledge across various subjects (e.g., AP Calculus BC, the Uniform Bar Exam, GRE Writing, and the LSAT). Unsurprisingly, GPT-4 outperforms GPT-3.5 in most instances. What's more intriguing is that another new feature of the model, its ability to accept image inputs, leads to even greater performance gains.
GPT-4 is multimodal, meaning it accepts different modalities of data. Specifically, the model can generate text outputs, including natural language and code, from inputs that combine text and images. According to OpenAI, it shows similar levels of proficiency across a diverse range of domains, including documents that contain text alongside photographs, diagrams, or screenshots, as it does on text-only inputs. By contrast, GPT-3.5 was limited to a single modality: text.
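To make this concrete, here is a minimal sketch of what a combined text-and-image request might look like, using the image-input format OpenAI documented for its chat completions API. The model name, document URL, and prompt are illustrative placeholders, and image input access was limited at launch:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical IDP-style request: extract key fields from a scanned invoice.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the vendor name, invoice number, and total amount from this document."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample-invoice.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```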
For those anticipating a multimodal GPT-4 that would support audio and video inputs (and potentially diverse forms of output), this development might be disappointing. But as a company primarily focused on Intelligent Document Processing (IDP), we're incredibly excited. We've written about ChatGPT and the future of IDP before, where we speculated that Large Language Models (LLMs) had the potential to improve data extraction accuracy, respond to natural language queries about critical business information, and simplify the creation of AI applications. What's been revealed about GPT-4 reinforces those ideas.
In the GPT-4 developer livestream, OpenAI demonstrated the model's potential for documents. For example, Greg Brockman, President and Co-Founder of OpenAI, showed how GPT-4 could not only extract information from a rough, hand-drawn website mockup but also convert it into working HTML. This showcases an impressive ability to recognize handwritten characters and demonstrates an understanding of context and intent that has broad applicability in document processing and analysis.
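A rough recreation of that demo is sketched below, assuming the same chat completions image format as above; the file name and prompt are hypothetical, and local images are passed as base64-encoded data URLs:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Hypothetical recreation of the livestream demo: encode a photo of a
# hand-drawn mockup as a data URL and ask GPT-4 to turn it into HTML.
with open("mockup.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Convert this hand-drawn mockup into a single working HTML file."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

# Save the model's reply, which should contain the generated HTML.
with open("mockup.html", "w") as f:
    f.write(response.choices[0].message.content)
```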
Riding the exponential wave
Moore's law observes that the number of transistors on a microchip doubles every two years, meaning the speed and capability of our computers increase every two years while costs decrease. By comparison, the compute used to train state-of-the-art AI models is doubling roughly every 3.5 months. This is why we built a platform to harness the best models rather than attempt to build and maintain proprietary ones (and compete with industry giants like Microsoft and Amazon). super.AI customers get rapid access to new models like GPT-4, optimized for processing complex documents.
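To put the two doubling rates side by side, here is a quick back-of-the-envelope comparison over a single two-year window (the 3.5-month figure is the one cited above):

```python
# Growth over a two-year window under each doubling period.
months = 24

moore_doublings = months / 24   # transistors: ~one doubling per 2 years
ai_doublings = months / 3.5     # AI training compute: ~one doubling per 3.5 months

print(f"Moore's law growth over 2 years: {2 ** moore_doublings:.0f}x")  # ~2x
print(f"AI compute growth over 2 years:  {2 ** ai_doublings:.0f}x")     # ~116x
```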
Expect to see:

- More accurate data extraction from complex, unstructured documents
- Natural language querying of the critical business information locked inside documents
- Simpler, faster creation of document-focused AI applications
These are just some of the initial ideas we have for the enhanced functionality GPT-4 can bring to the super.AI platform. In the coming weeks and months, we will continue to discover new use cases for the latest and greatest large language models. Stay tuned for updates, and don't hesitate to reach out if you're interested in learning how our platform can benefit your document automation use case.