What can GPT-4V do?
By now, you've probably heard of GPT-4V -- the upgraded version of GPT-4 that can process images as well as text. It's not a separate image recognition system -- instead, it tokenizes images and processes them using the same pipeline that GPT-4 uses for text. It's pretty neat!
So, I got access and tried a few vision tasks.
First, here's my grading scale:
[Image: grading scale]
Now, the grades for each task.
Identifying people and things:
This is where it really shines.
If you have a thing you want to identify, GPT-4V is the right tool for the job. It can identify an incredible variety of common objects (including plants and animals) with stunning accuracy. It even does pretty well on objects from reddit.com/r/whatisthisthing, a forum where humans post objects they can't recognize!
Listing items in groups:
With a large group of objects, like a shelf of books, it consistently misses about 10-20% of them. Occasionally it also hallucinates extra items. For example, given a picture of my bookshelf of ~20 books, it invented a few new books -- making the result useless.
I gave "Flowers in a Bouquet" an A+, even though it missed a few flowers, because it knew the names of some other flowers that I didn't know!
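If you want to put numbers on the "missed" and "hallucinated" items, here is a minimal sketch of how you might score a listing by hand. It is not how I graded the tasks above. The function name `score_listing` and the sample data are made up for illustration (the last "book" in the model output is an invented title standing in for a hallucination), and the matching is a simple case-insensitive exact match, which is an assumption that will be too strict for real titles.

```python
def score_listing(actual, reported):
    """Compare the items really in the photo with the items the model listed."""
    # Normalize case and whitespace so "dune" matches "Dune".
    def norm(s):
        return " ".join(s.lower().split())

    actual_set = {norm(a) for a in actual}
    reported_set = {norm(r) for r in reported}

    missed = sorted(actual_set - reported_set)        # in the photo, not listed
    hallucinated = sorted(reported_set - actual_set)  # listed, not in the photo
    miss_rate = len(missed) / len(actual_set) if actual_set else 0.0
    return {"missed": missed, "hallucinated": hallucinated, "miss_rate": miss_rate}


# Hypothetical data: a five-book shelf and what a model might report.
shelf = ["Dune", "Neuromancer", "Emma", "Ulysses", "Beloved"]
model_output = ["dune", "Neuromancer", "Emma", "Beloved", "The Glass Harbor"]

result = score_listing(shelf, model_output)
print(result)
```

In this toy case the model misses one of five books (a 20% miss rate) and invents one. A single invented item is enough to make the list untrustworthy, because you can't tell which entries are real without checking the photo yourself.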
Describing people:
This is a bit tricky because the AI is "censored" so as not to perpetuate stereotypes. But if you convince it the person is fictional, it will describe the person. It correctly describes face shape, hair, eyes, clothes, makeup, jewelry, etc., with very occasional glitches. Overall, this is a grade A.
This is perhaps the most commercially useful feature of GPT-4V. Imagine looking at a picture of a person, making some educated guesses about their background, and marketing to them appropriately! But of course this is the most ethically dangerous territory as well.
Reading text (OCR):
This was the most astonishing aspect of GPT-4V to me. I'm used to OCR systems that can do a great job on typed text, but can't handle text in the real world. GPT-4V can read altered, distorted, messy, cursive, or heavily stylized English text with almost perfect reliability. It's really surprising that it does this well on a task that it's not even designed for.
(I also ran some tests on Chinese characters, and it couldn't read most typewritten Chinese.)
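"Almost perfect reliability" is my impression, not a measurement. If you wanted to measure it, one common approach is the character error rate: the edit distance between the true text and the transcription, divided by the length of the true text. The sketch below is a plain Levenshtein implementation, with made-up sample strings and function names chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, transcription):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, transcription) / len(reference)


# Hypothetical example: one wrong character in a 15-character note.
reference = "meet me at noon"
transcription = "meet me at moon"
cer = character_error_rate(reference, transcription)
print(f"CER: {cer:.3f}")
```

Lower is better, and 0 means a perfect transcription. You would need to type out the true text of each test image yourself to use this.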
Spatial relationships:
It's obvious to me that GPT-4V has almost no sense of spatial relationships. It can "see" the objects in an image, but it doesn't know how they relate to each other in 2D or 3D. It's very weird behavior, since it does so well at many other tasks.
Creative tasks:
It did each of these tasks about as well as I would. (I'm no fashionista!)
I've tweeted almost all these experiments with screenshots over the past few days. If you'd like to see samples, let me know.
Conclusion
Overall, GPT-4V has an almost spooky ability to describe individual people, animals, or things. It can read incredibly messy handwriting, talk about a picture of a person, or explain a work of art.
But it's hilariously bad at spatial relationships and at images with a lot of distinct objects. If you give it a map, it will get lost. If you give it a shelf of books, it will forget to list many of the books. It can't play Where's Waldo, even with a heavily cropped page that makes it very obvious where Waldo is. It can't read an analog clock, which is a task we teach to small children.
Is this what you expected from AI? Is there anything else you want to try? And where do you think it will go in the future?
Let me know in the comments!
I'm guessing from what you've seen that it's also pretty bad at interpreting things like graphs?