What can GPT-4V do?

What can GPT-4V do?

By now, you've probably heard of GPT-4V -- the upgraded version of GPT-4 that can process images as well as text. It's not a separate image recognition system -- instead it tokenizes images and processes them using the same pipeline that GPT-4 does for text. It's pretty neat!

So, I got access and tried a few vision tasks.

First, here's my grading scale:

  • A+: Better than I could do
  • A: As good as I could do
  • B: Worse than I could do, but still useful
  • C: Not useful, but it is clearly seeing a few things in the image
  • F: Totally useless, it doesn't seem to be understanding the image at all

Example

An example of using GPT-4V to describe some cats

Now, the grades for each task.

Identifying people and things:

This is where it really shines.

  • Dogs: A+ (named each breed)
  • Cats: A
  • Wild animals: A+
  • Plants: A
  • Houses: B
  • Cars: A+
  • Trains: A+
  • Airplanes: A+
  • Electronic components: A+
  • Mystery objects from reddit.com/r/whatisthisthing: A

If you have a thing you want to identify, GPT-4V is the right tool for the job. It can identify an incredible variety of common objects (including plants and animals) with stunning accuracy. It even does pretty well on objects from reddit.com/r/whatisthisthing, a forum where humans post objects they can't recognize!

Listing items in groups:

  • Flowers in a bouquet: A+
  • Listing people in a group of people: C
  • Listing books in a bookshelf: F (it hallucinated several books)
  • Listing food in a full refrigerator: C
  • Listing ingredients in a cooked dish: C

Consistently, with a large group of objects like a shelf of books, it misses about 10-20% of the objects. Occasionally it will hallucinate extra items. For example, on a picture of my bookshelf of ~20 books, it invented a few new books -- making the result useless.

I gave "Flowers in a Bouquet" an A+, even though it missed a few flowers, because it knew the names of some other flowers that I didn't know!

Describing people:

This is a bit tricky because the AI is "censored" so as not to perpetuate stereotypes. But if you convince it the person is fictional, it will describe the person. It correctly describes face shape, hair, eyes, clothes, makeup, jewelry, etc., with very occasional glitches. Overall, this is a grade A.

This is perhaps the most commercially useful feature of GPT-4V. Imagine looking at a picture of a person, making some educated guesses about their background, and marketing to them appropriately! But of course this is the most ethically dangerous territory as well.

Describing a person (me) -- notice you have to "trick" it by saying the person is fictional


Reading text (OCR):

  • Typed text: A
  • Cursive handwriting: A+.
  • Messy young child's handwriting: A+.
  • Chinese text: C. (Well, it's better than me, but far worse than a Chinese speaker.)

This was the most astonishing aspect of GPT-4V to me. I'm used to OCR systems that can do a great job on typed text, but can't handle text in the real world. GPT-4V can read altered, distorted, messy, cursive, or heavily stylized English text with almost perfect reliability. It's really surprising that it does this well on a task that it's not even designed for.

(I did some tests on Chinese characters too though, and it couldn't understand most typewritten Chinese.)

Spatial relationships:

  • Rotating objects: F
  • Reading an analog clock: F
  • Navigating based on a fictional map: F
  • Identifying unusual objects based on context: A (for example, a chess board made of South Park figurines -- this only makes sense if it understands the spatial relationship of the figurines)
  • Playing GeoGuessr (game where you recognize a country based on picture): F
  • Imitating a web design: F
  • Identifying the leader of a running race: F
  • Playing Where's Waldo: F
  • Describing a complex graph/network diagram: C
  • Describing a complex numerical chart: F

It's obvious to me that GPT-4V has almost no sense of spatial relationships. It can "see" the objects in an image, but it doesn't know how they relate to each other in 2D or 3D. It's very weird behavior, since it does so well at many other tasks.

Creative tasks:

  • Describing a cartoon: A
  • Describing a work of art: A
  • Giving fashion advice based on a picture of a person: A

It did each of these tasks about as well as I would. (I'm no fashionista!)

I've tweeted almost all these experiments with screenshots over the past few days. If you'd like to see samples, let me know.

It's pretty good at identifying flowers in a bouquet


It can read extremely messy handwriting
It can't read a clock
It can't navigate this fictional train map


Conclusion

Overall, GPT-4V has an almost spooky ability to describe individual people, animals, or things. It can read incredibly messy handwriting, talk about a picture of a person, or explain a work of art.

But it's hilariously bad at spatial relationships, or images with a lot of distinct objects. If you give it a map it will get lost. If you give it a shelf of books, it will forget to list many of the books. It can't play Where's Waldo with a greatly cropped page that makes it very obvious where Waldo is. It can't read an analog clock, which is a task we teach to small children.

Is this what you expected from AI? Is there anything else you want to try? And where do you think it will go in the future?

Let me know in the comments!

I'm guessing from what you've seen that it's also pretty bad at interpreting things like graphs?

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了