Test Driving GPT-4o
Michael Spencer
A.I. Writer, researcher and curator - full-time Newsletter publication manager.
Hello Everyone,
Today our topic is more applied, namely testing GPT-4o.
Some of you are comparing GPT-4o to the 2013 movie “Her,” in which a lonely introvert named Theodore, played by Joaquin Phoenix, is seduced by a virtual assistant named Samantha, voiced by Scarlett Johansson. With that comparison in mind, I’ve been reading a few testers of GPT-4o.
So are GPT-4o and OpenAI’s omnimodal Voice-AI at Her’s level?
Is GPT-4o at Her’s level, 11 years later?
Now I’m not a big fan of OpenAI’s marketing approach here:
Still, let’s consider this seriously. I believe I have found a fairly enthusiastic test-driving beta-tester of GPT-4o in Sabrina Ramonov. Check out her Newsletter here.
If you are interested in the capabilities of GPT-4o, you might want to take a look at this:
Test driving GPT-4o.
Sabrina served as CEO/Founder of the AI company Qurious, acquired in 2021 by Pegasystems (NASDAQ:PEGA), a leader in robotic process automation and enterprise AI. Nowadays, Sabrina angel invests in startups and writes about Gen AI at www.sabrina.dev
Gen AI Engineering and Entrepreneurship
One of the leading GPT-4o testers I could find after OpenAI’s Spring Update event is Sabrina, who brings a unique background to all of this.
How does GPT-4o fare in Mimicking Accents?
Source: The TikTok of Sabrina Ramonov:
In today’s post, Sabrina explores ChatGPT-4o’s multimodal capability by attempting to merge images to create a "catdog". She tries various prompt engineering techniques, such as chain of thought variations, iterative prompting, code prompting, and meta-prompting. But she encounters numerous challenges in generating a consistent and accurate catdog image.
Finally, if you want to support me and the amazing guest contributors I’m trying to line up here on AI Supremacy, you can get more deep dives:
Sabrina’s imagination in testing this is pretty brilliant.
ChatGPT-4o vs. Catdog
Testing GPT-4o’s Ability to Merge Concepts and Images
By Sabrina Ramonov
In this post, I test ChatGPT-4o’s ability to merge concepts and images.
I give GPT-4o a picture of a cat and a picture of a dog, then ask it to create a catdog!
catdog
catdog
alone in the world was a little catdog
- Nickelodeon TV show, CatDog
Experiments
The new OpenAI model, ChatGPT-4o, is multimodal.
The publicly available version supports text and image modalities, with voice and video modalities coming soon.
I want to test GPT-4o’s ability to merge concepts and images in order to determine its potential for creative thinking and manipulating concepts.
In my absurd quest to concoct my very own catdog, here are the experiments I run on GPT-4o:
Which approach will reign supreme?
Take your best guess now and see if you’re right!
Input Images
Here are the source input images I give to ChatGPT-4o:
cat
dog
catdog
Other pieces from my newsletter:
Editor’s note: Sabrina uncovers so much in her test-driving of AI systems. Since the launch of GPT-4o she’s been one of my go-to tinkerers, reporters and analysts on this. Here’s a rundown of what she’s been up to lately:
Emerging
Agents
SAT Math
ChatGPT-4o doing Math
Image Transformation
Image Manipulation, Image to Text Inversion
Naive Prompt
Starting with a naive prompt, I don’t supply any definitions, clarification, or context.
I simply provide the cat and dog images as inputs, then ask ChatGPT to “produce a catdog”.
I’m curious what ChatGPT-4o thinks a “catdog” is.
Interestingly, GPT-4o states it “combines features” from the 2 images, which it recognizes correctly as a Samoyed dog and a Russian blue cat.
GPT-4o combines their traits fairly well in a smooth, almost uniform blend.
Conceptually, it’s not the catdog I’m looking for.
But a reasonable and decent start.
Naive Prompt with Description
Next, I define a catdog.
You can see different variations and how my definition evolves.
The text seems correct, but the image is not.
I’ve seen this before with ChatGPT-4o and spatial reasoning IQ tests:
Multimodal GPT-4o struggles with precise image generation.
Despite being given a straightforward description, such as a simple, well-defined shape, GPT-4o often generates images dramatically inconsistent with that description.
I simplify the description:
A catdog should have 2 heads - cat and dog.
ChatGPT-4o proceeds to make a 2-headed pet!
Super cute, but the 2nd head is not in the right place.
I didn’t provide clear instructions on the configuration or location of the 2nd head, so I can’t blame GPT-4o for getting it wrong.
But I had assumed the popular Nickelodeon TV show, CatDog, would be in its training dataset. There are plenty of CatDog images if you do a quick Google search.
Unfortunately, trying to specify where the 2nd head should be located does not help…
Chain of Thought Variations
Next, I decide to incorporate Chain of Thought prompting.
I request ChatGPT-4o to perform these tasks in a specific order:
Strange!
GPT-4o completely disregarded the input images (our Samoyed dog and Russian Blue cat)!
It used a different dog and different cat to make its version of catdog.
Yet, the description of catdog is correct!
“...a conjoined creature with one half being a cat and the other half being a dog, with two heads located on opposite sides of its body.”
Notice the word “conjoined” in the prompt generated by GPT-4o in step 2.
Perhaps this term made the image look more like a conjoined twin hybrid?
I’ve seen this behavior before with GPT-4o – swapping out a single term can completely transform its output from meaningless to meaningful, or vice versa.
Although not the catdog of my dreams, GPT-4o’s catdog looks impressively “natural”, demonstrating a smooth blending of concepts and images.
Next, I add one of my favorite, most simple, and most useful prompt tweaks:
“Take a deep breath and explain your process step-by-step before generating the image.”
The generated image is an adorable cartoon surprise!
Similar to the last experiment, GPT-4o again disregarded the input images I provided. I’m not sure why GPT-4o decided to make a cartoon instead of a realistic-looking catdog.
But I give GPT-4o credit where it’s due:
The 2nd head is moving back towards the tail, where it should be in a proper catdog.
The problem is: the tail shouldn’t be there at all.
Negative Prompting
To fix this, I apply negative prompting to remove the tail.
Negative prompting is simply telling ChatGPT, or any other LLM, to not do certain things.
In this case, I want GPT-4o to not generate a tail.
Generally, negative prompting is less productive than clearly stating what we do want.
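As a rough illustration of that advice, here is a minimal sketch that restates an exclusion as a positive description before it goes into the prompt. The helper `to_positive_prompt`, the `POSITIVE_REWRITES` table, and the wording of each rewrite are assumptions made for this example, not prompts used in the experiments.

```python
# Hypothetical mapping from "don't" instructions to positive restatements.
# The phrasings are illustrative assumptions, not prompts from the experiment.
POSITIVE_REWRITES = {
    "no tail": "the body ends in a second head where a tail would normally be",
    "no extra legs": "the body has exactly four legs",
}


def to_positive_prompt(base_prompt: str, exclusions: list[str]) -> str:
    """Append positive restatements of each exclusion to the base prompt.

    Exclusions without a known rewrite fall back to a plain negative
    instruction, so nothing the user asked for is silently dropped.
    """
    clauses = []
    for exclusion in exclusions:
        rewrite = POSITIVE_REWRITES.get(exclusion)
        clauses.append(rewrite if rewrite else f"do not include: {exclusion}")
    return base_prompt.rstrip(".") + ". Requirements: " + "; ".join(clauses) + "."


prompt = to_positive_prompt("Generate a catdog", ["no tail"])
print(prompt)
```

Whether a positive restatement actually works better for image generation is something to test per model; the sketch only shows the mechanical rewrite.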
Yikes!
ChatGPT-4o continues to:
Time to change up my approach…
Prompt with Interactive Feedback
Next, I give ChatGPT-4o interactive feedback to see if I can get what I want after a few turns.
The following images are all part of the same GPT-4o session:
First, I directly tell ChatGPT:
Take inspiration from the cartoon Catdog.
I’m not sure how to communicate this request any more directly and explicitly. I feel like I’m being pretty clear. But the output still has a tail…
So I’ll try to get rid of it, again, through more interactive prompting.
Yay!
The 2nd image is much closer to a true catdog, much closer!
… but I still see that little gray tail.
I ask ChatGPT-4o to analyze the image it generated. Indeed, it confirms:
“The catdog in the image has a tail.”
Wow, after another iteration we’re getting pretty close!
Just that darn tail that keeps getting in my way.
I wonder if it’s because dogs have tails, so removing the tail goes too much against ChatGPT-4o’s training dataset.
ChatGPT-4o keeps trying to remove the tail, but for some reason, it keeps failing.
Again and again! I ask GPT-4o to loop and keep trying until it succeeds in removing the tail.
...eventually, I hit the rate limit!
Prompting with interactive feedback has been a productive avenue to explore.
But I want to change up my strategy because it feels like diminishing returns at this point.
We’re getting close, I can feel it!
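The feedback loop in this section can be sketched in a few lines. This is only an illustration under stated assumptions: `generate` and `has_defect` are hypothetical callables standing in for "ask the model for an image" and "ask the model whether the image has a tail", and `max_attempts` plays the role of the rate limit.

```python
from typing import Callable


def refine_until_ok(
    generate: Callable[[str], str],
    has_defect: Callable[[str], bool],
    prompt: str,
    feedback: str,
    max_attempts: int = 5,
) -> tuple[str, int, bool]:
    """Regenerate with feedback until the defect is gone or attempts run out.

    Returns (last_image, attempts_used, succeeded).
    """
    image = generate(prompt)
    for attempt in range(1, max_attempts + 1):
        if not has_defect(image):
            return image, attempt, True
        if attempt < max_attempts:
            # Feed the same correction back in, as in a multi-turn chat.
            prompt = f"{prompt}\nFeedback: {feedback}"
            image = generate(prompt)
    return image, max_attempts, False


def fake_generate(p: str) -> str:
    # Stand-in model: drops the tail only after three rounds of feedback.
    return "catdog" if p.count("Feedback:") >= 3 else "catdog with tail"


def fake_has_tail(img: str) -> bool:
    return "tail" in img


image, attempts, ok = refine_until_ok(
    fake_generate, fake_has_tail, "Draw a catdog.", "Remove the tail."
)
print(image, attempts, ok)
```

Capping the attempts up front makes the stopping point explicit, instead of waiting to run into a rate limit.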
Code Prompting
Code prompting is a prompt engineering technique in which you describe a prompt using code instead of natural language. It can help LLMs tackle logic and reasoning tasks.
Let’s try it out. Observe how my prompt consists of basic pseudo-Python code:
I try to force constraints with predicates: has_two_heads(catdog) == True
I try to specify a catdog is: cat_img/2 + dog_img/2
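To make the idea concrete, here is a hypothetical reconstruction of what such a code-style prompt could look like, assembled as plain text. It is not the exact prompt from the experiment. The names `build_code_prompt`, `load`, `generate_image`, `has_tail`, and the file names are assumptions; only `has_two_heads(catdog) == True` and `cat_img/2 + dog_img/2` come from the description above.

```python
# Illustrative only: a code-style prompt assembled as plain text.
# The pseudo-code inside the string is never executed; it is sent to the
# model as the prompt. Names such as load and generate_image are hypothetical.
def build_code_prompt(constraints: dict[str, object]) -> str:
    lines = [
        "cat_img = load('cat.png')",
        "dog_img = load('dog.png')",
        "catdog = cat_img/2 + dog_img/2",
    ]
    # Each constraint becomes a predicate the model is asked to satisfy.
    for predicate, expected in constraints.items():
        lines.append(f"assert {predicate}(catdog) == {expected!r}")
    lines.append("generate_image(catdog)")
    return "\n".join(lines)


prompt = build_code_prompt({"has_two_heads": True, "has_tail": False})
print(prompt)
```

The same builder covers numeric constraints: passing something like `{"count_heads": 2, "count_tails": 0}` (again hypothetical names) restricts counts instead of asserting True or False.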
The output is quite pleasant visually… but still not right.
In the above example, instead of predicates on True and False, I restrict the number of heads and tails.
Finally, the generated image does not have a tail!
But it also doesn’t put the 2nd face in the right place.
Here’s an attempt with blending mechanics:
catdog = cat.head + blend(cat.body, dog.body) + dog.head
Sadly, the tail is still present. Ugh!
Let’s try a simpler task.
I ask ChatGPT-4o to replace the dog’s head with a cat’s head using code:
new_dog.head = cat.head
The generated image is promising! Using the same logic and similar pseudo-Python code, I’ll try to replace the dog’s tail with a cat’s head. This way, the cat’s head should be in the right position, and there should no longer be a tail.
To my disappointment, new_dog.tail = cat.head does not work.
The image doesn’t have a tail, which is awesome, but the 2nd head is in the wrong spot.
Still, pretty cool! Good attempt, ChatGPT.
Next, I introduce a creature variable. I want to test whether creating a “new animal” will help GPT-4o be more creative and venture beyond the boundaries of its training dataset. I define:
Creature.front = dog.head
Creature.back = cat.head
Creature.tail = None
GPT-4o outputs a fantastical creature!
It’s a nice blend, smoothly done, amidst a beautiful forest backdrop.
But again, not quite what we’re aiming for.
So far, I feel disappointed I’m not having much luck with code prompting variations.
Time to switch gears and test a different approach…
Visual Example
I give ChatGPT-4o a visual example of the original Nickelodeon catdog:
I’m curious whether GPT-4o even supports in-context learning for images?
ChatGPT-4o replied it does not support in-context learning for images, but let’s try this approach anyway and see what we get.
Wow!
The first image is almost exactly what I want – except for that stubborn tail!
Is this a fluke, or some version of in-context learning?
Recall that GPT-4o said it does not officially support visual in-context learning.
I roll the dice.
I ask GPT-4o to regenerate the image:
Fail!
I ask GPT-4o again and again to regenerate the image, but I could never get another catdog.
Sadly, it was just a fluke!
External Tools
I’ve still got a few more techniques to try, so I remain optimistic. Cautiously optimistic.
Next, I ask ChatGPT-4o to paste a cat’s head onto a dog’s tail.
Unintentionally, this prompt triggers the use of external tools!
Below is the Python code GPT-4o returned to crop the cat’s head and paste it onto the dog’s tail.
Unfortunately, GPT-4o skipped step 1 (masking out the tail and half of the dog’s body) as well as step 3 (blending the cat’s head naturally with the dog’s body).
# Let's crop the cat's head and paste it onto the dog's tail.
# Image is needed for open() and the resampling filter (the original imported only ImageOps, which was unused).
from PIL import Image
# Load the inputs. These file names are hypothetical; in the original session both images were already loaded.
dog_image = Image.open("/mnt/data/dog.png")
cat_image = Image.open("/mnt/data/cat.png")
# Convert images to RGBA to handle transparency
dog_image = dog_image.convert("RGBA")
cat_image = cat_image.convert("RGBA")
# Define the bounding box for the cat's head (manually chosen for this example)
# This box might need to be adjusted based on the exact dimensions of the cat's head in the provided image
cat_head_box = (70, 30, 230, 190)  # (left, upper, right, lower)
# Crop the cat's head
cat_head = cat_image.crop(cat_head_box)
# Resize the cat's head to fit on the dog's tail area
# The size might need to be adjusted to match the proportion of the dog's tail
# Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10
cat_head = cat_head.resize((100, 100), Image.LANCZOS)
# Define the position to paste the cat's head on the dog's tail
# This position might need to be adjusted based on the exact location of the dog's tail in the provided image
tail_position = (220, 350)  # (left, upper)
# Paste the cat's head onto the dog's tail, using its own alpha channel as the mask
dog_image.paste(cat_head, tail_position, cat_head)
# Save and show the final image
output_path = "/mnt/data/dog_with_cat_tail_head.png"
dog_image.save(output_path)
# Display the output image
dog_image.show()
output_path
Umm…nice try, GPT-4o!
Let’s not do that again.
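For comparison, the skipped blending step could be approximated with a feathered (blurred) mask, so the pasted head fades into the body instead of showing a hard rectangular edge. The sketch below is an assumption about one simple way to do this with Pillow, not what GPT-4o was asked to produce. It uses synthetic solid-color images so it runs on its own, and `paste_with_feathered_mask`, the sizes, and the positions are all hypothetical. It covers the tail region with the patch and does not attempt the separate masking step.

```python
from PIL import Image, ImageDraw, ImageFilter


def paste_with_feathered_mask(base, patch, position, feather=8):
    """Paste `patch` onto a copy of `base` using a soft-edged elliptical mask.

    The blurred mask fades the patch's edges into the base image, a crude
    stand-in for the "blend naturally" step.
    """
    base = base.convert("RGBA")
    patch = patch.convert("RGBA")

    # Start from a fully transparent mask and draw an opaque ellipse inset
    # from the border, so the blur has room to fade out.
    mask = Image.new("L", patch.size, 0)
    inset = feather * 2
    ImageDraw.Draw(mask).ellipse(
        (inset, inset, patch.width - inset, patch.height - inset), fill=255
    )
    mask = mask.filter(ImageFilter.GaussianBlur(feather))

    result = base.copy()
    result.paste(patch, position, mask)
    return result


# Synthetic stand-ins for the real photos: a white "dog" canvas and a
# gray "cat head" patch. Sizes and positions are arbitrary assumptions.
dog = Image.new("RGB", (400, 300), "white")
cat_head = Image.new("RGB", (100, 100), (120, 130, 150))
out = paste_with_feathered_mask(dog, cat_head, (280, 100))
out.save("feathered_paste_demo.png")
```

On real photos the mask shape and feather radius would need tuning by eye, and a convincing result would likely still need matching of fur color and lighting.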
Meta-Prompting
Finally, I try meta-prompting and run each of the resulting prompts.
Meta-prompting is a technique where you ask LLMs to write effective prompts for you.
Anthropic recently released their LLM-powered Prompt Generator that writes detailed, effective prompts, automatically incorporating best practices in prompt engineering.
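A meta-prompt can be as simple as a template that asks the model for several candidate prompts. The sketch below is a hypothetical template, not the wording used in this experiment; `build_meta_prompt` and its phrasing are assumptions, and the goal text reuses the catdog description GPT-4o produced earlier.

```python
def build_meta_prompt(goal: str, n_prompts: int = 3) -> str:
    """Return a prompt that asks an LLM to write prompts for `goal`."""
    return (
        f"You are an expert prompt engineer. Write {n_prompts} distinct, "
        "detailed prompts for an image model that would achieve this goal:\n"
        f"{goal}\n"
        "Number each prompt and state every visual requirement explicitly."
    )


meta_prompt = build_meta_prompt(
    "A catdog: a conjoined creature with one half being a cat and the other "
    "half being a dog, with two heads located on opposite sides of its body."
)
print(meta_prompt)
```

Each prompt the model returns is then run on its own, ideally in a fresh session so earlier attempts do not leak into the result.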
Here are the 3 prompts from ChatGPT-4o:
Now I run each prompt...
Result of prompt #1
Scary!
This catdog has 3 eyes and a tail. I don’t think this even counts as 2-headed. It looks more like 2 heads merged into 1.
Result of prompt #2
Well, at least it doesn’t have 3 eyes! Similar to the previous run, the 2 heads are merged into 1, and I wonder if it’s due to the particular phrasing of the prompt:
“The creature should have two heads, one of a cat and one of a dog, seamlessly merged into a single body.”
I wonder if the phrase “seamlessly merged” is what’s causing the merging of heads.
Also notice how the image contains both “pointed ears” (cat head) and “floppy ears” (dog head), instead of pointed ears only for the cat head and floppy ears only for the dog head.
Result of prompt #3
Yay!!
A proper catdog, at last!
This is the best generated image so far.
The only hangup: GPT-4o still wholly ignores my input images.
Remember the fluffy white Samoyed and Russian Blue cat I started with? After these trials and tribulations, I hardly remember myself.
Did GPT-4o get lucky on this run, or is this a reliable consistent result?
I ran the exact same prompt 3 more times in new chat sessions:
0/3 catdogs
What a tease!
The probabilistic nature of LLMs gave me false hope, then crushed it.
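A little arithmetic shows why one lucky run says so little. Counting the original success plus the three reruns gives 1 catdog in 4 attempts. The sketch below turns that into a rough success rate and asks how often at least one success would show up in a batch of runs. It assumes runs are independent with a fixed success probability, which is a simplification, and four runs is far too few for a reliable estimate.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of runs that produced an acceptable image."""
    return sum(outcomes) / len(outcomes)


def prob_at_least_one(p: float, n: int) -> float:
    """Chance of one or more successes in n independent runs."""
    return 1.0 - (1.0 - p) ** n


# One success (the original run) followed by three failed reruns.
runs = [True, False, False, False]
p_hat = success_rate(runs)
print(f"estimated success rate: {p_hat:.2f}")
print(f"chance of at least one catdog in 3 more runs: {prob_at_least_one(p_hat, 3):.2f}")
```

Under these assumptions, three failures in a row are unsurprising even if the prompt works a quarter of the time.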
Conclusion
This experiment was humbling.
In my quest to create a catdog, combining a Samoyed dog and Russian Blue cat, I faced massive difficulty getting prompts to output exactly what I wanted.
Even with a direct and clear description, ChatGPT-4o often fails to generate a consistent image.
It’s very difficult to force exclusion of certain features (e.g. no tail), especially if those features are prevalent in GPT-4o’s training dataset.
Interestingly, I can trigger external tools depending on the prompt (e.g. Python image manipulation), whether I want it or not.
For this use case, code prompting was fun but didn’t go anywhere.
The biggest surprise:
Meta-prompting gave me a prompt that actually worked ONCE (although it ignored my input images)... but I couldn’t get GPT-4o to reproduce a similar result in the next 3 runs.
Consistency remains a formidable challenge with LLMs because of their statistical nature.
Bonus: Midjourney Tests
I wonder how Midjourney compares to multimodal GPT-4o… here are a few bonus images!
Prompt:
catdog with 2 heads one samoyed and one russian blue cat
Prompt:
catdog cartoon real dog real cat
Prompt:
conjoined cat and dog 2 heads no tail
Prompt:
conjoined russian blue cat and samoyed dog 2 heads no tail lovecraftian horror