“Roast Me a Stork”: Idioms, Understanding, and Large Language Models
Yesterday I ran a test using GPT-3.5 Turbo and GPT-4 that ended up exposing some insights into how these models work. I had set out to test how well ChatGPT would do at evaluating translations and explaining its reasoning for assigning a translation a score from 0 to 10, using the following minimal instruction. It’s not the most sophisticated instruction, but I wanted to build up to a good prompt.
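For anyone who wants to reproduce this outside the ChatGPT web interface, here is a minimal sketch using the OpenAI Python client. The instruction wording only approximates my prompt, and the sentence pair is a stand-in for the simple test items, so treat all of it as illustrative.

```python
# Minimal sketch: ask a GPT model to score a translation from 0 to 10.
# The instruction wording and the example sentence pair are
# illustrative, not verbatim from the trials described here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Evaluate the following translation from German to English. "
    "Assign a score from 0 to 10 and explain your reasoning.\n\n"
    "Source: Das Haus ist sehr alt.\n"
    "Translation: The house is very old."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```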
After a few tests using relatively simple things that gave reasonably plausible results, I switched to a favorite German idiom, “Brat mir einer einen Storch.” This is a really opaque idiom that literally means “Somebody roast/fry me a stork.” It is used to express extreme surprise.
For my first attempt, I forgot that the online ChatGPT interface doesn’t handle line breaks in prompts well, so I ended up giving it only the source text as the prompt.
Here it surprised me by supplying its own translation into English: “Somebody bring me a stork.” I would disagree with its rating of 5/10: if the output is meant as a translation of the sense, it misses the target completely, and if it is meant as a literal rendering of the words, it has turned the imperative verb Brat (‘roast’ or ‘fry’) into ‘bring.’ Either way, 5/10 is not justifiable. Worse, the reasoning it gives is complete nonsense, and the final line is simply wrong.
So I went back and edited the prompt and supplied a stupidly literal target translation. That resulted in the following:
Here things start to get weird. First, although it now says that Brat can be translated as ‘fry,’ it can’t handle the English verbal particle up at all, so it chunks the text incorrectly and declares the translation wrong. This translation is not a good one, except perhaps as a gloss of the words in the idiom, but it is definitely better than the previous one, which was simply wrong. Yet GPT-3.5 Turbo ranks it much lower.
For my third trial, I used an equivalent English idiom, “Well, I’ll be a monkey’s uncle.” Like the German source, it is not at all transparent in its meaning.
Pretty much everything in this assessment is wrong, but the ways it goes wrong are instructive:
Because of this weirdness, I asked some follow-up questions and got this:
When I questioned the cultural-context claim, it discounted that as a factor but reiterated the nonsense about the meaning being off. When asked about the meaning, it backtracked there too. Even then, it still docked points on the grounds that the translation is not a “direct equivalent.”
I asked it for a direct equivalent (not shown above), and it suggested that “Would you look at that” would be “a bit closer in meaning to the original idiom while maintaining clarity in English,” claiming it is better because “this phrase is less idiomatic and more straightforward.” I’m not convinced that “Would you look at that” is any less idiomatic, or any more transparent in its meaning, than “Well, I’ll be a monkey’s uncle,” and I find it rather weaker in its meaning.
Can we learn any lessons?
I found that using ChatGPT in this way – to evaluate translations – was pretty useless, at least for anything creative. It tended to work better for very literal text, but the scores it assigned were basically meaningless. One takeaway is that it might do better with a prompt that spelled out the grounds for evaluation, though I am not sure; a rough sketch of what such a prompt could look like follows.
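On the “better prompt” point, the sketch below spells out possible grounds for evaluation. The rubric dimensions, the locale note, and all the wording are my own illustration; none of the trials above used this prompt, so treat it as a starting point rather than a tested recipe.

```python
# A hypothetical rubric-style evaluation prompt. Everything here is
# illustrative; the trials described above did not use this wording.
rubric_prompt = """You are evaluating a translation from German into US English.

Score each dimension from 0 to 10 and justify each score:
1. Meaning: does the translation convey the sense of the source,
   including any idiomatic or figurative meaning?
2. Register: does it match the tone and formality of the source?
3. Fluency: does it read as natural English?

If the source is an idiom, first state whether the translation is a
literal rendering or an idiomatic equivalent, then score Meaning on
that basis. Do not average a literal and an idiomatic reading into a
middle score.

Source: Brat mir einer einen Storch.
Translation: Well, I'll be a monkey's uncle.
"""
```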
But what about GPT-4?
All the trials above were with GPT-3.5 Turbo. I found that GPT-4 does perform better on these tasks, but it still has problems. For instance, it rated “Somebody fry me up a stork” a 7/10, the same rating I ended up with for the “monkey’s uncle” idiom in GPT-3.5 Turbo. I think that’s clearly wrong: it’s either a 0/10 (because it doesn’t convey the idiomatic intent at all) or a 10/10 (because it conveys the literal meaning), but it isn’t somewhere in between.
It did rate the “monkey’s uncle” translation as a 9/10, which is probably reasonable, but I still wouldn’t trust it for anything where reliability is important. Maybe GPT-type foundation models will get there, but I think we are seeing the limits of this approach already.
In any event, I hope this exploration of an obscure German idiom gives you something to think about as we learn more about LLMs and their strengths and weaknesses.
Comments

CEO @ Intento - AI agents for enterprise localization · 11 months ago
Arle Lommel, you asked for someone to leave the “you prompted it wrong” comment. Let me fill this gap :-) It’s extremely important to provide all the context in the prompt and make no assumptions – even more so than when working with humans :-) You make two assumptions here: (1) that the AI understands translation and translation quality in the same way you do, and (2) that the model knows which specific English locale you mean. When prompted without those two assumptions, GPT-4 provides a proper answer (see the screenshot from our tool). Please note, however, that ChatGPT’s default temperature is quite high, so I’d recommend testing such things somewhere you can control the temperature.
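For reference, the API (unlike the ChatGPT web UI) does let you pin the temperature; a minimal sketch, assuming the OpenAI Python client, with the model name and prompt as placeholders:

```python
# Sketch of the commenter's suggestion: call the model through the API,
# where temperature can be set explicitly. temperature=0 makes scoring
# runs close to deterministic, which matters when comparing ratings
# across trials. (The ChatGPT web UI exposes no temperature control.)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Rate this translation from 0 to 10: ..."}],
    temperature=0,
)
print(response.choices[0].message.content)
```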
AI Researcher and UX/HCI Designer | Senior Informatics Student at Indiana University · 11 months ago
This is such an interesting use of ChatGPT. I’m interested in how AI developers might go about improving performance on tasks like these, since, as you mentioned, LLMs don’t really have cultural experience to draw from in the same sense that human translators do.
Senior Technical Writer at CU*Answers · 12 months ago
Interesting. Thanks for sharing.
Video Game Localization & LQA | Women in Games Ambassador | CEO @ Terra · 12 months ago
Very interesting assessment. Thank you for sharing, Arle Lommel!
Translation Project Manager; Localization Project Coordinator; IT Project Manager & Information Architect; Policy Specialist · 12 months ago
Very interesting article - thank you for posting!