“Roast Me a Stork”: Idioms, Understanding, and Large Language Models
Screenshot from ChatGPT

Yesterday I ran a test using GPT 3.5 Turbo and GPT 4 that ended up exposing some insights into how these models work. I had set out to test how well ChatGPT would do at evaluating translations and explaining its reasoning for assigning a 0 to 10 score to a translation, using the following minimal instruction. It is not the most sophisticated instruction, but I wanted to start simple and build up to a good prompt.

A simple prompt to turn GPT 3.5 Turbo into a translation evaluation machine.
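
For anyone who wants to try something similar outside the ChatGPT interface, here is a rough sketch of what such an evaluation request might look like through the OpenAI Python SDK. The instruction wording, the evaluate_translation helper, and the 0-to-10 format are illustrative reconstructions, not the exact prompt from my test.

    # A minimal sketch (not the exact prompt used in these tests) of sending a
    # translation-evaluation request through the OpenAI Python SDK.
    from openai import OpenAI

    client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

    SYSTEM_PROMPT = (
        "You are a translation quality evaluator. Rate the translation of the "
        "source text on a scale from 0 to 10 and explain your reasoning."
    )

    def evaluate_translation(source: str, translation: str, model: str = "gpt-3.5-turbo") -> str:
        # Hypothetical helper: sends the source and candidate translation and
        # returns the model's free-text assessment.
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": f"Source (German): {source}\nTranslation (English): {translation}",
                },
            ],
        )
        return response.choices[0].message.content

    print(evaluate_translation("Brat mir einer einen Storch.", "Somebody fry me up a stork."))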

After a few tests with relatively simple texts that gave reasonably plausible results, I switched to a favorite German idiom, “Brat mir einer einen Storch.” This is a thoroughly opaque idiom that literally means “Somebody roast/fry me a stork” and is used to express extreme surprise.

In my first prompt, I forgot that the online ChatGPT interface doesn’t handle line breaks in prompts well (pressing Enter sends the message), so I ended up giving it only the German source as a prompt.

GPT 3.5 Turbo tried to be helpful by supplying its own (bad) translation

Here it surprised me by supplying its own translation into English: “Somebody bring me a stork.” I would disagree with its rating of 5/10: if the translation is intended to convey the sense, it misses the target completely, and if it is intended to render the words literally, it has turned the imperative verb Brat (‘roast’ or ‘fry’) into ‘bring’. Either way, 5/10 is not justifiable. On top of that, the reasoning it gives is complete nonsense and the final line is simply wrong.

So I went back and edited the prompt and supplied a stupidly literal target translation. That resulted in the following:

Testing a literal translation generated bizarre results

Here things start to get weird. Although it now says that Brat can be translated as ‘fry,’ it cannot handle the English verbal particle up at all, so it chunks the text incorrectly and declares the translation wrong. This translation is not a good one, except perhaps as a gloss of the individual words in the idiom, but it is definitely better than the previous one, which was simply wrong. Yet GPT 3.5 Turbo ranks it much lower.

For my third trial, I used an equivalent English idiom, “Well, I’ll be a monkey’s uncle.” Like the German source, this is not transparent in its meaning at all.

GPT 3.5 Turbo rates the correct translation as worst of all

Pretty much everything in this assessment is wrong, but the ways it goes wrong are instructive:

  1. We see a priming effect. Supplying “Well, I’ll be a monkey’s uncle” as the translation has nudged GPT 3.5 Turbo into the right conceptual domain, and it now supplies a correct meaning for the idiom. The prompt has activated parameters in the model that finally bring it into some coherence with the German text. I could see the equivalent happening with a person who had heard the idiom once, forgotten its meaning, and then remembered it upon seeing the English.
  2. It doesn’t really understand the idioms – or what it itself is saying – at all. Note how it explains the meaning of the English idiom as almost identical to that of the German one, yet also says that it “does not capture the meaning or essence of the source text at all” and “does not align with the original German idiom.” The other two points basically reiterate this claim. The response is therefore incoherent: it contradicts itself.
  3. The “cultural context” point is utter nonsense and – in this and other responses – usually irrelevant. Examples of translation evaluations in Common Crawl (a major component of the training data for GPT 3.5 Turbo and GPT 4) presumably emphasize cultural context, which makes sense for human evaluators, but LLMs don’t have cultural experience, so the assessments they produce on this front are just weird.

Because of the weirdness, I asked some follow-up questions and got this:

Asking questions improves the results

When I questioned the cultural-context point, it discounted that as a factor but reiterated the nonsense about the meaning being off. When asked about the meaning, it backtracked there too. Even then, it docks points because it claims the translation is not a “direct equivalent.”

I asked it for a direct equivalent (not shown above). It suggested that “Would you look at that” would be “a bit closer in meaning to the original idiom while maintaining clarity in English” and said it is better because “this phrase is less idiomatic and more straightforward.” I’m not actually sure that “Would you look at that” is any less idiomatic or any more transparent than “Well, I’ll be a monkey’s uncle,” and I find its meaning rather weaker.

Can we learn any lessons?

I found that using ChatGPT in this way – to evaluate translations – was pretty useless, at least for anything creative. It tended to work better for very literal text, but the scores it assigned were basically meaningless. Perhaps it would do better with a prompt that spelled out the grounds for evaluation, but I am not sure. Here are a few takeaways:

  • LLMs don’t understand texts. I know a lot of researchers will disagree, but the bit about “me up a stork” in the second interaction makes it clear that, whatever an LLM is doing, it isn’t parsing the text the way a native speaker of English would. And the weird insistence in the third interaction that the two idioms are completely different in meaning, even though it defines them as meaning the same thing, shows that it was producing text without any recourse to what we might call “meaning.” The output was simply incoherent.
  • LLMs do not think things through. When the late and unlamented Galactica model came out, its creators claimed that it could reason and think. GPT 3.5 Turbo is of the same vintage, and if this is thinking and reasoning, it is doing a poor job of it. The explanations don’t make sense, and the model has no notion of what is relevant or irrelevant in what it says. For instance, I cannot imagine a semi-competent human evaluator making the claims about cultural relevance, or arguing that two idioms are completely unrelated right after correctly explaining both of them in very similar terms.
  • LLMs are unreliable. If I were using GPT 3.5 Turbo as a tool for translation evaluation – at least with this prompt history – I would have to toss it out, because the numbers and reasoning it produced are completely useless. Although I didn’t share my other trials, I found that the scores varied wildly from run to run, even for rather literal, informative text (see the sketch after this list for one way to measure that spread). They weren’t as bad as for this idiom, but they were also not reliable enough for decision making. Could they be improved with better prompts? Maybe, but I don’t think they could ever be fully reliable, because they depend on an understanding that the models don’t have. I could as soon ask an aardvark for a reliable translation quality evaluation…
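
To put a number on that variability, one could re-run the same evaluation several times through the API and look at the spread of scores. The sketch below is a rough illustration only: it reuses the hypothetical evaluate_translation helper from the earlier sketch and a naive regular expression for pulling the rating out of the reply.

    import re
    import statistics

    def extract_score(reply: str) -> float | None:
        # Naive heuristic: look for a pattern like "7/10" in the model's reply.
        match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", reply)
        return float(match.group(1)) if match else None

    scores = []
    for _ in range(10):
        reply = evaluate_translation(
            "Brat mir einer einen Storch.",
            "Well, I'll be a monkey's uncle.",
        )
        score = extract_score(reply)
        if score is not None:
            scores.append(score)

    if scores:
        print("scores:", scores)
        print("mean:", statistics.mean(scores), "stdev:", statistics.pstdev(scores))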

But what about GPT 4?

All the trials above were with GPT 3.5 Turbo. I found that GPT 4 does perform better on these tasks, but it still has problems. For instance, it rated “Somebody fry me up a stork” a 7/10, which is the same rating I eventually ended up with for the “monkey’s uncle” idiom in GPT 3.5 Turbo. That is clearly wrong: the translation is either a 0/10 (because it doesn’t convey the idiomatic intent at all) or a 10/10 (because it conveys the literal meaning), but it isn’t somewhere in between.

It did rate the “monkey’s uncle” translation as a 9/10, which is probably reasonable, but I still wouldn’t trust it for anything where reliability is important. Maybe GPT-type foundation models will get there, but I think we are seeing the limits of this approach already.


In any event, I hope this exploration of an obscure German idiom gives you something to think about as we learn more about LLMs and their strengths and weaknesses.


Konstantin Savenkov

CEO @ Intento - AI agents for enterprise localization.

11 months ago

Arle Lommel, you asked someone to leave the "you prompted it wrong" comment. Let me fill this gap :-) It's extremely important to provide all the context in the prompt and make no assumptions – even more so than when working with humans :-) You make two assumptions here: (1) that the AI understands translation and translation quality the same way you do, and (2) that the model knows which specific English locale you mean. When prompted without those two assumptions, GPT-4 provides a proper answer (see the screenshot from our tool). Please note, however, that in ChatGPT the default temperature is quite high, so I'd recommend testing such things somewhere where you can control the temperature.
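
As a rough illustration of that last point, the API (unlike the ChatGPT web interface) lets you pin the sampling temperature and name the target locale explicitly; the fragment below is a sketch only and reuses the hypothetical client and SYSTEM_PROMPT from the earlier sketches.

    # Illustrative only: fix the temperature and state the target locale.
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # lower temperature makes outputs more repeatable
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": "Source (German): Brat mir einer einen Storch.\n"
                           "Translation (US English): Well, I'll be a monkey's uncle.",
            },
        ],
    )
    print(response.choices[0].message.content)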

Camilla Clark

AI Researcher and UX/HCI Designer | Senior Informatics Student at Indiana University

11 months ago

This is such an interesting use of ChatGPT. I'm interested in how AI developers might go about improving performance on tasks like these since, as you mentioned, LLMs don't really have cultural experience to draw from in the same sense that human translators do.

Alycia Meyers

Senior Technical Writer at CU*Answers

12 months ago

Interesting. Thanks for sharing.

Marina Ilari, CT – LocWorld

Video Game Localization & LQA | Women in Games Ambassador | CEO @ Terra

12 months ago

Very interesting assessment. Thank you for sharing, Arle Lommel!

Lisa Trent

Translation Project Manager; Localization Project Coordinator; IT Project Manager & Information Architect; Policy Specialist

12 months ago

Very interesting article - thank you for posting!
