Translation Quality Evaluation Is All We Need
“The unpredictable abilities emerging from large AI models: Large language models like ChatGPT are now big enough that they’ve started to display startling, unpredictable behaviors.”
“Recent investigations have revealed that LLMs can produce hundreds of ‘emergent’ abilities — tasks that big models can complete that smaller models can’t.”
“New analyses suggest that for some tasks and some models, there’s a threshold of complexity beyond which the functionality of the model skyrockets.”
These March headlines (along with hundreds of others) took the AI hype to new heights. In AI/ML circles, all anyone would talk about was Artificial General Intelligence arriving literally tomorrow.
Lists of emergent abilities were compiled. Some people grew extremely excited, while others grew fearful. Petitions were drafted imploring the authorities to halt the training of neural models in order to prevent what was seen as inevitable harm. The clamor just wouldn’t cease, and it went on for several months.
It shouldn’t come as much of a surprise, then, that last week’s release of a paper entitled “Are Emergent Abilities of Large Language Models a Mirage?” (https://arxiv.org/pdf/2304.15004.pdf) by Stanford University researchers was greeted with far less attention and fanfare. The question in the title is, of course, a very polite way of decisively stating that the aforementioned emergent abilities really are a mirage. They do not exist.
In the dry language of academic papers, the article’s authors wrote: “Emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance”.
Oops…
“Changing from a nonlinear metric to a linear metric such as Token Edit Distance, scaling shows smooth, continuous and predictable improvements, ablating the emergent ability.”
Well, there it is. It seems the first thing ML/NLP researchers should be taught is to verify their metrics and benchmarks before coming out with hype-inducing announcements. Listening to professional linguists and practitioners of the translation craft would be another excellent piece of advice to follow.
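To see the paper’s point concretely, here is a toy numerical sketch (our own illustration with made-up numbers, not code from the paper): if per-token accuracy improves smoothly with model scale, a nonlinear metric such as exact match still appears to “jump” suddenly, while a linear metric such as token edit distance improves gradually.

```python
# Toy illustration (hypothetical numbers): smooth per-token gains look
# "emergent" under a nonlinear metric but gradual under a linear one.
import numpy as np

seq_len = 10                                   # tokens per target answer
model_sizes = np.logspace(8, 11, 8)            # hypothetical parameter counts
per_token_acc = np.linspace(0.50, 0.99, 8)     # smooth, made-up scaling curve

for size, p in zip(model_sizes, per_token_acc):
    exact_match = p ** seq_len                 # nonlinear: all tokens must be right
    expected_edits = seq_len * (1 - p)         # linear: expected token edits needed
    print(f"{size:9.1e} params | token acc {p:.2f} | "
          f"exact match {exact_match:.3f} | expected edits {expected_edits:.2f}")
```

Nothing about the underlying model changes between the two columns; only the metric does.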
Ever since the multi-head attention mechanism was invented, we have witnessed a constant pattern of AI/ML hype. If it had remained solely a media phenomenon, that would be one thing. Unfortunately, this hype interferes with the smooth production processes established in the localization industry, confusing service suppliers and customers alike.
If you have ever studied a crab closely, you will have noticed that one of its pincers is often much larger than the other: because the crab constantly acts with its dominant claw, the other one withers. Today, machine translation and NLP look very much like such a crab: almost anybody can take some data, feed it into some model, and get some sort of result. But when it comes to evaluating those results and measuring their quality, almost nobody seems to have that ability.
It’s time to learn how to measure the quality of MT output correctly!
The “classic” BLEU and ROUGE metrics are actually similarity metrics, not “quality” metrics, and our recent studies have demonstrated that a metric such as COMET is completely defunct in its standard implementation. To overcome these shortcomings, we have created our proprietary but publicly available LOGIPEM/HOPE metric (https://arxiv.org/pdf/2112.13833.pdf), which can be used to measure the quality of MT output and requires only one hour of a professional human linguist’s time. The evaluation can be done with our Perfectionist tool (https://perfectionist.app).
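To make the “similarity, not quality” point concrete, here is a minimal sketch using the sacrebleu package (the example sentences are our own, not data from any study): a perfectly adequate paraphrase scores poorly against a single reference simply because it shares few n-grams with it, while a disfluent near-copy scores high.

```python
# Illustrative only: BLEU rewards n-gram overlap with the reference,
# so surface similarity is scored, not translation quality.
import sacrebleu

reference = ["The patient should take the medication twice a day."]

candidates = {
    "adequate paraphrase": "Patients need to take this medicine two times daily.",
    "disfluent near-copy": "The patient should take take the medication twice a day.",
}

for label, hypothesis in candidates.items():
    result = sacrebleu.sentence_bleu(hypothesis, reference)
    print(f"{label:20s} BLEU = {result.score:.1f}")
```

A human linguist would rank these two outputs the other way around, which is exactly why similarity scores alone cannot stand in for quality evaluation.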
Truth be told, the localization industry really shouldn’t skimp on that one hour: different MT outputs should be compared before choosing an engine or making other important decisions, such as whether custom training of an LLM is economically feasible.
Proper LQA/TQE (language quality assurance / translation quality evaluation) is really all we need to verify the claims of any upcoming breakthrough.
And yes, the room-temperature superconductor does not exist either; it was yet another unverified claim. Sorry to burst your bubble, but wouldn’t it be nice if claims were verified and tested by professionals before being announced? That’s what we are doing at our R&D Lab.
Follow our newsletter and research to save time and find efficiencies that work!
We take inspiration from real-life use cases and always test the facts. You are welcome to share your use case and data to participate in our R&D effort. Book a conference call to discuss R&D collaboration.
Comment from a Localization Program Manager | Localization Technology Expert (1 year ago): People often forget that just because something is said on TV doesn't make it true. :) With technology, you must always test it to see how it actually works.