“The False Promise of Imitating Proprietary LLMs” — A Provectus Perspective
The AI/ML market is incredibly dynamic, making it challenging to keep up. At Provectus, we strive to stay abreast of the latest news and updates, delve into recent papers, and share the essence of our findings with the community.
In this installment of the "Provectus AI Review" series, we share our perspective on a paper titled The False Promise of Imitating Proprietary LLMs, authored by Arnav Gudibande et al. at UC Berkeley.
Abstract
To start, let's examine the abstract of the paper.
In the paper, the authors evaluate the approach of finetuning weaker language models on outputs from stronger proprietary models like ChatGPT, to imitate their capabilities in a cost-efficient manner. They conducted an experiment where they finetuned Language Models (LMs) to imitate ChatGPT using varying base model sizes, data sources, and imitation data amounts. These models were evaluated using crowd raters and canonical NLP benchmarks.
The initial results were promising, with the imitation models appearing to follow instructions well and their outputs rated as competitive with ChatGPT by crowd workers. However, more targeted automatic evaluations revealed that the imitation models did not significantly close the gap between the base LM and ChatGPT on tasks that were not heavily supported in the imitation data. The authors found that the imitation models were good at mimicking ChatGPT's style, but not its factuality.
The authors concluded that model imitation is a false promise. There is a substantial capability gap between open and closed LMs that cannot be bridged using current methods without an unwieldy amount of imitation data or more capable base LMs. Therefore, they argue that the most effective way to improve open-source models is to focus on developing better base LMs, rather than attempting to imitate proprietary systems.
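Before moving to the takeaways, here is a minimal sketch, under our own assumptions, of what the imitation setup described above looks like in practice: a small open base model is fine-tuned on instruction/response pairs previously sampled from a stronger proprietary model. The checkpoint name, data file, and hyperparameters are illustrative, not the authors' exact configuration.

```python
# Minimal imitation fine-tuning sketch (illustrative, not the paper's exact setup).
# Assumes imitation_data.jsonl with {"instruction": ..., "response": ...} records
# that were previously sampled from a stronger proprietary model.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "openlm-research/open_llama_7b"  # any open decoder-only base LM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # allow padding
model = AutoModelForCausalLM.from_pretrained(base_model)

raw = load_dataset("json", data_files="imitation_data.jsonl", split="train")

def tokenize(example):
    # One training sequence = instruction followed by the imitation response.
    text = f"### Instruction:\n{example['instruction']}\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = raw.map(tokenize, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imitation-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```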
Major Takeaways and Provectus Perspective
In this section, we will delve into some of the points raised in the paper and share our thoughts on potential solutions.
#1
We consider decoder-only models ranging in size from 1.5B to 13B parameters: GPT-2 1.5B (Radford et al., 2019), LLaMA 7B (Touvron et al., 2023), and LLaMA 13B.
In practical terms, this suggests that alternative architectures, such as encoder-decoder models, should also be explored and evaluated. For instance, we could consider models like Google's UL2 or T5. It would also be beneficial to provide some budget estimates for these potential explorations.
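As an illustration, plugging an encoder-decoder model into the same experiment mostly changes how the training pairs are fed to the model: the instruction goes to the encoder and the imitation response becomes the decoder target. The sketch below uses a public T5 checkpoint; the model name and example strings are our assumptions.

```python
# Sketch of using an encoder-decoder model (T5-style) instead of a decoder-only LM.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # "google/ul2" is the larger encoder-decoder option
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The instruction is the encoder input; the imitation response is the decoder target.
inputs = tokenizer("Explain what model imitation is.", return_tensors="pt")
labels = tokenizer(
    "Model imitation means fine-tuning a weaker model on outputs of a stronger one.",
    return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # loss for a single training example
print(float(loss))
```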
#2
For human evaluation, we conduct blind pairwise output comparisons using Mechanical Turk. In our UI, we present each rater with a task instruction and the output of two unknown models, one of which is ChatGPT and the other is one of our imitation models.
Every significant project requires a tool for human assessment that allows for a seamless switch between a private workforce and a public crowd. Many open-source solutions that do exactly this already exist. For example, the recently released Ray Aviary could prove to be a useful tool in this context.
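To make the evaluation setup concrete, here is a small sketch of how blind pairwise tasks like the ones described in the paper could be prepared before being sent to raters, whatever the underlying platform. The field names and data layout are our own assumptions.

```python
# Prepare blind pairwise comparison tasks: two anonymized outputs in random order,
# with the model identities kept server-side for later aggregation.
import json
import random

def build_pairwise_tasks(examples):
    """examples: list of dicts with 'instruction', 'chatgpt', and 'imitation' keys."""
    tasks = []
    for i, ex in enumerate(examples):
        outputs = [("chatgpt", ex["chatgpt"]), ("imitation", ex["imitation"])]
        random.shuffle(outputs)  # the rater must not know which model produced which output
        tasks.append({
            "task_id": i,
            "instruction": ex["instruction"],
            "output_a": outputs[0][1],
            "output_b": outputs[1][1],
            "key": {"a": outputs[0][0], "b": outputs[1][0]},  # never shown to the rater
        })
    return tasks

if __name__ == "__main__":
    demo = [{"instruction": "Summarize the paper in one sentence.",
             "chatgpt": "Imitation models copy style, not factuality.",
             "imitation": "The paper says imitation works great."}]
    print(json.dumps(build_pairwise_tasks(demo), indent=2))
```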
#3
Training local imitation models is far more successful… This demonstrates that it is far more feasible to distill a specific behavior from ChatGPT as opposed to broadly matching its capabilities.
In practical terms, this implies that the approach can be utilized for a broad spectrum of specific tasks that extend beyond mere replication of ChatGPT.
For example, fine-tuning OpenAI models for English-to-Cypher translation can be quite costly. A well-tuned open-source LLM, trained with the imitation approach, could be a more feasible option for continuously improving on this translation task without incurring those fine-tuning costs; a sketch of the idea follows below.
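Here is a minimal sketch of that idea, assuming the OpenAI Python client (v1+) and an invented movie-graph schema: the stronger model is queried once to produce English-to-Cypher pairs, which then become task-specific imitation data for an open model.

```python
# Collect task-specific imitation data (English-to-Cypher) from a stronger model.
# The schema, prompt, and question list are illustrative assumptions.
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()
questions = ["Which actors appeared in more than five movies?"]

pairs = []
for q in questions:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the question into a Cypher query for a movie graph "
                        "with (:Person)-[:ACTED_IN]->(:Movie) nodes and relationships."},
            {"role": "user", "content": q},
        ],
    )
    pairs.append({"instruction": q, "response": completion.choices[0].message.content})

# These pairs can then feed the same fine-tuning loop shown earlier,
# scoped to one narrow behavior instead of ChatGPT's full breadth.
```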
#4
An open problem is whether these performance regressions can be mitigated using regularization or by mixing in pre-training data during fine-tuning.
Here is an interesting observation: the authors highlight an issue that arises when a model is fine-tuned on conversational-style data but then evaluated on benchmarks of a different nature, where its performance can regress.
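One way to experiment with the mitigation the authors leave open is to mix a slice of generic pre-training text back into the imitation fine-tuning set. The sketch below uses Hugging Face datasets; the corpora and the 90/10 ratio are our assumptions, not values from the paper.

```python
# Mix imitation data with generic pre-training text to reduce benchmark regressions.
from datasets import interleave_datasets, load_dataset

imitation = load_dataset("json", data_files="imitation_data.jsonl", split="train")
pretrain = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Normalize both datasets to a single "text" column so they can be interleaved.
imitation = imitation.map(
    lambda ex: {"text": ex["instruction"] + "\n" + ex["response"]},
    remove_columns=imitation.column_names)
pretrain = pretrain.remove_columns([c for c in pretrain.column_names if c != "text"])

# Roughly 90% imitation examples, 10% pre-training text (ratio is an assumption).
mixed = interleave_datasets([imitation, pretrain], probabilities=[0.9, 0.1], seed=42)
```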
#5
However, it also may be possible to use the target model to perform RLHF or constitutional AI… to further improve results. Lastly, we only considered relatively simple methods for collecting imitation data, however, there may be more advanced methods (e.g., active learning) that may improve the effectiveness or efficiency of model imitation.
Here is another noteworthy point: the authors highlight that despite certain constraints, there are still lots of methods to explore.
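As one concrete example of what a more advanced collection method could look like, the sketch below ranks candidate prompts by the student model's own uncertainty (approximated by its token-level loss) and sends only the most surprising ones to the expensive teacher. This is our illustration, not a method from the paper.

```python
# Active-learning-style selection: query the teacher only on prompts the student
# finds most surprising (highest average negative log-likelihood).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")

def student_uncertainty(prompt: str) -> float:
    """Average per-token loss of the prompt under the student model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        return student(ids, labels=ids).loss.item()

candidates = [
    "Prove that the square root of 2 is irrational.",
    "Write a haiku about autumn.",
    "Explain CRISPR gene editing to a 10-year-old.",
]

# Send only the top-k most uncertain prompts to the proprietary teacher model.
top_k = sorted(candidates, key=student_uncertainty, reverse=True)[:2]
print(top_k)
```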
Note: OpenAI's terms of use prohibit using their models' outputs to develop competing models, which technically renders an LLM fine-tuned on such a sampled dataset unusable for commercial purposes.
Conclusion
In The False Promise of Imitating Proprietary LLMs, the authors evaluate the effectiveness of model imitation as a means of enhancing open-source Language Models (LMs). The findings suggest that businesses can gain a competitive edge by pre-training powerful base models, but also that one group can mimic another's model if their base LMs are equally competent. The study raises questions about the future of human evaluation, about how to improve open-source LMs, and about the ethical and legal issues surrounding the use of proprietary models by the open-source community.
The Provectus perspective on the paper is clear: it is definitely worth a read. Publications like this are always extremely valuable, as they go through a thorough evaluation process. However, there are still many opportunities and gaps that can and should be investigated in and around this topic.
Authors:
Rinat Gareev — Senior Solutions Architect || Provectus
Marlon Cajamarca Vega — ML Engineer & AI Educator || Provectus
Moving Forward — Learn more about Provectus AI expertise