Trends in LLMs - QLORA: Efficient Finetuning of Quantized LLMs
The QLORA paper introduces a method called QLORA that enables highly efficient finetuning of massive language models using 4-bit quantization and low-rank adapters. QLORA reduces the GPU memory required to finetune a 65 billion parameter language model from over 780GB down to just 48GB, with no measurable loss of performance compared to standard 16-bit finetuning. This is a game-changing breakthrough that allows researchers to finetune the largest publicly available language models on a single 48GB GPU for the first time, whereas previously it required a cluster of expensive high-end GPUs costing millions of dollars.
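A rough back-of-envelope check on those figures (a sketch, not the paper's exact accounting; the bytes-per-parameter breakdown and the adapter fraction are common assumptions, not numbers taken from the paper):

```python
# Rough memory estimate for finetuning a 65B-parameter model.
# Assumes standard 16-bit training with Adam: 2-byte weights, 2-byte gradients,
# and two 4-byte fp32 optimizer moments per parameter (activations ignored).
params = 65e9

full_16bit = params * (2 + 2 + 4 + 4)                  # ~780 GB
print(f"16-bit full finetuning: ~{full_16bit / 1e9:.0f} GB")

# QLORA: the frozen base model is stored in 4 bits (0.5 bytes per parameter);
# only the small LoRA adapters carry gradients and optimizer state.
qlora_base = params * 0.5                               # ~33 GB
adapter_params = 0.005 * params                         # illustrative adapter fraction
qlora_adapters = adapter_params * (2 + 2 + 4 + 4)       # ~4 GB
print(f"QLORA: ~{(qlora_base + qlora_adapters) / 1e9:.0f} GB plus activations")
```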
Leveraging QLORA's efficiency, the authors train a family of optimized chatbot models called Guanaco, ranging from 7 billion to 65 billion parameters. Their 65 billion parameter Guanaco model achieves state-of-the-art results, reaching 99.3% of the performance of the proprietary ChatGPT model on the challenging Vicuna benchmark. This is the closest any publicly released model has come to matching ChatGPT's performance. Even more impressively, the smaller 33 billion parameter Guanaco model slightly outperforms Vicuna, the previous best publicly available 13 billion parameter chatbot, while using substantially less memory thanks to 4-bit quantization.
The key innovations in QLORA that enable these results are: 4-bit quantization of a pretrained language model's weights to massively reduce memory usage; low-rank adapter layers that are finetuned in 16-bit floating point precision; and specialized techniques, namely double quantization of the quantization constants and paged optimizers, that further reduce memory requirements and absorb memory spikes during training.
A core component of QLORA is a novel 4-bit data type called NormalFloat (NF4) for quantizing language model weights, which is information-theoretically optimal for the normal distribution that pretrained neural network weights typically follow. Empirically, NF4 significantly outperforms regular 4-bit floating point across a variety of language models and scales on metrics like perplexity and downstream task performance. QLORA also employs an additional stage of quantization called double quantization, which further quantizes the first-stage quantization constants from 32 bits down to just 8 bits. This provides substantial additional memory savings without any loss in accuracy after finetuning.
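A small sketch of where the double-quantization savings come from, using the block sizes reported in the paper (64 weights per first-level quantization block, 256 first-level constants per second-level block); treat it as an illustration of the arithmetic rather than an exact implementation:

```python
# Per-parameter overhead of storing quantization constants.
block1 = 64    # weights per quantization block (one absmax constant each)
block2 = 256   # first-level constants per second-level block

# Without double quantization: one fp32 constant per 64 weights.
single = 32 / block1                          # 0.500 bits per parameter

# With double quantization: constants stored in 8 bits, plus one fp32
# constant per 256 of them for the second quantization level.
double = 8 / block1 + 32 / (block1 * block2)  # ~0.127 bits per parameter

saved_bits = single - double
print(f"overhead: {single:.3f} -> {double:.3f} bits/param")
print(f"saved on a 65B model: ~{saved_bits * 65e9 / 8 / 1e9:.1f} GB")  # ~3 GB
```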
To evaluate QLORA against traditional finetuning, the authors conducted extensive experiments comparing full 16-bit finetuning against 4-bit QLORA finetuning on benchmarks like GLUE, Super-Natural Instructions, and MMLU, across a range of model architectures and scales up to 3 billion parameters. They found that commonly used hyperparameters for low-rank adapter methods are significantly undertuned, and that applying adapters sparsely to just the attention layers is insufficient to match the performance of full 16-bit finetuning. Applying low-rank adapters to every layer is critical to achieve parity with 16-bit finetuning.
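To illustrate that finding, here is how the two settings might look with the Hugging Face peft library; the module names are those used by LLaMA-style models and the rank and scaling values are illustrative, so this is a sketch of the contrast rather than the authors' exact configuration:

```python
from peft import LoraConfig

# Adapters only on the attention projections: a common default that the paper
# finds insufficient to match full 16-bit finetuning.
attention_only = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# Adapters on every linear layer of every transformer block, which the paper
# reports is needed to reach parity with 16-bit finetuning.
all_layers = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```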
Across diverse models, dataset scales, and NLP tasks, QLORA reliably reproduces, within the margin of error, the results obtained by regular full-precision finetuning. This demonstrates for the first time that 4-bit quantization works not just for inference but also for effectively finetuning state-of-the-art natural language processing models.
The authors were able to train and evaluate large chatbot models at scales far beyond what is possible with regular full-precision finetuning. They scaled up models from 7 billion to 65 billion parameters and finetuned them on a diverse set of modern instruction tuning datasets. The trained models were evaluated on standard benchmarks like MMLU, Vicuna, and OA (OpenAssistant). Their experiments surface key findings about the interplay between model scale, dataset size, and real-world performance. Most strikingly, dataset quality and suitability for the end task matter vastly more than sheer dataset size in determining a model's capabilities. For instance, the 65 billion parameter Guanaco reached over 99% of ChatGPT's score on the Vicuna benchmark after training on just 9,000 high-quality examples from OASST1, while finetuning on orders of magnitude more data from less suitable datasets performed far worse. This shows that proper dataset selection and preparation matter more than simply using the largest available dataset.
The authors employ a thorough evaluation protocol to assess the conversational ability of the trained Guanaco chatbot models, using thousands of ratings from both crowd worker human annotators and judgments generated by the GPT-4 model. They find moderate agreement between human and AI judgments of model performance at the system level, but noticeably lower agreement at the individual example level. This reveals strengths and limitations of using models to automate evaluation, since human feedback remains the gold standard.
Using a tournament-style rating system based on head-to-head comparisons, with Elo ratings computed from the results, the 65 billion and 33 billion parameter Guanaco models achieved the highest scores and are expected to outperform ChatGPT around 30% of the time according to the aggregated human judgments. This represents a remarkable achievement given the limited compute resources available compared to industrial labs.
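As a reference for how such a tournament works, here is a minimal Elo update from pairwise judgments; the K-factor and starting rating are illustrative values, not necessarily the paper's exact settings:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one head-to-head comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: two models start at 1000; A beats B in one judged comparison.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)
print(a, b)  # 1016.0, 984.0
```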
The authors also conduct extensive qualitative analysis by interacting directly with the trained models. They identify salient strengths such as high factual recall of simple knowledge, refusing to accept blatant falsehoods, and sophisticated reasoning about beliefs and intentions known as theory of mind. But they also surface concerning weaknesses, like struggling with obscure facts, susceptibility to manipulation to reveal secrets, and making unsupported assumptions. Many findings highlight the lack of deep semantic understanding, nuanced reasoning, and complex inferential abilities despite impressive surface capabilities. This underscores the need for developing more robust and rigorous analysis techniques as well as creating challenges targeted at exposing models' fundamental limitations.
Nonetheless, by open-sourcing all code, data, and models for full reproducibility, rather than making only APIs available, this work significantly expands the frontier of what is possible in LLM research with modest academic computing resources. The techniques concretely democratize access to cutting-edge large language model training that previously required budgets of millions of dollars for clusters of top-tier GPUs.
By making it feasible to finetune these massive models on a single consumer-grade GPU costing a few thousand dollars rather than on a server farm, an immense array of novel applications becomes viable, including on-device finetuning to enable privacy-preserving conversational agents. At the same time, all the known risks and concerns around large language models remain in full force, and this study encourages much further analysis by the research community thanks to its unprecedented openness.
Under the hood, QLORA employs paged optimizers to gracefully handle the memory spikes that normally occur during gradient checkpointing, by transparently paging chunks of optimizer state between GPU and CPU memory as needed. This builds on NVIDIA's unified memory subsystem, which automatically copies pages on demand between CPU and GPU. Unified paging lets training proceed without crashing even when memory spikes momentarily exceed the physical GPU capacity. And importantly, it requires no changes to model code compared to regular optimizers.
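In practice, a paged optimizer can be swapped in for regular AdamW with a one-line change; here is a minimal sketch using the bitsandbytes library (the class name reflects that library's public API rather than the authors' training script, the linear layer is a stand-in for the trainable adapters, and a CUDA-capable bitsandbytes install is assumed):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024)  # stand-in for the trainable adapter parameters

# Drop-in replacement for AdamW: optimizer state is allocated in paged memory,
# so memory spikes (e.g. during gradient checkpointing) spill to CPU RAM
# instead of causing an out-of-memory crash.
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)
```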
At a high level, the QLORA method works as follows. First, the weights of a pretrained language model are quantized to 4 bits using the NormalFloat encoding. This produces a 4-bit base model that retains the full model architecture but with weights stored in a compact quantized format. Next, low-rank adapter layers are added throughout the base model, and these small adapters are kept in 16-bit floating point precision. The 4-bit base model weights remain frozen.
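The sketch below illustrates the idea of block-wise 4-bit quantization against a normal-distribution-shaped codebook. The codebook construction here is a simplification (the paper's exact NF4 recipe differs in details such as pinning an exact zero point), so treat it as a conceptual illustration:

```python
import numpy as np
from scipy.stats import norm

def normalfloat_codebook(bits: int = 4) -> np.ndarray:
    # Simplified: evenly spaced quantiles of a standard normal, rescaled to [-1, 1].
    n = 2 ** bits
    q = norm.ppf(np.linspace(0.5 / n, 1 - 0.5 / n, n))
    return q / np.abs(q).max()

def quantize_blockwise(w: np.ndarray, codebook: np.ndarray, block: int = 64):
    """Quantize a weight tensor in blocks of `block` values.

    Each block is scaled by its absolute maximum (one constant per block) and
    every value is snapped to the nearest codebook entry (a 4-bit index).
    """
    flat = w.reshape(-1, block)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    idx = np.abs(flat[:, :, None] / absmax[:, :, None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), absmax

def dequantize_blockwise(idx, absmax, codebook, shape):
    return (codebook[idx] * absmax).reshape(shape)

weights = np.random.randn(4, 64).astype(np.float32)
idx, absmax = quantize_blockwise(weights, normalfloat_codebook())
restored = dequantize_blockwise(idx, absmax, normalfloat_codebook(), weights.shape)
print(np.abs(weights - restored).max())  # small quantization error
```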
Then, finetuning is performed by backpropagating gradients through the frozen 4-bit quantized weights into the 16-bit adapter layers and updating only the adapters. This tunes the model for a downstream task while minimizing the memory footprint. Thanks to the efficient mix of 4-bit storage and 16-bit computation, the GPU memory needed is drastically reduced compared to finetuning the full model in 16-bit.
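A minimal PyTorch sketch of a LoRA-augmented linear layer makes this concrete: the base weight is frozen (in QLORA it would additionally be stored in 4-bit NF4 and dequantized on the fly), and only the two small adapter matrices receive gradients. This is a simplified illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 16, alpha: int = 16):
        super().__init__()
        # Frozen base weight; in QLORA it would be stored as 4-bit NF4 and
        # dequantized to 16-bit on the fly inside the forward pass.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank adapters, kept in higher precision.
        self.lora_a = nn.Linear(in_features, r, bias=False)
        self.lora_b = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gradients flow through the frozen base output into the adapters.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(1024, 1024, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable adapter parameters")  # 2 * 1024 * 16 = 32768
```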
Finally, the finetuned model can be deployed for inference by dequantizing the 4-bit base model weights to 16-bit on the fly and passing activations through the usual inference process. The adapter layers integrate seamlessly into the architecture without changes to the surrounding framework.
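Putting the pieces together, this recipe can be approximated today with the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below assumes those libraries and a CUDA GPU, and the model id is illustrative rather than the authors' exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # illustrative base model

# 1) Load the pretrained weights quantized to 4-bit NF4 with double quantization;
#    adapter math and activations run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2) Freeze the 4-bit base and attach 16-bit LoRA adapters to every linear layer.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
))
model.print_trainable_parameters()

# 3) Train with a standard loop or the transformers Trainer, updating only the
#    adapters (optionally with a paged optimizer such as optim="paged_adamw_32bit").
```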
Concretely, the QLORA method combines four key techniques: 4-bit NormalFloat (NF4) quantization of the frozen pretrained weights; double quantization of the quantization constants; paged optimizers to absorb memory spikes during training; and low-rank adapters applied to every layer and trained in 16-bit precision.
Together, these technologies enable QLORA to unlock unprecedented efficiency and scalability for finetuning state-of-the-art large language models. The reductions in memory footprint directly translate to lower costs, power consumption, and carbon emissions thanks to smaller GPU requirements.
Since QLORA fine-tuned models retain the same inference performance as full-precision models, end users receive no disadvantage. Only the training process is optimized. This means downstream applications can benefit from the impressive capabilities unlocked by finetuning massive models without requiring expensive hardware themselves.
Furthermore, by open sourcing high-performance generative models and the toolkit to train them, this work ameliorates the access and transparency issues that arise when only a private API is available. Wider access facilitates better auditing and reduces the concentration of power. Of course, risks remain, and openness alone is not a panacea, but combined with cooperation among stakeholders, progress seems attainable.
The contributions of this work can be summarized as follows: the QLORA finetuning method itself, built on NF4 quantization, double quantization, and paged optimizers; the Guanaco family of open chatbot models trained with it; evidence that dataset quality matters far more than dataset size for instruction tuning; a comparative analysis of human and GPT-4-based chatbot evaluation; and the full release of code, data, and models.
While this work represents a major advance, some limitations provide opportunities for future work: parity with 16-bit full finetuning was not verified at the very largest 33 billion and 65 billion parameter scales, agreement between human and model-based judges is imperfect, and the trained models still exhibit the factuality and reasoning weaknesses surfaced in the qualitative analysis.
In conclusion, QLORA represents a breakthrough method that overcomes the longstanding hurdle of efficiently finetuning the largest language models. It dismantles key bottlenecks by reducing memory requirements enough to run on a single GPU rather than server farms.
This work tangibly democratizes access to state-of-the-art natural language processing capabilities by making them attainable with modest academic resources instead of millions of dollars of equipment. The techniques directly facilitate novel applications like on-device finetuning for user privacy.
By open sourcing all aspects of this work, including data, code, and models, the authors exemplify ideals of ethics and transparency. This enables the broader community to build on these findings and interrogate the models in detail, far beyond what APIs permit.
Of course, recognizing the dual use nature of language AI, risks remain ever present. But progress on issues like alignment seems more viable when capabilities are not concentrated in closed organizations. There is still much work ahead, but by efficiently scaling up fine-tuning, QLORA meaningfully moves the needle towards enabling safe, widely beneficial language technologies.
In just the few years since models like GPT-3 ushered in the era of foundation models, capabilities have advanced at a frenetic pace thanks to massive datasets and scaling. QLORA provides a missing piece: it makes finetuning, not just inference with pretrained models, practical for fully leveraging these immense models. By reducing the resource requirements for state-of-the-art finetuning by over an order of magnitude, QLORA catalyzes a new phase where these powerful models become pervasively accessible. This has far-reaching implications for industries, academics, regulators, and users alike.
With crowdsourced data and cloud services, small players can now build conversational agents using the largest models. Startups can leverage pre-trained models more flexibly with tailored finetuning. Researchers can probe for fundamental weaknesses at scale. Watchdogs can audit for issues. And users may benefit from more agency over their experiences.
While far from the final word, QLORA tangibly moves the needle in the direction of democratization. There is no silver bullet, but combining openness, empiricism, ethics, and scientific discourse seems essential for progress. This work helps set the stage for the next era of language AI after initial breakthroughs in foundation models like GPT-3. There is still an immense amount we do not understand, but QLORA takes an important step toward making the path ahead more inclusive and transparent. What possibilities we unlock together through co-creation remains to be seen.