Introducing "GaLore-v2", a.k.a. Q-GaLore: The Latest Milestone in Low-Rank LLM Training
We are incredibly thrilled to introduce Q-GaLore (or "GaLore v2"), a major upgrade to our well-received GaLore algorithm introduced in February 2024.
TL;DR
Further Pointers
The Challenge
Large Language Models (LLMs) are incredibly powerful, but training or even fine-tuning these models typically requires resources beyond the reach of most individuals and research groups. High-end GPUs are a necessity, but can consumer-grade GPUs rise to the challenge?
The Insight
It’s widely acknowledged that LLM training demands vast memory and communication capacity: the datasets are massive, and the model parameters, gradients, and optimizer states all require huge amounts of storage.
While H/A100 GPUs are standard for data centers, consumer-grade GPUs like the NVIDIA RTX 4090, often found in high-end desktops/laptops, offer a tantalizing possibility: democratizing LLM training for everyone!
As highlighted by Naddod, “… the biggest difference between H/A100 and 4090 lies in memory and communication, not in computation power!”
Is there any chance to save us from hitting the GPU “memory wall”? Our answer: large model, but small gradients! By exploiting the inherently low rank of the gradients…
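To make that intuition concrete, here is a tiny PyTorch sketch (the layer size, batch, toy loss, and rank below are made up purely for illustration) that measures how much of a gradient matrix's energy lives in its top singular directions:

```python
# Illustrative only: how "low-rank" is a gradient matrix?
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(1024, 1024, bias=False)   # stand-in for an LLM weight matrix
x = torch.randn(16, 1024)
loss = layer(x).pow(2).mean()                     # toy loss, just to produce a gradient
loss.backward()

G = layer.weight.grad                             # full gradient, shape (1024, 1024)
S = torch.linalg.svdvals(G)                       # singular values, descending
r = 64
energy = (S[:r] ** 2).sum() / (S ** 2).sum()
print(f"top-{r} of {len(S)} singular values capture {energy:.1%} of the gradient energy")
```

In this toy example the gradient is exactly low-rank because the batch is tiny; in real LLM training the gradients are approximately low-rank, and that is precisely the structure GaLore exploits.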
Introducing GaLore
One of the most prominent low-rank methods for LLMs is LoRA, which introduces low-rank weight adapters for each layer and reduces the memory footprint by optimizing only the adapters. However, LoRA supports fine-tuning but not pre-training, and it is an ad-hoc solution that changes the original optimization problem.
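As a quick reminder of how LoRA works, here is a bare-bones sketch of an adapter layer (illustrative only; real implementations such as Hugging Face's peft add dropout, weight merging, per-module targeting, and more):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank adapter B @ A."""
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)               # frozen base weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_features, rank))        # trainable, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W + scale * B @ A, but only A and B receive gradients.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)
```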
In contrast to LoRA, GaLore reduces memory by projecting the optimizer states and gradients into a lower-dimensional subspace obtained via Singular Value Decomposition (SVD). GaLore made significant strides by keeping full-parameter learning intact while drastically shrinking the optimizer memory, enabling, for the first time, pre-training of a 7B LLM on a single consumer GPU with 24GB of VRAM, such as the RTX 4090.
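For intuition, here is a highly simplified sketch of the idea (not the official galore-torch implementation; the projection schedule and the bias-correction-free Adam update are simplified for brevity):

```python
import torch

def galore_like_step(weight, grad, state, rank=128, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8, update_proj_every=200):
    """One optimizer step with the gradient projected into a low-rank subspace.

    `weight` is assumed to be a plain 2-D tensor (wrap the call in torch.no_grad()
    if applying it to an nn.Parameter). Adam bias correction is omitted for brevity.
    """
    if state.get("step", 0) % update_proj_every == 0:
        # Periodically refresh the projection from the current gradient's SVD.
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                                   # (m, r) projection
        # Optimizer moments live in the small (r, n) space -> the memory saving.
        state["m"] = torch.zeros(rank, grad.shape[1], device=grad.device)
        state["v"] = torch.zeros(rank, grad.shape[1], device=grad.device)
    state["step"] = state.get("step", 0) + 1

    P = state["P"]
    g_low = P.T @ grad                                             # (r, n) projected gradient
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g_low
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g_low ** 2
    update_low = state["m"] / (state["v"].sqrt() + eps)
    weight -= lr * (P @ update_low)                                # project back to full size
    return state
```

Note that the full-rank weights are still updated every step, so unlike LoRA this remains full-parameter training; only the gradient statistics and optimizer states are kept low-rank.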
Move beyond GaLore?
While training a 7B LLM under 24GB VRAM is an incredible step forward, popular desktop and laptop GPUs like the RTX 4060 Ti, equipped with up to 16GB of memory, are much more accessible and cost-effective. For instance, as of August 2023, an RTX 4060 Ti cost $499, compared to the $1599 price tag of an RTX 4090.
Recognizing this, we set 16GB of memory for 7B LLM training as our next goal. Achieving this required a careful and detailed study of how to effectively quantize each part of the GaLore algorithm, pushing the boundaries of what’s possible on more affordable hardware.
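To give a flavor of what "quantizing each part" means, here is a toy per-tensor symmetric quantizer applied to a projection matrix (for intuition only; the actual low-bit formats, group sizes, and kernels used in Q-GaLore differ):

```python
import torch

def quantize_symmetric(x, num_bits=8):
    """Toy per-tensor symmetric quantization: x ≈ q * scale, with q stored as int8."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

P = torch.randn(4096, 128)                        # hypothetical projection matrix
q, scale = quantize_symmetric(P, num_bits=8)
P_hat = dequantize(q, scale)
print("relative reconstruction error:", ((P - P_hat).norm() / P.norm()).item())
print("memory: fp32 %.2f MB -> int8 %.2f MB"
      % (P.numel() * 4 / 2**20, q.numel() / 2**20))
```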
Taking It Further: Q-GaLore
Building on GaLore, we present Q-GaLore, which combines quantization and low-rank projection to dramatically reduce memory usage. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others keep changing; and (ii) the projection matrices are highly resilient to low-bit quantization.
Q-GaLore hereby presents two key innovations that take memory efficiency to the next level: it adaptively updates the gradient subspace based on its convergence statistics, drastically reducing the number of SVD operations, and it keeps the projection matrices and weights in low-bit integer formats.
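Here is a rough sketch of what such a "lazy" subspace refresh can look like (the schedule, threshold, and similarity measure below are illustrative choices, not the exact recipe in the paper):

```python
import torch

def maybe_refresh_projection(grad, state, rank=128, check_every=200, sim_threshold=0.4):
    """Refresh the SVD-based projection only while the subspace is still moving."""
    step = state.get("step", 0)
    state["step"] = step + 1
    if step % check_every != 0 or state.get("frozen", False):
        return state                              # reuse the cached projection; no SVD this step

    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P_new = U[:, :rank]
    P_old = state.get("P")
    if P_old is not None:
        # Column-wise cosine similarity between the old and new singular vectors.
        sim = (P_old * P_new).sum(dim=0).abs().mean().item()
        if sim > sim_threshold:
            state["frozen"] = True                # subspace has converged: stop paying for SVDs
            return state
    state["P"] = P_new
    return state
```

Combined with keeping the projection matrices and weights in low-bit integer formats, this is what pushes the memory budget for 7B training down toward the 16GB target.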
Performance Highlights
More to be discovered in our paper! https://arxiv.org/abs/2407.08296
Get Started
Q-GaLore is available on GitHub and can be installed directly from PyPI (just pip install it!)
Our upcoming plans include integrating Q-GaLore into the Hugging Face Transformers library and LLaMA Factory, as well as scaling it to multiple GPUs with FSDP and DeepSpeed. Stay tuned!
Join us on this exciting journey to democratize LLM training, making it more accessible and efficient than ever before!
Thanks to the team! Zhenyu (Allen) Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, and Yuandong Tian.
#AI #MachineLearning #LLMs #QGaLore #GaLore #DeepLearning #Innovation #Research #NVIDIA #HuggingFace
Comments

Quantitative Analysis, Machine Learning, Quantum Computing · 4 months ago
This seems to apply rank reduction to the gradients/weights. Question: how do you select the appropriate rank for the decomposition?

3D generalist · 4 months ago
How much time did training take? And do models under 7B, like 3B or 1B, require less memory?

Nico · 4 months ago
Very impressive. I can potentially test this for the memory summary layer of MetaLearner.

Professor, Machine Learning Center, Princeton University · 4 months ago
Very helpful!

Quantitative finance leader specializing in long-term wealth growth and downside protection. Director with expertise in data-driven strategy, stakeholder management, and leading teams to deliver superior performance. · 4 months ago
The drop in SVD updates and the numerical stability sound quite cool. Thanks for this, Atlas!