Introducing Q-GaLore ("GaLore v2"): The Latest Milestone in Low-Rank LLM Training

We are thrilled to introduce Q-GaLore (or "GaLore v2"), a major upgrade to the well-received GaLore algorithm we introduced in February 2024.

TL;DR

  • Q-GaLore is to GaLore what QLoRA is to LoRA: it is your GaLore quantized from end to end! 8-bit weights, 8-bit optimizer states, and 4-bit projection matrices, plus further savings from far fewer SVD operations!
  • At pre-training, Q-GaLore enables training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory (down from the 24 GB GaLore requires).
  • At fine-tuning, Q-GaLore reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA (by up to 5.19 points on MMLU) at the same memory cost.

The Challenge

Large Language Models (LLMs) are incredibly powerful, but training or even fine-tuning them typically requires resources beyond the reach of most individuals and research groups. High-end GPUs have long been treated as a necessity, but can consumer-grade GPUs rise to the challenge?


The Insight

It’s widely acknowledged that LLM training demands vast memory and communication capacity, owing to massive datasets and the storage required for model parameters, gradients, and optimizer states.

While H100/A100 GPUs are the standard in data centers, consumer-grade GPUs like the NVIDIA RTX 4090, often found in high-end desktops and laptops, offer a tantalizing possibility: democratizing LLM training for everyone!

As highlighted by Naddod, “… the biggest difference between H/A100 and 4090 lies in memory and communication, not in computation power!”

Any chance to save us from hitting the GPU “memory wall”? Our answer: keep the model large, but the gradients small, by exploiting the gradients’ inherently low rank.


Introducing GaLore

The most prominent low-rank method for LLMs is LoRA, which adds low-rank weight adapters to each layer and reduces the memory footprint by optimizing only the adapters. However, LoRA supports fine-tuning but not pre-training, and it is an ad-hoc solution that changes the original optimization problem.
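
For intuition, here is a minimal sketch of the LoRA idea in PyTorch; the class name, rank, and scaling below are illustrative choices, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter: the frozen base weight W is augmented
    with a trainable low-rank update B @ A, so only r * (in + out) extra
    parameters are optimized instead of in * out."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```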

In contrast to LoRA, GaLore reduces memory by projecting the optimizer states and gradients into a lower-dimensional subspace obtained via Singular Value Decomposition (SVD); a minimal sketch follows the list below. GaLore made significant strides by:

  • Training models up to 7 billion parameters on consumer GPUs like the NVIDIA RTX 4090
  • Reducing memory for storing optimizer states by up to 82.5%
  • Combining with 8-bit optimizers for maximized memory efficiency
  • Outperforming LoRA on GLUE and on LLaMA pre-training on C4
  • Integrating into Hugging Face Transformers with galore_adamw or galore_adamw_8bit
  • Further reducing memory footprint with layer-wise updates
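
To make GaLore's mechanics concrete, here is a hedged sketch of one GaLore-style Adam step for a single 2-D weight (plain tensors, no autograd); the function name and defaults are illustrative, not the official implementation:

```python
import torch

def galore_adam_step(weight, grad, state, rank=128, update_gap=200,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of one GaLore-style Adam step for a 2-D weight of shape (m, n)."""
    if state.get("step", 0) % update_gap == 0:
        # Periodically refresh the projector with the gradient's top-r
        # left singular vectors -- the expensive SVD that GaLore amortizes.
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                          # (m, r)
    P = state["P"]
    g = P.T @ grad                                        # project gradient: (r, n)
    m = state.get("m", torch.zeros_like(g))
    v = state.get("v", torch.zeros_like(g))
    t = state.get("step", 0) + 1
    m = beta1 * m + (1 - beta1) * g                       # Adam moments live only
    v = beta2 * v + (1 - beta2) * g ** 2                  # in the r-dim subspace
    update = (m / (1 - beta1 ** t)) / ((v / (1 - beta2 ** t)).sqrt() + eps)
    weight -= lr * (P @ update)                           # project back to (m, n)
    state.update(step=t, m=m, v=v)
```

The memory saving comes from the moments m and v being (r, n) instead of (m, n). On the Hugging Face side, recent Transformers releases expose this via TrainingArguments(optim="galore_adamw", optim_target_modules=[...]); check the current docs for exact argument names.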


Moving beyond GaLore?

While training a 7B LLM under 24 GB of VRAM is an incredible step forward, popular desktop and laptop GPUs like the RTX 4060 Ti, equipped with up to 16 GB of memory, are much more accessible and cost-effective. For instance, as of August 2023, an RTX 4060 Ti cost $499, compared to the $1,599 price tag of an RTX 4090.

Recognizing this, we set training a 7B LLM within 16 GB of memory as our next goal. Achieving it required a careful, detailed study of how to effectively quantize each part of the GaLore algorithm, pushing the boundaries of what’s possible on more affordable hardware.


Taking It Further: Q-GaLore

Building on GaLore, we present Q-GaLore, which combines quantization and low-rank projection to dramatically reduce memory usage. Our method is based on two key observations:

  1. GaLore requires regular updates of the gradient subspace through computationally expensive SVD operations (e.g., every 200 iterations); a single subspace update takes roughly 10 minutes for the LLaMA-7B model. However, we discovered that the gradient subspace behaves very differently from layer to layer: some layers converge remarkably early in training, while others keep changing toward the end (see the monitoring sketch after this list).
  2. We also found that the projection matrices are numerically resilient to low-bit quantization. However, low-precision training tends to cause instability and fails to closely track the trajectory of high-precision training (which the original GaLore strives to match).
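
One way to quantify the first observation is to measure how much a layer's projector moves between successive SVDs. The helper below is a hypothetical monitoring sketch, not code from the paper; comparing columns pairwise assumes the singular vectors keep a stable ordering:

```python
import torch
import torch.nn.functional as F

def subspace_drift(P_old: torch.Tensor, P_new: torch.Tensor) -> float:
    """Mean absolute cosine similarity between corresponding columns of two
    successive projection matrices. Values near 1 suggest the layer's
    gradient subspace has effectively converged, so its next SVD can wait."""
    cos = F.cosine_similarity(P_old, P_new, dim=0)  # per-column similarity
    return cos.abs().mean().item()
```

A layer whose drift stays near 1 across several checks is a candidate for a longer SVD interval, which is exactly the signal the adaptive lazy update below exploits.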

Q-GaLore therefore introduces two key innovations that take memory efficiency to the next level:

  • Adaptive Lazy Subspace Update: updating the gradient subspace "lazily," only when needed, based on layer-wise convergence statistics. This cuts the number of computationally expensive SVD operations by over 60% and alone saves more than 32 hours when training a 7B model!
  • Low-Precision Training and Projection: keeping projection matrices in INT4 and weights in INT8, with stochastic rounding that lets the low-precision parameters implicitly accumulate small gradient information, so low-precision weights can follow a high-precision training trajectory (see the rounding sketch after this list). This further reduces memory usage by more than 28%!
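
Here is a minimal sketch of the stochastic-rounding idea applied to an INT8 weight update; int8_weight_update, its single per-tensor scale, and the symmetric clamp are illustrative assumptions rather than the exact Q-GaLore kernel:

```python
import torch

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    """Round up with probability equal to the fractional part, so that
    E[round(x)] = x and sub-step-size updates survive on average."""
    floor = x.floor()
    return floor + (torch.rand_like(x) < (x - floor)).to(x.dtype)

def int8_weight_update(w_int8: torch.Tensor, scale: float,
                       update_fp32: torch.Tensor) -> torch.Tensor:
    # Apply a high-precision update to INT8 weights: move to the scaled
    # integer domain, add the update, stochastically round back, clamp.
    w = w_int8.to(torch.float32) + update_fp32 / scale
    return stochastic_round(w).clamp_(-128, 127).to(torch.int8)
```

Because the rounding is unbiased, repeated small updates accumulate in expectation, which is what lets INT8 weights track a high-precision trajectory.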


Performance Highlights

  • Facilitates pre-training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory
  • Reduces memory consumption by up to 50% compared to LoRA and GaLore during fine-tuning
  • Consistently outperforms QLoRA (by up to 5.19 on MMLU) at the same memory cost

More to be discovered in our paper! https://arxiv.org/abs/2407.08296


Get Started

Q-GaLore is available on GitHub and can be installed directly via PyPI (just pip it!).

Our upcoming plans include integrating Q-GaLore into the Hugging Face Transformers library and LLaMA Factory, as well as scaling it to multi-GPU training with FSDP and DeepSpeed. Stay tuned!

Join us on this exciting journey to democratize LLM training, making it more accessible and efficient than ever before!


Thanks to the team! Zhenyu (Allen) Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian

#AI #MachineLearning #LLMs #QGaLore #GaLore #DeepLearning #Innovation #Research #NVIDIA #HuggingFace

José Henry León

Quantitative Analysis, Machine Learning, Quantum Computing

4 months ago

This seems to apply rank reduction to the gradients/weights. Question: how do you select the appropriate rank decomposition level?


How much time did the training take? And do models under 7B, like 3B or 1B, require less memory?


Very impressive. I can potentially test this for the memory summary layer of MetaLearner.

梦迪 王

Professor, Machine Learning Center, Princeton University

4 months ago

Very helpful!

Louis Scott

Quantitative finance leader specializing in long-term wealth growth and downside protection. Director with expertise in data-driven strategy, stakeholder management, and leading teams to deliver superior performance.

4 months ago

The drop in SVD updates and the numerical stability sound quite cool. Thanks for this, Atlas!
