How to make LLMs more memory-efficient?
Cristiano De Nobili, PhD
Physicist ∣↑↓⟩ | Lead AI Scientist | Lecturer & Speaker
This is the 7th article of Beyond Entropy, a space where the chaos of the future, the speed of emerging technologies and the explosion of opportunities are slowed down, allowing us to turn (qu)bits into our dreams.
Today's post focuses on some recent methods to make LLMs more memory-efficient. The outline will be as follows:
You can read the full version of Beyond Entropy here with more references and details. Let’s start!
How to make LLMs more memory-efficient?
Despite the enormous success of proprietary LLM-based AI assistants such as GPT-4o or Claude 3.5 Sonnet, most startups, companies and institutions have to run their own customised versions of open-source models such as Llama-3 70B or Mixtral 8x22B.
However, high-performance models and their immense size pose significant challenges, such as huge training and inference costs, significant power requirements and considerable memory limitations for on-site deployment. To address this problem, many interesting approaches have recently focused on compressing and improving the computational and memory efficiency of LLMs.
The most popular paradigm, called Parameter-Efficient Fine-Tuning (PEFT), adapts LLMs by updating only a small number of parameters. For example, the well-known Low-Rank Adaptation (LoRA) technique adds a trainable low-rank update (the product of two small matrices) to the frozen pre-trained weights in each layer, drastically reducing the number of trainable parameters. However, there is often a performance gap between PEFT methods and full-parameter fine-tuning, with the former typically underperforming the latter.
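To make the LoRA idea concrete, here is a minimal PyTorch sketch of wrapping a frozen linear layer with a trainable low-rank update. The class name and the hyperparameters r and alpha are illustrative choices for this sketch, not the API of any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        # Only these two small matrices are trained: r * (in + out) parameters
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank correction; gradients flow only through A and B
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

For a layer of size 4096 x 4096 and r = 8, this trains roughly 65k parameters instead of ~16.8M, which is where the memory savings come from.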
Below, I would like to share three interesting recent works that go beyond traditional LoRA methods (and their quantized version, QLoRA) and aim to close this accuracy gap by improving the efficiency of full fine-tuning:
Keeping up with the growing number of new methods is a challenge. That is why I recommend consulting this repository, Efficient LLMs by Mi Zhang's group, where the most important papers on memory- and energy-efficient LLMs are constantly posted. In addition, in the full version of Beyond Entropy you can find two other memory-friendly methods that cleverly exploit Mixtures of Experts (MoEs) and Quantum Tensor Networks (by the amazing startup Multiverse Computing)!
To conclude, it is worth mentioning Berkeley's vLLM project, an open-source library for LLM inference and serving.
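As a rough sketch of how vLLM's offline inference interface is typically used (the model name, tensor_parallel_size, and sampling settings below are illustrative assumptions, to be adapted to your hardware and model of choice):

```python
from vllm import LLM, SamplingParams

# Load an open-source model; tensor_parallel_size shards it across GPUs.
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain LoRA in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Under the hood, vLLM's PagedAttention manages the KV cache in small blocks, which is what makes serving large models markedly more memory-efficient than naive batching.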
Interesting resources
Here is a selection of valuable resources:
Opportunities, talks, and events
I share some opportunities from my network that you might find interesting:
Job opportunities:
Research opportunities:
Other opportunities or events:
Thanks for reading! You can find the full version of Beyond Entropy here with further references and details. Until the next post!