登录查看更多内容

How to make LLMs more memory-efficient?

Cristiano De Nobili, PhD

Physicist ∣↑↓? | Lead AI Scientist | Lecturer & Speaker

发布日期: 2024年7月3日

This is the 7th article of Beyond Entropy, a space where the chaos of the future, the speed of emerging technologies and the explosion of opportunities are slowed down, allowing us to turn (qu)bits into our dreams.

Today's post will focus on some recent methods to make LLMs more memory efficient. The outline will be as follows:

How to make LLMs more memory-efficient?
A bunch of interesting resources;
?? Job & Research opportunities, talks, and events in AI.

You can read the full version of Beyond Entropy here with more references and details. Let’s start!

How to make LLMs more memory-efficient?

Despite the enormous success of proprietary LLM-based AI assistants such as GPT-4o or Claude 3.5 Sonnet, most startups, companies and institutions have to run their own customised version of open-source models such as Llama-3 70B or Mixtral 8x22B.

However, high-performance models and their immense size pose significant challenges, such as huge training and inference costs, significant power requirements and considerable memory limitations for on-site deployment. To address this problem, many interesting approaches have recently focused on compressing and improving the computational and memory efficiency of LLMs.

The most popular paradigm, called Parameter-efficient Fine-tuning (PEFT), attempts to adapt LLMs by updating a small number of parameters. For example, the well-known Low-Rank Adaptation (LoRA) is a technique that adds a trainable low rank matrix to the frozen pre-trained weights in each layer, reducing the number of trainable parameters. However, there is often a performance gap between PEFT methods and full parameters tuning, where the former approaches are typically underperforming the latter.

Below, I would like to share three interesting recent works that go beyond the traditional LoRA methods (and their quantized version QLoRA) and aim to minimise this accuracy gap by improving the efficiency of full fine tuning:

Sparse Matrix Tuning , Haoze He et al.: the paper identifies most significant sub-matrices in gradient update, updating only these blocks during the fine-tuning process;
Gradient Low-Rank Projection by Beidi Chen , Anima Anandkumar , Yuandong Tian and collaborators: a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA;
Representation Fine-Tuning by Stanford NLP Group: Given that representations (initial data processed by the network layers) encode rich semantic informations, authors develop methods to edit them instead of updating weights by operating on a frozen base model and learn task-specific interventions on hidden representations (GitHub repository).

Keeping up with the growing number of new methods is a challenge. That is why I recommend you consult this repository, Efficient LLMs by Mi Zhang 's group, where the most important papers on the topic of memory- and energy-efficient are constantly posted. In addition, in the full version of Beyond Entropy you can find other two memory-friendly methods that cleverly exploit Mixtures of Experts (MoEs) and Quantum Tensor Networks (by the amazing startup Multiverse Computing )!

To complete, it is worth mentioning Berkley's vLLM project , an open-source library for LLM inference and serving.

MIT Technology Review 8 个月前

Today’s AI is ‘alchemy,’ not science — what that means…

VentureBeat 1 年前

Learn AI Together - Towards AI Community Newsletter #20

Towards AI 7 个月前

Interesting resources

Here is a selection of valuable resources:

If you are an AI developer and want to get started with small-scale LLM projects, read this great blog-post series Yoga LLM & Yoga VLM by Vijayasri Iyer ;
Another interesting blog-series on LLMs: What We Learned from a Year of Building with LLMs by Eugene Yan and collaborators;
Often, especially when working on European funded AI projects, you are asked to estimate the carbon footprint of the models you are delivering. ML CO2 Impact is an easy to use tool to answer that question.

Opportunities, talks, and events

I share some opportunities from my network that you might find interesting:

?? Job opportunities:

Pi School is looking for a legal intern to work on AI compliance for two European projects, mitigating risks and promoting innovation within a sound legal framework and the AI Act (job details ).
ContinualIST , a newly founded efficient-AI startup, will soon open several positions. Do not miss to look their website (via Vincenzo Lomonaco, PhD );
Agricola Moderna , an Italian vertical farming company, is looking for a Plant Scientists for their R&D department.
AI Freelancers: if you are looking for short/medium term opportunities in Europe fill this form or spread it .

?? Research opportunities:

One PhD position in Quantum Analog Computing , and two Postdocs positions in Quantum Optimization at Pasqal and the University of Sherbrooke (via Lorenzo Niola );
Postdoctoral position at European Space Agency - ESA Φ-lab in AI for Earth Observation and Hydrology. Check Nicolas Longépé ’s recent post .

?? Other opportunities or events:

Call for startups in Quantum Computing for Earth Observation (QC4EO) launched by ESA (check also Sabrina Ricci ’s recent post );
Upcoming talk on Graph Theory for Orchestrating LLM Workflows by Ahmad Albarqawi at Pi School next week!

Thanks for reading! You can find the full version of Beyond Entropy here with further references and details. Until the next post!

How to make LLMs more memory-efficient?

Cristiano De Nobili, PhD

Physicist ∣↑↓? | Lead AI Scientist | Lecturer & Speaker

How to make LLMs more memory-efficient?

领英推荐

Interesting resources

Opportunities, talks, and events

Beyond Entropy

2,125 位关注者

更多精彩文章

社区洞察

其他会员也浏览了

Decoding Deep Learning with Yoshua Bengio

State of the Art in AI Research

Decoding Deep Learning with Yoshua Bengio

Beyond Simple Patterns and Predetermined Actions: The Journey from Perception to True Intelligence in AI

Decoding closed box Models with LIME (Local Interpretable Model-Agnostic Explanations)

Neurosymbolic AI and Fuzzy Logic

“Deep Learning is Rubbish” — Karl Friston & Yann LeCun Face Off at Davos 2024 World Economic Forum

The Layers of AI - Exploring the Whole System

Dear researchers, AI is coming for your lab coats

Is Attention All You Need? A Look at Hyena

How to make LLMs more memory-efficient?

领英推荐

Interesting resources

Opportunities, talks, and events

Beyond Entropy

2,125 位关注者

Decoding the Energy Efficiency of Quantum Hardwares

2024年10月10日

Addition bias: playing LEGO with LLMs

2024年9月19日

Decoding the Energy Footprint of AI

2024年8月7日

The era of Artificial Collective Intelligence (ACI) is about to start

2024年5月9日

Thermodynamic Computing: new hardwares for future AI algorithms

2024年4月3日

Tuning LLMs: a galaxy of endless possibilities

2024年2月14日

Quantum Computing: enthusiasm and scepticism still coexist

2024年1月3日

How can Philosophy elevate LLMs?

2023年10月31日

A dream LLM can turn into a deployment nightmare

2023年8月29日

社区洞察

其他会员也浏览了

Decoding Deep Learning with Yoshua Bengio

State of the Art in AI Research

Decoding Deep Learning with Yoshua Bengio

Beyond Simple Patterns and Predetermined Actions: The Journey from Perception to True Intelligence in AI

Decoding closed box Models with LIME (Local Interpretable Model-Agnostic Explanations)

Neurosymbolic AI and Fuzzy Logic

“Deep Learning is Rubbish” — Karl Friston & Yann LeCun Face Off at Davos 2024 World Economic Forum

The Layers of AI - Exploring the Whole System

Dear researchers, AI is coming for your lab coats

Is Attention All You Need? A Look at Hyena