Topic 31: How to Reduce Memory Use in Reasoning Models


We explore how combining LightThinker and Multi-Head Latent Attention cuts memory use and boosts performance


AI models have shifted from thinking quickly (giving fast answers) to thinking more carefully by breaking problems into smaller steps. o1-like thinking, built on the Chain-of-Thought method, allows large reasoning models, such as OpenAI’s o1, o3, and DeepSeek-R1, to backtrack, retry, and refine their reasoning, making them even better at solving tricky problems. We discussed all the important aspects and advantages of scaling test-time compute in one of our previous episodes. However, there is a big issue: this kind of reasoning creates a lot of text (tokens), which takes up memory and slows things down, making processing more expensive. This is especially noticeable with Transformers – the more text they generate, the more memory and computing power they need. As large reasoning models become more prevalent, we must find ways to mitigate their weaknesses while fully exploring their potential for improvement.
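
To make the memory point concrete, here is a back-of-the-envelope sketch of how the key-value (KV) cache a Transformer keeps for attention grows with every generated token. The configuration below (32 layers, 32 heads, head dimension 128, 16-bit cached values) is a hypothetical example, not the spec of any particular model:

```python
# Back-of-the-envelope estimate of KV-cache growth in a standard Transformer
# decoder. The configuration is a hypothetical example, not the spec of
# o1, o3, or DeepSeek-R1.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    # Every generated token stores one key and one value vector per head, per layer.
    return 2 * n_layers * n_heads * head_dim * bytes_per_value * seq_len

for tokens in (1_000, 10_000, 100_000):
    print(f"{tokens:>7} tokens -> ~{kv_cache_bytes(tokens) / 1e9:.1f} GB of KV cache")
```

Under these assumptions the cache alone costs roughly half a megabyte per generated token, so a long reasoning trace quickly dominates memory. This is exactly the pressure that approaches like LightThinker and MLA try to relieve.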

Today we will focus on the problem of increased memory use and the longer processing times that result from it. If we can address this memory inefficiency, models can become more balanced and effective while maintaining their high accuracy. Two notable approaches have already been proposed to reduce memory usage in reasoning models: 1) LightThinker, which teaches models to summarize their own “thoughts” and continue solving tasks from these short, meaningful summaries; and 2) Multi-Head Latent Attention (MLA), a DeepSeek technique first proposed with DeepSeek-V2 and later used in DeepSeek-V3 and DeepSeek-R1.

Today we invite you to dive into these concepts with us and consider the potential benefits of blending them together.

In today’s episode, we will cover:

  • What is LightThinker?
  • What is Multi-Head Latent Attention (MLA)?
  • What if we blend LightThinker and MLA concepts?
  • Conclusion
  • Sources and further reading: explore the references used to write this article and dive deeper with all the links provided in this section


What is LightThinker?

The idea behind LightThinker

As we have already said, we need optimization methods that make high-quality reasoning models much faster and more efficient while avoiding high memory costs.

One of these methods is LightThinker, developed by the Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph. LightThinker doesn’t just cut out words or trim memory manually; it teaches the model to "summarize" its own “thoughts” while solving problems. Think of it like how people jot down key points instead of writing every detail. Let’s look at how it works in detail.

Image Credit: The original LightThinker paper

How does LightThinker work?

In general, instead of keeping long, detailed reasoning steps, LightThinker compresses them into shorter, essential summaries and then continues reasoning based on them.

What’s important is that LightThinker does two things (illustrated in the toy sketch after this list):

  • Decides when to compress reasoning steps.
  • Decides how to compress them.
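
To make these two decisions concrete, here is a minimal toy sketch in plain Python. Everything in it (the should_compress trigger, the compress summarizer, the reason loop, and the dummy step generator) is a hypothetical stand-in for what the trained model learns to do; it is not the paper’s implementation, only an illustration of the control flow.

```python
# Toy sketch of the LightThinker control flow: periodically collapse verbose
# reasoning steps into a short "gist" and drop the originals before continuing.
# The trigger, summarizer, and step generator below are illustrative stand-ins,
# not the mechanisms the model actually learns in the paper.

def should_compress(raw_steps, budget_words=30):
    # "When" to compress: here, once the uncompressed steps exceed a word budget.
    # (LightThinker instead trains the model to decide this during generation.)
    return sum(len(step.split()) for step in raw_steps) > budget_words

def compress(raw_steps):
    # "How" to compress: here, keep only the first sentence of each step.
    # (LightThinker instead condenses the steps into a few learned gist tokens.)
    return " ".join(step.split(".")[0] + "." for step in raw_steps)

def reason(question, generate_step, n_steps=6):
    gists, raw_steps = [], []
    for i in range(n_steps):
        # The model only sees the question, past gists, and recent raw steps,
        # never the full long-form reasoning history.
        visible = [question] + gists + raw_steps
        raw_steps.append(generate_step(visible, i))
        if should_compress(raw_steps):
            gists.append(compress(raw_steps))
            raw_steps.clear()  # long-form steps are discarded, freeing memory
    return gists + raw_steps

# Dummy step generator standing in for the LLM.
dummy = lambda visible, i: (
    f"Step {i}: expand the problem in some detail. Extra detail follows here."
)
print(reason("What is 17 * 24?", dummy))
```

The point the sketch captures is that once a batch of steps is compressed, the model never looks at the long-form text again: it continues reasoning from the question, the accumulated gists, and only the most recent uncompressed steps.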


You can READ THIS ARTICLE FOR FREE on our page on Hugging Face. Follow us there.

Or upgrade if you want to be the first to receive the full articles with detailed explanations and curated resources directly in your inbox. Simplify your learning journey → UPGRADE

