Topic 31: How to Reduce Memory Use in Reasoning Models


We explore how combining LightThinker and Multi-Head Latent Attention cuts memory use and boosts performance


AI models have shifted from thinking quickly (giving fast answers) to thinking more carefully by breaking problems into smaller steps. o1-like thinking, built on the Chain-of-Thought method, allows large reasoning models, such as OpenAI’s o1, o3, and DeepSeek-R1, to backtrack, retry, and refine their reasoning, making them even better at solving tricky problems. We discussed all the important aspects and advantages of scaling test-time compute in one of our previous episodes. However, there is a big issue: this kind of reasoning creates a lot of text (tokens), which takes up memory and slows things down, making processing more expensive. This is especially noticeable with Transformers – the more text they generate, the more memory and computing power they need. As large reasoning models become more prevalent, we must find ways to mitigate their weaknesses while fully exploring their potential for improvement.
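
To make the memory point concrete, here is a back-of-the-envelope sketch of how the key-value (KV) cache a Transformer keeps for attention grows with every generated token. The configuration below (32 layers, 32 heads, head dimension 128, 16-bit cached values) is a hypothetical example, not the spec of any particular model:

```python
# Back-of-the-envelope estimate of KV-cache growth in a standard Transformer
# decoder. The configuration is a hypothetical example, not the spec of
# o1, o3, or DeepSeek-R1.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    # Every generated token stores one key and one value vector per head, per layer.
    return 2 * n_layers * n_heads * head_dim * bytes_per_value * seq_len

for tokens in (1_000, 10_000, 100_000):
    print(f"{tokens:>7} tokens -> ~{kv_cache_bytes(tokens) / 1e9:.1f} GB of KV cache")
```

Under these assumptions the cache alone costs roughly half a megabyte per generated token, so a long reasoning trace quickly dominates memory. This is exactly the pressure that approaches like LightThinker and MLA try to relieve.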

Today we will focus on the problem of increased memory use and the longer processing times that result from it. If we can address this memory inefficiency, models can become more balanced and effective while maintaining their high accuracy. Two notable approaches have already been proposed to reduce memory usage in reasoning models: 1) LightThinker, which teaches models to summarize their own “thoughts” and continue solving tasks from these short, meaningful summaries; and 2) Multi-Head Latent Attention (MLA), a DeepSeek technique first proposed with DeepSeek-V2 and later used in DeepSeek-V3 and DeepSeek-R1.

Today we invite you to dive into these concepts with us and consider the potential benefits of blending them together.

In today’s episode, we will cover:

  • What is LightThinker?
  • What is Multi-Head Latent Attention (MLA)?
  • What if we blend LightThinker and MLA concepts?
  • Conclusion
  • Sources and further reading: explore the references used to write this article and dive deeper with all the links provided in this section


What is LightThinker?

The idea behind LightThinker

As we have already said, we need optimization methods that make high-quality reasoning models much faster and more efficient while avoiding high memory costs.

One of these methods is LightThinker, developed by the Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph. LightThinker doesn’t just cut out words or trim memory manually; it teaches the model to "summarize" its own “thoughts” while solving problems. Think of it like how people jot down key points instead of writing every detail. Let’s look at how it works in detail.

Image Credit: The original LightThinker paper

How does LightThinker work?

In general, instead of keeping long, detailed reasoning steps, LightThinker compresses them into shorter, essential summaries and then continues reasoning based on them.

What’s important is that LightThinker does two things (illustrated in the toy sketch after this list):

  • Decides when to compress reasoning steps.
  • Decides how to compress them.
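
To make these two decisions concrete, here is a minimal toy sketch in plain Python. Everything in it (the should_compress trigger, the compress summarizer, the reason loop, and the dummy step generator) is a hypothetical stand-in for what the trained model learns to do; it is not the paper’s implementation, only an illustration of the control flow.

```python
# Toy sketch of the LightThinker control flow: periodically collapse verbose
# reasoning steps into a short "gist" and drop the originals before continuing.
# The trigger, summarizer, and step generator below are illustrative stand-ins,
# not the mechanisms the model actually learns in the paper.

def should_compress(raw_steps, budget_words=30):
    # "When" to compress: here, once the uncompressed steps exceed a word budget.
    # (LightThinker instead trains the model to decide this during generation.)
    return sum(len(step.split()) for step in raw_steps) > budget_words

def compress(raw_steps):
    # "How" to compress: here, keep only the first sentence of each step.
    # (LightThinker instead condenses the steps into a few learned gist tokens.)
    return " ".join(step.split(".")[0] + "." for step in raw_steps)

def reason(question, generate_step, n_steps=6):
    gists, raw_steps = [], []
    for i in range(n_steps):
        # The model only sees the question, past gists, and recent raw steps,
        # never the full long-form reasoning history.
        visible = [question] + gists + raw_steps
        raw_steps.append(generate_step(visible, i))
        if should_compress(raw_steps):
            gists.append(compress(raw_steps))
            raw_steps.clear()  # long-form steps are discarded, freeing memory
    return gists + raw_steps

# Dummy step generator standing in for the LLM.
dummy = lambda visible, i: (
    f"Step {i}: expand the problem in some detail. Extra detail follows here."
)
print(reason("What is 17 * 24?", dummy))
```

The point the sketch captures is that once a batch of steps is compressed, the model never looks at the long-form text again: it continues reasoning from the question, the accumulated gists, and only the most recent uncompressed steps.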


You can READ THIS ARTICLE FOR FREE on our page on Hugging Face. Follow us there.

Or upgrade if you want to be the first to receive the full articles with detailed explanations and curated resources directly in your inbox. Simplify your learning journey → UPGRADE

