An offline coprocessor enhances an LLM's KV-cache by adding extra "latent embeddings"
TuringPost
Google DeepMind proposed a method that enhances LLMs with an offline coprocessor that works with the model's internal memory (KV-cache).
What's the coprocessor's role?
It enhances the model's KV-cache by adding extra "latent embeddings" (compressed representations) for more accurate outputs.
What is good about it?
- The coprocessor operates independently, and the base LLM remains frozen.
- It operates offline and asynchronously, meaning it can improve the model’s memory in the background.
- If the coprocessor isn’t available or extra computation isn’t needed, the model still functions as usual.
- The model achieves lower perplexity.
- This method works across various tasks without additional fine-tuning.
Here are the details:
The interaction between the LLM and the coprocessor happens in 3 main steps:
1. The frozen LLM processes the input and produces its KV-cache as usual.
2. The coprocessor reads that KV-cache and generates latent embeddings, which are appended to the cache.
3. The LLM continues decoding from the augmented cache, conditioning on both the original context and the added embeddings.
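To make the data flow concrete, here is a minimal PyTorch sketch of these steps. The ToyCoprocessor class, the tensor shapes, and the generate helper are illustrative assumptions for this write-up, not the paper's implementation (in the paper the coprocessor is itself a trained model, not a toy MLP):

```python
# Conceptual sketch of the KV-cache augmentation loop described above.
# All names and shapes are hypothetical stand-ins, not the paper's code or any library API.
from typing import Optional

import torch
import torch.nn as nn


class ToyCoprocessor(nn.Module):
    """Maps a KV-cache to a fixed number of latent embeddings (assumption: a simple MLP)."""

    def __init__(self, d_model: int, num_latents: int):
        super().__init__()
        self.num_latents = num_latents
        self.proj = nn.Linear(d_model, d_model * num_latents)

    def forward(self, kv_cache: torch.Tensor) -> torch.Tensor:
        # kv_cache: [seq_len, d_model]; pool it and expand into latent embeddings.
        pooled = kv_cache.mean(dim=0)                            # [d_model]
        return self.proj(pooled).view(self.num_latents, -1)      # [num_latents, d_model]


def generate(kv_cache: torch.Tensor, coprocessor: Optional[ToyCoprocessor]) -> torch.Tensor:
    # Step 1: the frozen LLM has already encoded the prompt into kv_cache (a stand-in tensor here).
    # Step 2: if a coprocessor is available, it augments the cache with latent embeddings.
    if coprocessor is not None:
        latents = coprocessor(kv_cache)
        kv_cache = torch.cat([kv_cache, latents], dim=0)         # augmented cache
    # Step 3: the LLM would decode conditioned on the (possibly augmented) cache.
    # Decoding is out of scope for this sketch; we return the cache that would be used.
    return kv_cache


# Usage: with the coprocessor the cache grows by num_latents entries;
# without it, the base model's cache is returned unchanged (graceful fallback).
cache = torch.randn(10, 64)                                       # pretend KV-cache: 10 tokens, d_model=64
augmented = generate(cache, ToyCoprocessor(d_model=64, num_latents=4))
fallback = generate(cache, None)
print(augmented.shape, fallback.shape)                            # torch.Size([14, 64]) torch.Size([10, 64])
```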
Results of using the coprocessor:
Testing on reasoning-heavy tasks showed:
- 10.05% improvement on math reasoning (GSM8K).
- 4.70% improvement on MMLU (multitask language understanding).
These gains were achieved without any fine-tuning for specific tasks, highlighting the versatility of the method.
This method also showed significantly lower perplexity compared to baseline models.
Paper: https://arxiv.org/pdf/2412.17747