Topic 18: What is Mixture-of-Depths?
TuringPost
Explore how transformers can select different depths of processing and reduce compute needs
You may have noticed that transformers don’t regulate how much computing power they allocate to each token or word in a sequence. Yet some words require more attention, while others contribute less and can be processed more lightly (just as in real life). To use compute more efficiently, researchers from Google DeepMind, McGill University, and Mila have proposed a new method – Mixture-of-Depths (MoD). This approach optimizes how transformers spend computing power (FLOPs) by identifying which parts of an input sequence need more focus. Let’s dive deeper into the “depths” of transformers, exploring how MoD works, its impact on overall performance, and how it might benefit you.
In today’s episode, we will cover:
Too much computation in Transformers
Transformer models usually spend the same amount of computing power on every token, even though this isn’t necessary. They could save compute by focusing only on the tokens that need extra processing.
Conditional computation is a common way to manage computing power in transformers, typically through techniques like early exiting and Mixture-of-Experts (MoE). Early exiting lets the transformer stop processing certain tokens once they have been processed enough, skipping the remaining layers. In MoE transformers, tokens are routed only to the experts they need, leaving the other experts inactive for those tokens (see the sketch below).
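To make the routing idea concrete, here is a minimal PyTorch sketch (not the authors’ code) of MoE-style conditional computation: a learned router picks one expert per token, so the other experts do no work for that token. The class and parameter names (SimpleMoELayer, d_model, n_experts) are illustrative assumptions.

```python
# Minimal sketch of token-level conditional computation in the MoE style.
# A router assigns each token to one expert MLP; unselected experts are skipped.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        # One small feed-forward "expert" per routing choice.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # The router scores each token against each expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)                # flatten to (num_tokens, d_model)
        gate = F.softmax(self.router(tokens), dim=-1)  # routing probabilities
        choice = gate.argmax(dim=-1)                   # top-1 expert per token

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                # Only the tokens routed here pass through this expert;
                # every other expert stays inactive for them.
                out[mask] = expert(tokens[mask]) * gate[mask, i].unsqueeze(-1)
        return out.reshape(batch, seq_len, d_model)


# Usage: route a toy batch of token embeddings through the layer.
layer = SimpleMoELayer()
x = torch.randn(2, 10, 64)
print(layer(x).shape)  # torch.Size([2, 10, 64])
```

The key design choice is that compute follows the router’s decision rather than being spent uniformly; MoD applies the same routing idea to depth, deciding which tokens pass through a block at all.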
Can something be more efficient than the approaches we already have? How can we make transformers focus only on tokens that need more processing?
Here comes Mixture-of-Depths (MoD)