Mixture-of-Transformers (MoT)
TuringPost
Mixture-of-Transformers (MoT) is a multimodal LLM (MLLM) architecture from AI at Meta and Stanford University, designed to train MLLMs efficiently with less computing power.
It uses modality-specific networks for each type of input (text, images, and speech) while still sharing self-attention across all of the data.
Token processing: The input data is first converted into a sequence of tokens, and each token is assigned a label identifying its modality (e.g., text, image, or speech). This label tells the model which processing path to use: tokens are then processed with modality-specific parameters.
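A minimal sketch of this routing idea in PyTorch (the modality IDs, module names, and shapes here are illustrative, not the paper's exact implementation):

```python
import torch

# Hypothetical integer modality labels for a mixed-modality token sequence.
TEXT, IMAGE, SPEECH = 0, 1, 2

tokens = torch.randn(10, 512)                              # 10 tokens, hidden size 512 (illustrative)
modality = torch.tensor([TEXT] * 4 + [IMAGE] * 4 + [SPEECH] * 2)

# One projection per modality: each token is processed only by the
# parameters that match its modality label.
proj = torch.nn.ModuleList([torch.nn.Linear(512, 512) for _ in range(3)])

out = torch.empty_like(tokens)
for m in (TEXT, IMAGE, SPEECH):
    mask = modality == m
    out[mask] = proj[m](tokens[mask])
```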
Global self-attention: A shared global self-attention mechanism lets all data types interact within the same sequence, which helps the model learn how they relate to each other.
Modality-specific feed-forward networks: After the attention step, tokens are passed through feed-forward networks (FFNs), which refine the information. MoT uses a separate FFN for each data type.
Layer normalization: This step keeps the model stable during training. In MoT, each data type gets its own modality-specific layer normalization. The sketch below puts these pieces together.
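Here is a hedged sketch of a single MoT-style transformer block: the attention weights are shared across all tokens, while layer norms and FFNs are selected per modality. The class and parameter names are my own and the hyperparameters are illustrative; this is not the official implementation (it also omits details such as causal masking).

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Illustrative Mixture-of-Transformers-style block (not the official code)."""

    def __init__(self, d_model=512, n_heads=8, n_modalities=3):
        super().__init__()
        # Shared: one global self-attention over the whole mixed sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Modality-specific: separate layer norms and feed-forward networks.
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])

    def _route(self, modules, x, modality):
        # Apply the module that matches each token's modality label.
        out = torch.empty_like(x)
        for m, mod in enumerate(modules):
            mask = modality == m
            if mask.any():
                out[mask] = mod(x[mask])
        return out

    def forward(self, x, modality):
        # x: (batch, seq, d_model); modality: (batch, seq) integer labels.
        h = self._route(self.norm1, x, modality)
        attn_out, _ = self.attn(h, h, h)          # shared global self-attention
        x = x + attn_out
        h = self._route(self.norm2, x, modality)
        x = x + self._route(self.ffn, h, modality)
        return x
```

In a full model, blocks like this would be stacked; the sketch only shows how MoT splits the work, with attention shared across modalities and the non-embedding parameters (norms, FFNs) kept separate per modality.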
MoT's results:
- Text and image tasks: MoT matched the performance of traditional dense models while using about 45% less computing power.
- With speech included, MoT achieved similar performance to the dense model with only 37.2% of the FLOPs.
- Specialized tasks: In settings where text and images required different training methods, MoT outperformed larger traditional models while using much less computing power.
Overall, MoT makes training faster and cheaper: on high-end GPUs, it can reach comparable quality in roughly half the time it takes traditional dense models.