Mixture-of-Transformers (MoT)

Mixture-of-Transformers (MoT) is a multimodal LLM (MLLM) architecture from AI at Meta and Stanford University, designed for efficient training of MLLMs using less computing power.

It uses modality-specific networks for each type of input (text, images, and speech) while still sharing self-attention across all the data.

Token processing: The data is first converted into a sequence of tokens, and each token is assigned a label identifying its modality (e.g., text, image, speech). This label tells the model which processing path to use, and the tokens are then processed with modality-specific parameters.
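For illustration, a minimal PyTorch sketch of an interleaved token sequence with a parallel tensor of modality labels; the shapes, label values, and variable names here are hypothetical, not taken from the paper:

```python
import torch

# Hypothetical modality labels for each token in the interleaved sequence.
TEXT, IMAGE, SPEECH = 0, 1, 2

tokens = torch.randn(10, 512)   # 10 token embeddings, hidden size 512 (illustrative)
modality = torch.tensor([TEXT] * 4 + [IMAGE] * 4 + [SPEECH] * 2)

# Every non-attention layer can now select which parameter set to apply per token.
text_tokens = tokens[modality == TEXT]   # e.g., pick out only the text tokens
```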

Global self-attention: A shared global self-attention mechanism lets all data types interact within the same sequence, which helps the model learn how the modalities relate to each other.
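A rough sketch of this step is below. The whole attention module is shared here for brevity; the essential point is that one attention operation spans the full interleaved sequence, so text, image, and speech tokens all attend to one another (the module choice and sizes are illustrative, not the paper's exact implementation):

```python
import torch

# One self-attention over the full interleaved sequence (illustrative sizes).
attn = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 512)      # a batch of interleaved text/image/speech tokens
attended, _ = attn(x, x, x)      # every token can attend to every other token,
                                 # regardless of modality
```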

Modality-specific feed-forward networks: After the attention step, tokens pass through feed-forward networks (FFNs), which refine the information. MoT uses a separate FFN for each data type.
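A minimal sketch of that routing, assuming one two-layer MLP per modality and the modality labels from the earlier snippet (names and sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

d_model = 512
# One FFN per modality: 0 = text, 1 = image, 2 = speech (illustrative sizes).
ffns = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                  nn.Linear(4 * d_model, d_model))
    for _ in range(3)
])

tokens = torch.randn(10, d_model)
modality = torch.tensor([0] * 4 + [1] * 4 + [2] * 2)

out = torch.zeros_like(tokens)
for m, ffn in enumerate(ffns):
    mask = modality == m
    out[mask] = ffn(tokens[mask])   # each token goes through its own modality's FFN
```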

Layer normalization: This step keeps the model stable during training. In MoT, layer normalization is also modality-specific, so each data type gets its own normalization parameters.
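Putting the pieces together, here is a sketch of one MoT-style transformer block: layer norms, attention projections, and FFNs are replicated per modality, while the self-attention itself is computed globally over the whole sequence. The pre-norm layout, single attention head, and hyperparameters are assumptions for illustration and may differ from the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MODALITIES = 3  # 0 = text, 1 = image, 2 = speech


def per_modality(make):
    """One copy of a module per modality."""
    return nn.ModuleList([make() for _ in range(NUM_MODALITIES)])


class MoTBlock(nn.Module):
    """Sketch of one block: global self-attention over the full sequence,
    with norms, attention projections, and FFNs decoupled by modality."""

    def __init__(self, d_model=512):
        super().__init__()
        self.norm1 = per_modality(lambda: nn.LayerNorm(d_model))
        self.norm2 = per_modality(lambda: nn.LayerNorm(d_model))
        self.q_proj = per_modality(lambda: nn.Linear(d_model, d_model))
        self.k_proj = per_modality(lambda: nn.Linear(d_model, d_model))
        self.v_proj = per_modality(lambda: nn.Linear(d_model, d_model))
        self.o_proj = per_modality(lambda: nn.Linear(d_model, d_model))
        self.ffn = per_modality(lambda: nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model)))

    @staticmethod
    def _route(x, modality, modules):
        """Apply to each token the module owned by that token's modality."""
        out = torch.zeros_like(x)
        for m, module in enumerate(modules):
            mask = modality == m
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality):
        # x: (seq_len, d_model); modality: (seq_len,) integer labels.
        h = self._route(x, modality, self.norm1)
        q = self._route(h, modality, self.q_proj)
        k = self._route(h, modality, self.k_proj)
        v = self._route(h, modality, self.v_proj)
        # Global self-attention: one softmax over the full interleaved sequence
        # (single head here for brevity).
        attended = F.scaled_dot_product_attention(
            q[None, None], k[None, None], v[None, None])[0, 0]
        x = x + self._route(attended, modality, self.o_proj)
        x = x + self._route(self._route(x, modality, self.norm2), modality, self.ffn)
        return x


block = MoTBlock()
tokens = torch.randn(10, 512)
modality = torch.tensor([0] * 4 + [1] * 4 + [2] * 2)
out = block(tokens, modality)   # shape: (10, 512)
```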


MoT's results:

- Text and image tasks: MoT matched the performance of dense (traditional) models while using roughly 45% less computing power.

- With speech included: MoT achieved performance similar to the dense model with only 37.2% of the FLOPs.

- Specialized tasks: In settings where text and images required different training objectives, MoT outperformed larger dense models while using much less computing power.

Overall, MoT makes training faster and cheaper: measured on high-end GPUs, it can match a dense model's quality in roughly half the wall-clock training time.


Paper: https://arxiv.org/pdf/2411.04996
