OLMoE: Open Mixture-of-Experts Language Models
Credit: https://arxiv.org/pdf/2409.02060

Today's paper introduces OLMoE, a new open-source language model that uses a Mixture-of-Experts (MoE) architecture. OLMoE-1B-7B has 7 billion total parameters but only uses 1 billion per input token, allowing it to achieve strong performance while being more efficient than traditional dense models. The authors release all aspects of their work openly, including model weights, training data, code and logs.

Method Overview

OLMoE uses a Mixture-of-Experts (MoE) architecture, which consists of multiple "expert" neural networks that specialize in processing different types of inputs. For each input, only a subset of these experts is activated, allowing the model to use fewer parameters per input while maintaining a large total parameter count.

The model has 64 experts per layer, but only 8 are activated for each input token. This allows OLMoE-1B-7B to have 7 billion total parameters while only using about 1 billion active parameters per input. A learned "router" network determines which experts to use for each input.
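To make the routing concrete, here is a minimal sketch of a sparse MoE feed-forward layer with a learned top-k router, in the spirit of OLMoE's 64-expert, 8-active design. The hidden sizes, the GELU expert MLP, and the softmax-then-top-k ordering are illustrative assumptions, not the exact OLMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE feed-forward block: route each token to its top-k experts."""
    def __init__(self, d_model=1024, d_hidden=2048, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # routing probabilities over all experts
        weights, idx = probs.topk(self.top_k, dim=-1)      # keep the 8 highest-scoring experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # for each of the 8 routing slots
            for e in idx[:, slot].unique().tolist():       # run each selected expert on its tokens
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 1024)                              # 4 toy token embeddings
print(MoELayer()(tokens).shape)                            # torch.Size([4, 1024])
```

Because only the 8 selected experts run for each token, the per-token compute tracks the roughly 1 billion active parameters rather than the full 7 billion total parameters.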

The authors pretrained OLMoE-1B-7B on 5.1 trillion tokens using a mix of web pages, code, scientific papers, and other text sources. They then fine-tuned it for instruction-following and preference learning to create OLMoE-1B-7B-INSTRUCT.

Results

OLMoE-1B-7B outperforms all available models with similar active parameter counts (around 1 billion) and even surpasses some larger models like Llama2-13B-Chat on certain benchmarks. The instruction-tuned version, OLMoE-1B-7B-INSTRUCT, performs competitively with models that have significantly more parameters.

The authors find that MoEs train roughly twice as fast as dense models with an equivalent number of active parameters. They also observe that experts in the model specialize in different domains and vocabulary, allowing for efficient use of the model's capacity.
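As a rough illustration of how such expert specialization could be inspected, the sketch below counts how often each expert lands in a token's top-8 for two batches of embeddings and compares the usage profiles. The random router weights and random inputs are stand-ins for illustration only; this is not the analysis pipeline used in the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def expert_usage(router, tokens, top_k=8):
    # Fraction of top-k routing slots each expert receives for a batch of token embeddings.
    probs = torch.softmax(router(tokens), dim=-1)
    _, idx = probs.topk(top_k, dim=-1)
    counts = torch.bincount(idx.flatten(), minlength=probs.shape[-1]).float()
    return counts / counts.sum()

router = nn.Linear(1024, 64, bias=False)                    # stand-in router with random weights
code_usage = expert_usage(router, torch.randn(512, 1024))   # stand-in for embeddings of code tokens
web_usage  = expert_usage(router, torch.randn(512, 1024))   # stand-in for embeddings of web tokens
print((code_usage - web_usage).abs().topk(5).indices)       # experts whose usage differs most between domains
```

With a trained model and real domain data in place of the stand-ins, a skewed difference between the two usage profiles would indicate domain-specialized experts.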

Conclusion

OLMoE demonstrates that Mixture-of-Experts models can achieve strong performance while being more efficient than traditional dense language models. By open-sourcing all aspects of their work, the authors aim to facilitate further research and development of MoE models in the broader AI community. For more information, please consult the full paper.

Congrats to the authors for their work!

Muennighoff, Niklas, et al. "OLMoE: Open Mixture-of-Experts Language Models." arXiv preprint arXiv:2409.02060 (2024).
