OLMoE: Open Mixture-of-Experts Language Models
Credit: https://arxiv.org/pdf/2409.02060

Today's paper introduces OLMoE, a new open-source language model that uses a Mixture-of-Experts (MoE) architecture. OLMoE-1B-7B has 7 billion total parameters but only uses 1 billion per input token, allowing it to achieve strong performance while being more efficient than traditional dense models. The authors release all aspects of their work openly, including model weights, training data, code and logs.

Method Overview

OLMoE uses a Mixture-of-Experts (MoE) architecture, which consists of multiple "expert" neural networks that specialize in processing different types of inputs. For each input, only a subset of these experts is activated, allowing the model to use fewer parameters per input while maintaining a large total parameter count.

The model has 64 experts per layer, but only 8 are activated for each input token. This allows OLMoE-1B-7B to have 7 billion total parameters while only using about 1 billion active parameters per input. A learned "router" network determines which experts to use for each input.
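To make the routing concrete, here is a minimal sketch of a sparse MoE feed-forward layer with a learned top-k router, in the spirit of OLMoE's 64-expert, 8-active design. The hidden sizes, the GELU expert MLP, and the softmax-then-top-k ordering are illustrative assumptions, not the exact OLMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE feed-forward block: route each token to its top-k experts."""
    def __init__(self, d_model=1024, d_hidden=2048, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # routing probabilities over all experts
        weights, idx = probs.topk(self.top_k, dim=-1)      # keep the 8 highest-scoring experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # for each of the 8 routing slots
            for e in idx[:, slot].unique().tolist():       # run each selected expert on its tokens
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 1024)                              # 4 toy token embeddings
print(MoELayer()(tokens).shape)                            # torch.Size([4, 1024])
```

Because only the 8 selected experts run for each token, the per-token compute tracks the roughly 1 billion active parameters rather than the full 7 billion total parameters.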

The authors pretrained OLMoE-1B-7B on 5.1 trillion tokens using a mix of web pages, code, scientific papers, and other text sources. They then fine-tuned it for instruction-following and preference learning to create OLMoE-1B-7B-INSTRUCT.

Results

OLMoE-1B-7B outperforms all available models with similar active parameter counts (around 1 billion) and even surpasses some larger models like Llama2-13B-Chat on certain benchmarks. The instruction-tuned version, OLMoE-1B-7B-INSTRUCT, performs competitively with models that have significantly more parameters.

The authors find that MoEs train roughly twice as fast as dense models with an equivalent number of active parameters. They also observe that experts in the model specialize in different domains and vocabulary, allowing for efficient use of the model's capacity.
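As a rough illustration of how such expert specialization could be inspected, the sketch below counts how often each expert lands in a token's top-8 for two batches of embeddings and compares the usage profiles. The random router weights and random inputs are stand-ins for illustration only; this is not the analysis pipeline used in the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def expert_usage(router, tokens, top_k=8):
    # Fraction of top-k routing slots each expert receives for a batch of token embeddings.
    probs = torch.softmax(router(tokens), dim=-1)
    _, idx = probs.topk(top_k, dim=-1)
    counts = torch.bincount(idx.flatten(), minlength=probs.shape[-1]).float()
    return counts / counts.sum()

router = nn.Linear(1024, 64, bias=False)                    # stand-in router with random weights
code_usage = expert_usage(router, torch.randn(512, 1024))   # stand-in for embeddings of code tokens
web_usage  = expert_usage(router, torch.randn(512, 1024))   # stand-in for embeddings of web tokens
print((code_usage - web_usage).abs().topk(5).indices)       # experts whose usage differs most between domains
```

With a trained model and real domain data in place of the stand-ins, a skewed difference between the two usage profiles would indicate domain-specialized experts.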

Conclusion

OLMoE demonstrates that Mixture-of-Experts models can achieve strong performance while being more efficient than traditional dense language models. By open-sourcing all aspects of their work, the authors aim to facilitate further research and development of MoE models in the broader AI community. For more information, please consult the full paper.

Congrats to the authors for their work!

Muennighoff, Niklas, et al. "OLMoE: Open Mixture-of-Experts Language Models." arXiv preprint arXiv:2409.02060 (2024).
