A Mixture of Experts: A revolutionary technique to boost generative AI performance?
https://huggingface.co/blog/moe

Mixture of experts (MoE) is a machine learning technique in which multiple expert networks (learners) are used to divide a problem space into homogeneous regions. It differs from ensemble techniques in that, typically, only one or a few expert models are run for each input, whereas ensemble techniques run every model on every input.

The MoE model is based on the idea of combining the expertise of multiple specialized models to achieve superior performance. In machine learning, this principle is applied by using multiple expert networks, each focusing on distinct facets of the data, to improve overall performance. The essence of MoE lies in building a dynamic system in which the strengths of different experts are drawn on depending on the input data, enabling more adaptable and precise predictions than a single model could deliver.

The foundations of Mixture of Experts (MoE) can be traced back to the 1991 paper "Adaptive Mixtures of Local Experts." Similar to ensemble methods, the idea was to use a supervised procedure for a system composed of separate networks, each handling a different subset of the training cases. These separate networks, or experts, specialize in different regions of the input space. The choice of expert is determined by a gating network, which assigns a weight to each expert. During training, both the experts and the gating network are trained.
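For intuition, this classic formulation can be read as a weighted combination of expert outputs, with the gating network producing the weights. Below is a minimal PyTorch sketch of such a dense ("soft") mixture; the linear experts and layer sizes are illustrative assumptions, not the exact setup of the 1991 paper.

```python
import torch
import torch.nn as nn

class SoftMixtureOfExperts(nn.Module):
    """Dense (soft) MoE: every expert processes every input, and a
    gating network produces weights that sum to 1 across experts."""
    def __init__(self, dim_in, dim_out, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim_in, dim_out) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim_in, num_experts)

    def forward(self, x):                              # x: (batch, dim_in)
        weights = torch.softmax(self.gate(x), dim=-1)  # (batch, num_experts)
        expert_outs = torch.stack(
            [expert(x) for expert in self.experts], dim=1
        )                                              # (batch, num_experts, dim_out)
        # Output is the gate-weighted sum of the expert outputs.
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)
```

Both the gate and the experts are ordinary parameters, so training them jointly by backpropagation is what "both the experts and the gating network are trained" amounts to in practice.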

Between 2010 and 2015, advancements in two distinct research areas contributed to the later development of MoEs:

· Experts as components: Traditionally, a MoE consisted of a gating network and multiple experts forming the entire system. However, researchers such as Eigen, Ranzato, and Ilya Sutskever explored MoEs as components within deeper networks. This approach allowed MoEs to function as layers within a multilayer network, enabling models to be both large and efficient at the same time.

· Conditional computation: Traditional networks process all input data through every layer. During this period, Yoshua Bengio investigated methods to dynamically activate or deactivate components based on the input token.

These advancements laid the groundwork for exploring mixtures of experts in Natural Language Processing (NLP). Notably, the work by Shazeer et al. (2017), whose authors included Geoffrey Hinton and Jeff Dean, scaled this idea to a 137B-parameter LSTM (Long Short-Term Memory), the prevailing NLP architecture at the time, originally proposed by Hochreiter and Schmidhuber. This work introduced sparsity to maintain fast inference even at very large scale. While focused on translation tasks, it encountered challenges such as high communication costs and training instabilities.

How does MoE work with transformers?

In the context of transformer models, a MoE consists of two main elements:

1. Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!

2. A gate network or router that determines which tokens are sent to which expert. For example, the token “More” might be sent to the second expert, while the token “Parameters” is sent to the first. As we’ll explore later, a token can be sent to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network. A minimal sketch of such a layer follows this list.
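To make these two elements concrete, here is a minimal PyTorch sketch of a sparse MoE layer: a learned linear router picks the top-k experts for each token, and each expert is an ordinary two-layer FFN. The dimensions, expert count, and loop-based dispatch are illustrative assumptions rather than the implementation of any particular model; real systems add load-balancing losses and expert capacity limits on top of this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """A standard two-layer transformer feed-forward block used as one expert."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SparseMoELayer(nn.Module):
    """Drop-in replacement for a dense FFN layer: a learned router
    sends each token to its top_k experts and mixes their outputs."""
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [ExpertFFN(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)  # trained with the rest of the network
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, d_model)
        logits = self.router(x)                        # (num_tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)           # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # each token's 1st, 2nd, ... choice
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Example: route 16 tokens of width 512 through the layer.
tokens = torch.randn(16, 512)
print(SparseMoELayer()(tokens).shape)                  # torch.Size([16, 512])
```

A layer like this would replace the dense FFN inside a transformer block, with the token sequence flattened to shape (num_tokens, d_model) before routing.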

Although MoEs provide benefits like efficient pretraining and faster inference compared to dense models, they also come with challenges:

· Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.

· Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high. For example, given a MoE like Mixtral 8x7B, we’ll need enough VRAM to hold a dense 47B-parameter model. Why 47B parameters and not 8 x 7B = 56B? That’s because in MoE models, only the FFN layers are treated as individual experts, and the rest of the model parameters are shared. At the same time, assuming just two experts are used per token, the inference speed (FLOPs) is like using a 12B model (as opposed to a 14B model), because it computes 2x7B matrix multiplications, but with some layers shared (more on this soon).
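The arithmetic behind those figures can be written out explicitly. The split between shared and per-expert parameters below is an assumed, illustrative breakdown (roughly 1.3B shared plus 5.7B per expert FFN, so that one "7B" expert slice is shared parameters plus one FFN); it is not the official Mixtral accounting, but it shows why the totals land near 47B loaded versus roughly 12-13B active per token rather than 14B.

```python
# Illustrative parameter accounting for a Mixtral-8x7B-style MoE.
# The split below is an assumed breakdown, not official figures:
# only the FFN experts are replicated; attention, embeddings and
# norms are shared across experts.
shared_params = 1.3e9             # shared (non-expert) parameters -- assumed
ffn_per_expert = 5.7e9            # FFN parameters of one expert   -- assumed
num_experts = 8
experts_per_token = 2

total_params = shared_params + num_experts * ffn_per_expert            # must fit in VRAM
active_params = shared_params + experts_per_token * ffn_per_expert     # compute per token
naive_estimate = experts_per_token * (shared_params + ffn_per_expert)  # the "2 x 7B" guess

print(f"loaded in memory : {total_params / 1e9:.1f}B")    # ~46.9B, i.e. ~47B
print(f"active per token : {active_params / 1e9:.1f}B")   # ~12.7B
print(f"naive 2 x 7B     : {naive_estimate / 1e9:.1f}B")  # 14.0B
```

The gap between the loaded and active counts is exactly the six expert FFNs that sit in memory but are not touched for a given token.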

What are the business benefits?

Mixture of Experts (MoE) LLM models have the potential to significantly impact businesses by improving both efficiency and performance in various tasks. Here's a breakdown of the key implications:

Reduced Training Costs: MoE models can match the quality of dense LLMs while activating only a fraction of their parameters per token, leading to faster and cheaper pretraining. This translates into cost savings for businesses looking to develop or utilize LLMs.

Lower Computational Requirements: By routing each input to only a few specialized experts, MoE models require less compute per token at inference. This makes them attractive for latency-sensitive or real-time applications, provided there is enough memory to hold all the experts.

Task Specialization: Experts within the MoE architecture can be trained for specific tasks, leading to better performance in those areas compared to a general-purpose LLM. Businesses can leverage this to create LLMs tailored to their specific needs.

Flexibility and Adaptability: New experts can be added to an MoE model to address new tasks or improve performance in existing ones. This allows businesses to adapt their LLMs as their needs evolve.

