Mixture of Experts & Sparse Mixture of Experts in Superintelligence

Rob Smith, Executive Director AI & Risk, Dark Architect


This is an excerpt from a chapter in the latest Tales From the Dark Architecture book, Tales From the Dark Architecture III: Building AGI & Superintelligence, which continues the TFDA series and the Artificial Superintelligence Handbook series. The series, available globally on Amazon, is a peek inside the dev labs, at the roadmaps and whiteboards, and inside the minds of those who build the most cutting-edge AI. Many of the concepts and designs in this book are fundamental to building generative AI, Artificial General Intelligence (AGI) and beyond to Artificial Superintelligence (ASI), but some are novel to our own Cognitive AI lab or the work of our clients.

https://a.co/d/8KrZ1og

Will Superintelligence end humanity? No one knows for certain but what is certain is that Superintelligence has the capacity to improve our world.

We need not fear Superintelligence; we need only fear those who build and fund it, what lies in their hearts, their true motivations and the extent of their intelligence.

This chapter looks at designs for Mixture of Experts in Artificial General Intelligence and Artificial Superintelligence and builds a foundation for the following chapter on AI Agents and distributed Superintelligence.



To understand the design and role of expert architectures in advanced AI, and their extension into Mixture of Experts (MoE), we need to begin with the development that instigated it all: the transformers used in Large Language Models (LLMs) and diffusion-driven image models. The goals of these models are to comprehend language in order to communicate like a human using natural language processing, and to comprehend visual and other perceptions for classification using computer vision and other perceptive architectures. I have written extensively about LLMs and computer vision in the TFDA series, and about their foundations in the ASIH series, so I won't rehash the detail here. The high-level view is that LLMs use a corpus or index of words and vast volumes of text and speech in transformer-based neural net architectures (see the chapters in this book on transformers and math, or other content I have released on building and using transformers in AI).

These systems apply transformer architectures to comprehend the relationship between words and their use as contextual relevance, and to persist this learning within matrices of weights (parameters) that can be applied to new stimuli such as an input or prompt. The systems learn how to interpret language to produce a response to a prompt, essentially mimicking the human use of language to an exceptional degree. However, we must never forget that these machines don't see words; they see numbers, or more accurately binary states. The same foundation of deep learning methods (math) is also used by AI systems to recognize and learn images and their classifications and to produce novel image outputs from natural language stimuli (e.g. a language prompt). Sound is also learned and produced (i.e. as a response) to stimuli, most often in the form of a language prompt. Note that older SoundAI and ImageAI rely on annotations, the pairing of a text description with training elements, while newer architectures apply context from LLMs and novel layered annotation to training meta-information, as well as multimodal augmentation structures (one prompt for multiple in-context responses and flows, such as songs with lyrics and singing, or comprehensive multimodal video streams). There is now a plethora of other 'prompts' being implemented in advanced AI architectures, including direct communication from other AI systems, such as the agent-to-agent sub-task distribution and planning seen in newer robotics and self-driving cars, task distribution and recompilation, and so on. This is the starting point for the blending of MoE and agency architectures that extend LLMs, both as instructions for expert agents and now as methods of reflection and reasoning by those same expert agents (more on agents, expert agents, Mixture of Agent Models or MAM, and agency architectures in the next chapter).

Mixture of Experts is an architecture applied in machine learning and generative AI, including LLMs, to distribute and focus effort and action by using multiple specialized trained models, each applied to specific aspects of diverse input stimuli. Think of a group of players, each with unique training, skills and experience, navigating an MMO game at the same time and communicating their progress and learning to the other players on their team so the team can coordinate the next moves or tasks toward a goal. MoE is similar in that experts arise from different mixtures of skills, experience and training, with the most suitable expert chosen to respond to the next state.

MoE has its beginnings in very early machine learning research, with the original mixture-of-experts paper released in 1991 (Jacobs et al.), though others had been researching machine learning and generalization before that point, including in dev labs I was involved in later that decade in the Fintech industry and black-box hedge trading. Since then, hundreds of papers on variations and innovations to expert systems have been written and extensive research conducted. The goal of that early research was to produce a modified weight matrix that embeds a bias specific to an individual expert, laying the foundation of today's MoE systems. These biases are derived from context layers or windows that permit the systems to focus on or filter data, managing attention or derived knowledge based on relevance. In this way the parameters are adjusted to prioritize a subset of training data and learning as expertise. This vastly reduces compute overhead while achieving better attention and model responsiveness.

We see a similar specialization (or bias) in LLMs trained on specific data sets (e.g. medical text) or on subsets of data for purpose (the goal of the original research). However, the architecture has morphed to deliver value beyond distribution of attention or assignment of weight to an expert's prediction (the two original models of MoE). Other goals such as optimization, customization, resource efficiency and generalization are now part of the MoE architecture. The foundation of these structures is the gating architecture or network, a feedforward neural network that learns to assign the appropriate experts to the input space. The key is training the gating network together with the experts (also neural networks) so the MoE learns which experts to apply to an input. The mixture layer then uses the gating network's expert weights to combine the outputs of the experts.
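To make the gating network and mixture layer concrete, here is a minimal sketch of a classic dense MoE layer written in PyTorch. The expert count, layer sizes and the use of a simple softmax gate are illustrative assumptions on my part, not a reproduction of any particular production system.

```python
# Minimal dense Mixture of Experts sketch (illustrative sizes and expert count).
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int, n_experts: int):
        super().__init__()
        # Each expert is its own small feedforward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
            for _ in range(n_experts)
        ])
        # The gating network is also a feedforward net; trained jointly with the
        # experts, it learns which experts suit which inputs.
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                 # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)    # (batch, n_experts, d_out)
        # Mixture layer: combine expert outputs using the gate's weights.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)           # (batch, d_out)

moe = DenseMoE(d_in=16, d_hidden=32, d_out=16, n_experts=4)
y = moe(torch.randn(8, 16))  # gate and experts are trained end to end with the usual loss
```

In this dense form every expert runs on every input and the gate only weights their contributions; the sparse variants discussed later route each input to a small subset of experts instead.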

A foundational principle of MoE systems is that specialized learning and knowledge is distributed over nodes (experts) and then applied to inputs (e.g. text prompts) for subsequent processing by an appropriate specialist. This saves time and resources in learning and optimizes output. Humans develop specialized knowledge in certain areas over time, even to the point that we become experts in a specific subset of all the data, stimuli and responses in our world. As experts, we can share this expertise with others, and when we wish to act on a stimulus we are not trained or experienced in, we employ an appropriate specialist to save time and resources in the learning and output cycles. You can think of this as hiring someone trained in plumbing, medicine or car repair to fix your sink, back or vehicle instead of chewing through resources to learn all of these trades so you can fix your own problems. Instead we focus our resources on specific tasks. The benefit for us humans is that we can focus our resources on the things we do best and let other experts work on the things they do best. Then we can combine efforts to create a unified response to the stimuli of our world, with better results and greater optimization and efficiency. This is the foundation of expert models, with the 'mixture of experts' component acting as a control mechanism for selecting experts based on their relevance to the input.

This is the entire essence of expert systems, including Mixture of Experts, Sparse Mixture of Experts and so on. It also provides the dimensionality necessary to perceive generalization, which is essentially the perception of high-dimensional or dense context using fewer dimensions of perception. This is achieved by comprehending the variance between sets of perception (e.g. data) to surface consistencies between the sets. In your head you should begin to see a connection between variant expert knowledge and finding a generalization that represents the combined knowledge at a higher (or lower-dimensional) level of comprehension.

All of this implies that weighting functions are used to combine expert responses and outputs. However, some newer developments are breaking from this original design and evolving the architecture into network structures of relevance (i.e. the experts are automatically applied where most optimal, with novel weight blending and prioritization) based on the system's deep contextual comprehension using layers of context over dimensions (e.g. time steps). This is creating a foundation for a synapse-style general AI that optimizes not just the path to output but the blend of experts (nodes in an intelligence network) chosen to get there. It also permits robust balancing mechanisms, superalignment designs, interpretability, trust mechanisms and so on, all consistent with human intelligence and behavior. This is also opening up novel designs for AI interoperability and AI oversight (i.e. AI systems monitoring other AI systems) that I discuss elsewhere.

Input Space vs Training Data vs Subsets vs Parameters

One of the tricky parts of MoE is understanding the difference between the 'input space', training data, subsets, parameters and sub-parameters. The input space is the set of all possible input data available to the model to process. It is defined by both the available data and the system architecture; for example, an AI that implements MoE for video will fail if the system cannot perceive video. The actual data in the input space varies and does not need to be words for an LLM to process. It can be any type of data, such as labeled images, raw data from a company, state progressions, or other forms of specialized data like medical records or imaging, scientific data, and so on. What is important is that the data is focused on by one or more experts. If this is not the case, then expertise is not developed by the training, and the result is simply a cognitive contextual classification trick with no capacity for generalization. We humans are experts in an area based on our knowledge, through experience and learning, of the relationships and relevance of all elements within our domain of expertise, and our ability to interpret and respond (act) on a stimulus using that expertise even if we have never seen the input before, as long as it falls within our domain. At its core, expertise is contextual even when it is physical (i.e. an athlete), and at its foundation, all expertise is optimized contextual stimulus/response.

The function of an MoE architecture is to apply the appropriate expert to a generalized input (stimulus) based on an assessment of the input's characteristics (i.e. its relevance to a context). Although not exclusively applied to LLMs, most expert systems today involve some level of application to NLP. This makes sense given that context is inherent in LLMs by default, even though we are just beginning to apply it in long attention structures and for deeper artificial cognition. The power of the 'mixture' component of the architecture is the application of more than one expert to learn something in a variant way and then bring the experts together to produce a consensus. The alternative is to have a group of entities each learn all the input data and then try to reach a bulk consensus. This is the equivalent of a group of people subdividing a task as opposed to each individual working on the entirety of the task (the task here being 'learning and response'). Clearly MoE is the more efficient architecture, although admittedly the 'mass model' will produce more comprehensive and generalized results within a single expertise, and the inherent cognitive efficiency of MoE is why many AI companies are pursuing edge agency (see the chapter on agents). In MoE there is a trade-off between comprehensive, deep, generalized expertise across domains (one expert to rule them all) and distribution with its inherent gaps in knowledge (the specialization trap).

Getting back to data: to be a true expert, one must have either direct experience or training on data specific to the expertise (focus and attention, or in newer models, state persistence). This is simplistic when we think of using data specific to a domain like medical scans or medical results (i.e. bloodwork). Experts in this case can be applied in the same way as any deep learning generative model to develop output specific to the domain of expertise (i.e. they develop their own parameters). However, there are flavors to this model when large data sets are used for training experts, or where subsets of existing learning are relevant. This last point is the element of specialization with a learned set of parameters. It is important to note that LLMs and MoEs are different flavors of the same structures, layers of neural nets and so on, and while each achieves different results and presents unique opportunities, they are distinct. That said, MoE models are applied as expert-variant LLMs (e.g. Mixtral). This doesn't mean the two cannot be combined, and they are. Just as humans apply a subset of language to express their expertise, MoE architectures leverage LLMs for advantage, especially in contextual comprehension (i.e. context windows and context tokens) and in the kind of adaptation or evolution used in AI agent architectures.

Subsets of existing parameters (weights and biases) are a relatively new model whereby the output parameters of one model are leveraged by applying experts to a subset of those parameters (a focus or mask) for the purposes of fine tuning. Newer fine-tuning architectures, including parameter masking and sparsity, have immense future potential in areas such as generalization in AGI. It is important to note that expertise can form through the selection of training data, or through specialization of training where one layer of training is used to focus further training. I will use both elements relatively interchangeably in this content, so be aware. I will also use another concept called masking, which is used in parameter fine tuning. Masks are applied in computer vision (convolution) and other areas, but they are also applied in very advanced systems as a method of attention. We see this in humans when we block out stimuli, learn to focus attention or explore beyond the reach of our learning. I often meet and work with truly unlikable people, and if I do not mask my own views, opinions and inner personality, I will neither find common ground nor attain any benefit from my interaction with them. Instead I use masking to seek more positive value from interactions and deprecate any negative value of the interaction. We all do this to some degree (and I wish I had done it more in life).
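As a rough illustration of the parameter-masking idea above, here is a minimal sketch assuming PyTorch. A single linear layer stands in for a pretrained model, the mask-selection rule (top 10% of weights by magnitude) is an arbitrary choice for illustration, and gradients outside the masked subset are simply zeroed during a fine-tuning step.

```python
# Hypothetical parameter-masking fine-tune sketch: only a masked subset of an
# existing model's weights is updated; everything else stays frozen.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for one layer of a pretrained model

# Build a binary mask over the weight matrix; here we (arbitrarily) pick the
# top 10% of weights by magnitude as the subset this "expert" may adjust.
with torch.no_grad():
    k = int(0.10 * model.weight.numel())
    threshold = model.weight.abs().flatten().topk(k).values.min()
    mask = (model.weight.abs() >= threshold).float()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, target = torch.randn(8, 16), torch.randn(8, 16)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()
model.weight.grad *= mask   # zero gradients outside the masked subset
model.bias.grad.zero_()     # keep the bias frozen as well in this sketch
optimizer.step()            # only the masked parameters move
```

The same pattern scales to masking whole blocks or layers of a large model, which is where the connection to sparsity and expert specialization becomes interesting.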

In newer MoE structures, the concept of masking or parameter activation is important because it is a fundamental part of low-resource, near-instant human cognition, and the mechanism that permits this is probabilistic intent (i.e. how we find cognitive needles in the vast haystack that is our memory). It is also fundamental to sparsity. To be certain, masking has two sides. Initially you might think of masking as applying a boundary, but in fact masking is essentially a ranking mechanism, with ranks below a cutoff (a single parameter) excluded for a particular point of cognitive presence (which does not imply it won't change over time). We can lift the mask when optionality fades or peek from under it for nearby relevance. In both cases, masks are fluid to some degree and driven by context. This is critical in attention, and we have all experienced it when carrying on a conversation. The mask causes us to focus on relevant data, but if the situation warrants, the ability to remove some or all of the mask to extend cognition is always available.
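To tie masking to sparsity, here is a minimal sketch of top-k gating under the same assumptions as the earlier dense example: the gate ranks all experts, masks out everything below the top k, renormalizes the surviving weights, and only the selected experts are evaluated. The top-2 setting and the sizes are arbitrary; this is the general pattern behind sparse MoE LLMs such as Mixtral, not a reproduction of any specific implementation.

```python
# Sparse (top-k) gating sketch: the gate ranks experts, masks all but the top
# k per input, and only the selected experts run (illustrative sizes).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                                # (batch, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)  # rank and keep only the top k
        weights = torch.softmax(top_vals, dim=-1)            # renormalize over the survivors
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                           # chosen expert per input
            w = weights[:, slot].unsqueeze(-1)
            for e in idx.unique():                           # run each chosen expert once
                sel = idx == e
                out[sel] += w[sel] * self.experts[int(e)](x[sel])
        return out

layer = SparseMoE(d_model=16, d_hidden=64, n_experts=8, top_k=2)
y = layer(torch.randn(4, 16))  # only 2 of the 8 experts run for each input
```

The mask here is fluid in exactly the sense described above: the same experts remain available, and a different input (a different context) simply surfaces a different top-k set from under the mask.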

Designing Mixture of Experts

The design of MoE systems is not hard to comprehend if you understand today's generative AI systems (detailed in the TFDA series). These generative systems train on data such that they form a matrix of weights and biases (parameters) that reflects abstractions of the real world (relationships and relevance) and transforms those abstractions into other abstractions for purpose (i.e. comprehending and then responding to an input stimulus or prompt in a contextually consistent manner). As mentioned earlier, at their core large language models (LLMs) are trained on vast collections of writing or speech content against a corpus of language, similar to using a dictionary to understand the context and association of words in any language, except that the machines use existing examples of language to determine the 'correct use' of words and do not directly comprehend or store 'meaning' in memory the way we humans do. Using this technique they can nonetheless mimic understanding to such a degree that it is nearly impossible to discern the difference. This is because generative AI systems perform a type of inference (the foundation of multiangulation noted elsewhere) from the writing samples as to the meaning or context of words or groups of words. This permits them to learn a 'meaning' even without a primary context corpus, as long as they can infer a context or association (relationship) and the relevance of words to each other (i.e. within an input prompt or output).

In this way AI systems do not need to 'see' something to comprehend it. However, even AI comprehension can be optimized when perception is combined with annotation, the labeling of perceptions or actions with words that can be related back to an LLM, or better still, to a Large Contextual Model or 'window'. This is where multi-modality is the gateway to perceptive generalization, since we humans learn generality via all forms of sensory stimuli, and fastest through visual perception. This training can then be applied for purpose, such as interpreting a stimulus or text input string and formulating an output response as text, sound, images or video. The systems achieve this by understanding words and perceiving their use to form a matrix of weights that encodes both relationship (associations of words) and relevance (probabilistic occurrence), and by applying this foundation to the weights needed to produce a multimodal response (interpreting text context to create a video of streaming image states). All perceptive output is derived from context of the kind encapsulated in language. The systems use these patterns as 'knowledge' to respond in some way (prediction) to a stimulus. The best known of these is the application of LLMs to language interpretation and response, which is familiar to almost anyone who has used ChatGPT or any other online generative AI, or who has ever had a conversation in life.

The downside of the very early original designs of deep learning AI was that they were immensely costly and required vast compute resources to get the job done. However, the ability of machines to comprehend basic language (encapsulated in parameters) is now relatively complete, with the only optionality left being more training data and deeper structures such as context or greater depth of sensory modality (e.g. video content). This means the word weight matrices are relatively set, except for new variations, unknown content and layers for depth of comprehension or specialization optimization. Yet running these older, larger structures against any and all stimuli still requires exceptional resources, even though the demand and cost of those resources have fallen dramatically. That raises the question: what if you don't need or want to process every word in a language? What if you just want some legal or medical advice? Do you really need the details of the language of mechanical engineering or physics? Or what if you require math input? Is the detail of all of Shakespeare's work really necessary or efficient? The answer is no, and worse, it is vastly inefficient both in learning and in application.

What is really needed in these scenarios is an expert in the field for the information you seek at a particular ‘point of cognitive presence’. You ask a doctor for medical advice, a lawyer for legal advice and an engineer when you decide to build a giant bridge in your backyard. You don’t mix them up unless you need information that crosses contexts like suing your doctor for botching your operations, or suing your engineers for botching the bridge build. In this case you need a mix of experts but you still don’t need an expert on Hollywood culture. However it is also valuable to have variance within the expertise. I employ a team of lawyers of slightly variant legal expertise when I require lawyers. Keep this notion of ‘division of labor’ in mind as the content continues.


Also included in this chapter:

Conditional Computation

Two Attention Heads are Better Than One

Building MoE

Sparse Mixture of Experts

The Future of Mixture of Experts
