Sparse Mixture of Experts + Instruction Tuning
Sangeetha Venkatesan
NLP Engineer | Information Retrieval | Insurance domain | RAG
Mixture of experts and instruction tuning are two fascinating topics adding tons of value to the LLM landscape: designing scalable, high-performance, compute-efficient models in a task-agnostic learning setting.
There is also a pattern of model distillation: expert models distilled into smaller models, so the simplified model acquires the wisdom of the expert model.
Sparse expert models: adding learnable parameters to an LLM without increasing inference cost, since only a subset of parameters acts on each example. Only the relevant experts in a layer are activated, which is what gives the efficiency and scaling benefits.
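As a rough back-of-the-envelope sketch of "more parameters, same per-token compute" (the layer sizes below are purely illustrative, not taken from the paper):

```python
# Illustrative numbers only: a hypothetical FFN sub-layer with d_model=4096, d_ff=16384.
d_model, d_ff = 4096, 16384
ffn_params = 2 * d_model * d_ff              # dense FFN weights (up + down projection)

num_experts, top_k = 8, 2                    # 8 experts, each token routed to 2 of them
moe_total_params = num_experts * ffn_params  # learnable parameters grow 8x ...
moe_active_params = top_k * ffn_params       # ... but per-token compute only grows 2x

print(f"dense FFN params per layer:  {ffn_params / 1e6:.0f}M")
print(f"MoE total params per layer:  {moe_total_params / 1e6:.0f}M")
print(f"MoE active params per token: {moe_active_params / 1e6:.0f}M")
```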
Instruction tuning: making models follow instructions. Expert models benefit from instruction tuning more than their raw dense counterparts.
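For intuition, instruction tuning just means training on examples cast into an instruction/response format. A minimal sketch of what such a formatted example might look like (the template and field names are hypothetical, not the exact FLAN templates):

```python
# Hypothetical instruction template; FLAN-style mixtures use many templates per task.
def format_instruction(instruction: str, input_text: str, target: str) -> dict:
    """Turn a raw (input, target) pair into an instruction-following training example."""
    prompt = (
        f"{instruction}\n\n"
        f"Input: {input_text}\n"
        f"Answer:"
    )
    return {"prompt": prompt, "target": f" {target}"}

example = format_instruction(
    instruction="Classify the sentiment of the review as positive or negative.",
    input_text="The claims process was quick and painless.",
    target="positive",
)
```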
Overall, three experiments:
- Direct fine-tuning on a downstream task without instruction tuning (task-specific fine-tuned mixture-of-experts models)
- Instruction fine-tuning followed by in-context evaluation (zero- and few-shot generalization on downstream tasks)
- Instruction tuning supplemented by further fine-tuning on downstream tasks.
Experiment outcome: once both go through instruction tuning, MoE models outperform their dense counterparts.
Instruction tuning produces much more realistic completions, closer to how a human would complete the tokens.
Ultimately, the goal is a scalable technique that reuses existing models without much computational overhead: finding better ways to bridge general pretraining and task-specific fine-tuning, measuring the three experiments above and their impact on MoE models, improving task-specific performance (instruction tuning), and using compute efficiently (sparse mixture of experts).
Models considered: FLAN-MoE vs FLAN-PaLM.
Model architecture: Transformer feed-forward layers are replaced with MoE layers. Each expert is itself a small feed-forward network, but only a small subset of 'activated' experts is chosen per token. A softmax-style gating function models a probability distribution over the experts, and a routing strategy efficiently distributes the input tokens across specialized experts. The interesting part: either let tokens select their top-k experts, or let experts select their top-k tokens (differentiating routing at the token vs expert level). Both keep the computation efficient in terms of GFLOPs per token prediction; a minimal routing sketch follows below.
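A minimal, self-contained sketch of token-choice top-k routing (module names and the simple dispatch loop are my own; this is not the FLAN-MoE implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sketch of a sparse MoE feed-forward layer with token-choice top-k routing."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Router: a linear layer whose softmax output is a distribution over experts.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- batch and sequence dims flattened for simplicity.
        gate_probs = F.softmax(self.router(x), dim=-1)                   # (tokens, experts)
        topk_probs, topk_idx = gate_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)   # renormalize gates

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                                      # chosen expert per token
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the selected expert's parameters touch these tokens.
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

In expert-choice routing the selection is transposed: each expert picks its top-k tokens from the batch, which naturally balances expert load at the cost of some tokens receiving no expert.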
Points to note:
- Sparsity limits computation, but the densely activated Transformer layers still have an edge in holding broader contextual awareness. Instruction fine-tuning gives MoE models a clear boost in performance, and they scale better to more tasks without requiring more experts.
- Long-range dependencies in sparse expert models.
- (Fine-tuning) Auxiliary losses such as the load-balancing loss and the router z-loss diversify the expert knowledge and help prevent overfitting (a sketch of both follows after this list). Other knobs: freezing the experts or the gating function, and experiments on hyper-parameter sensitivity.
- MoE architectures are more prone to overfitting than their dense counterparts.
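A sketch of the two regularizers mentioned above, following the standard formulations from the Switch Transformer / ST-MoE line of work (tensor names are mine, and the balancing loss is shown for top-1 dispatch for simplicity):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss that pushes the router to spread tokens evenly across experts.

    router_logits: (num_tokens, num_experts) raw router outputs
    top1_idx:      (num_tokens,) index of the expert each token was dispatched to
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert (hard assignment) ...
    dispatch_frac = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
    # ... and mean router probability per expert (soft assignment).
    prob_frac = probs.mean(dim=0)
    # Minimized when both are uniform (1 / num_experts per expert).
    return num_experts * torch.sum(dispatch_frac * prob_frac)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Penalizes large router logits to keep the gating softmax numerically stable."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()
```

These terms are added to the task loss with small coefficients, so they nudge routing behavior without dominating training.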
Evaluation is based on the different instruction tasks, and on thinking about what it means for a model to saturate as experts are added.
Philip has an amazing extended guide for instruction tuning LLaMA 2. Instruction tuning seems to play a critical role in downstream tasks, and MoE models seem to have a more positive learning curve from instruction tuning than dense models. Instruction tuning appears to make the model better at goal-directed text generation, improving the dialog nature of the raw model.