Sparse Mixture of Experts + Instruction Tuning
Credits: Paper - https://arxiv.org/abs/2305.14705

Mixture of Experts and instruction tuning - two fascinating topics adding tons of value to the LLM landscape: designing scalable, high-performance, compute-efficient models in a task-agnostic learning setting.

There is also a pattern of model distillation - expert models distilled into smaller models, so that the simplified model acquires the wisdom of the expert model.

Sparse expert models - adding learnable parameters to an LLM without increasing inference cost: only a subset of parameters acts on each example, because only the relevant expert in a layer is activated. This is the key to efficiency and scaling.
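A quick back-of-the-envelope sketch of why this matters; the dimensions, expert count, and top-k below are hypothetical, chosen only to illustrate the gap between stored and per-token-active parameters:

```python
# Illustrative arithmetic for one sparse MoE layer: total vs. per-token "active" parameters.
# All numbers are hypothetical, not taken from the FLAN-MoE paper.
d_model, d_ff = 4096, 16384       # hidden size and expert feed-forward width
num_experts, top_k = 64, 2        # experts in the layer, experts activated per token

params_per_expert = 2 * d_model * d_ff           # up-projection + down-projection weights
total_params = num_experts * params_per_expert   # learnable parameters stored in the layer
active_params = top_k * params_per_expert        # parameters actually touched per token

print(f"stored: {total_params / 1e9:.2f}B parameters")   # ~8.59B
print(f"active: {active_params / 1e9:.2f}B per token")   # ~0.27B
```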

Instruction tuning: making models follow instructions. Expert models benefit from instruction tuning more than raw dense models do.
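For concreteness, a minimal sketch of what a single instruction-tuning training example might look like (the field names and prompt template are illustrative, not the exact FLAN format):

```python
# Hypothetical instruction-tuning example; the template below is illustrative only.
example = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Sparse Mixture-of-Experts models activate only a few experts per token ...",
    "output": "Sparse MoE models scale parameters while keeping per-token compute low.",
}

# The model is fine-tuned to generate `target` given `prompt`.
prompt = f"{example['instruction']}\n\n{example['input']}\n\nAnswer:"
target = example["output"]
print(prompt, target, sep="\n")
```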


Overall, three experiments:

  1. Direct fine-tuning on a downstream task without instruction tuning (task-specific fine-tuned mixture-of-experts models)
  2. Instruction fine-tuning followed by in-context evaluation (zero-shot and few-shot generalization on downstream tasks)
  3. Instruction tuning supplemented by further fine-tuning on downstream tasks


Experiment outcome: with instruction tuning, MoE models outperform their dense counterparts.

Instruction tuning yields much more realistic completions - closer to how humans would complete the tokens.

Ultimately, the goal is a scalable technique that uses existing models without much computational overhead: finding better ways to bridge general pretraining and task-specific fine-tuning, measuring the three experiments above and their impact on MoE models, improving task-specific performance (instruction tuning), and using compute efficiently (sparse mixture of experts).

Models compared: FLAN-MoE vs. FLAN-PaLM.

Model architecture: feed-forward sublayers of the Transformer are replaced with MoE layers. Each expert is a small feed-forward network, but only a small subset of 'activated' experts is used per token. A softmax gating function models a probability distribution over all the experts, and a routing strategy efficiently distributes the input tokens across specialized experts. This is interesting -> either let tokens select their top-k experts (token-choice routing) or let experts select their top-k tokens (expert-choice routing). Differentiating at the expert vs. routing level keeps the computation efficient (roughly constant GFLOPs per token prediction).
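To make the gating concrete, here is a minimal sketch of a token-choice top-k MoE layer in PyTorch; the dimensions, the number of experts, and the ReLU feed-forward experts are my own illustrative choices, not the FLAN-MoE configuration:

```python
# Minimal token-choice top-k MoE layer (sketch, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating function
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        logits = self.router(x)                # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)      # probability over all experts
        top_p, top_idx = probs.topk(self.top_k, dim=-1)  # each token picks its top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += top_p[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# usage: only 2 of the 8 experts run for each token
tokens = torch.randn(16, 512)
y = TopKMoELayer()(tokens)
```

Expert-choice routing would transpose the selection: each expert picks the top-k tokens it is most confident about, which keeps the load across experts balanced by construction.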


Points to note:

  1. Sparse activation limits computation, but the densely activated Transformer layers still have an edge in holding broader contextual awareness. Instruction fine-tuning gives MoE models a clear boost in performance, and it scales better across more tasks without requiring more experts.
  2. Long-range dependencies in sparse expert models.
  3. (Fine-tuning) Auxiliary losses - the load-balancing loss and the router z-loss - help diversify expert knowledge and prevent overfitting. There are also expert/gating freeze strategies and experiments on hyper-parameter sensitivity (see the sketch after this list).
  4. MoE architectures are more prone to overfitting than their dense counterparts.
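The auxiliary losses from point 3, sketched in the spirit of the Switch Transformer / ST-MoE formulations; the coefficients and exact definitions used in FLAN-MoE may differ:

```python
# Sketch of MoE auxiliary losses; coefficients below are illustrative.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Encourages tokens to be spread evenly across experts.
    router_logits: (num_tokens, num_experts)."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # fraction of tokens whose top-1 choice is each expert
    top1 = probs.argmax(dim=-1)
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # mean router probability assigned to each expert
    prob_frac = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * prob_frac)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Penalizes large router logits, keeping the gating numerically stable."""
    return torch.mean(torch.logsumexp(router_logits, dim=-1) ** 2)

# usage: add these (with small coefficients) to the main language-modeling loss
logits = torch.randn(16, 8)
aux = 0.01 * load_balancing_loss(logits) + 0.001 * router_z_loss(logits)
```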

Evaluation is based on the different instruction tasks. It is worth thinking about what it means for a model to saturate with experts.

Philip has an amazing extended guide on instruction tuning Llama 2. Instruction tuning seems to play a critical role in downstream tasks, and MoE models seem to have a more positive learning curve from instruction tuning than dense models. Instruction tuning makes the model better at goal-directed text generation, improving the dialog nature of the raw model.
