Sparse Mixture of Experts + Instruction Tuning
Sangeetha Venkatesan
NLP Engineer | Information Retrieval | Insurance domain | RAG
Mixture of experts and instruction tuning are two fascinating topics adding tons of value to the LLM landscape: designing scalable, high-performance, compute-efficient models in a task-agnostic learning setting.
There is also a pattern of model distillation: expert models distilled into smaller models, so the simplified model acquires the wisdom of the expert model.
Sparse expert models: adding learnable parameters to an LLM without increasing inference cost, since only a subset of parameters acts on each example. Only the relevant experts in a layer are activated, which is what gives the efficiency and scaling benefits.
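As a rough back-of-the-envelope sketch of "more parameters, same per-token compute" (the layer sizes below are purely illustrative, not taken from the paper):

```python
# Illustrative numbers only: a hypothetical FFN sub-layer with d_model=4096, d_ff=16384.
d_model, d_ff = 4096, 16384
ffn_params = 2 * d_model * d_ff              # dense FFN weights (up + down projection)

num_experts, top_k = 8, 2                    # 8 experts, each token routed to 2 of them
moe_total_params = num_experts * ffn_params  # learnable parameters grow 8x ...
moe_active_params = top_k * ffn_params       # ... but per-token compute only grows 2x

print(f"dense FFN params per layer:  {ffn_params / 1e6:.0f}M")
print(f"MoE total params per layer:  {moe_total_params / 1e6:.0f}M")
print(f"MoE active params per token: {moe_active_params / 1e6:.0f}M")
```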
Instruction tuning: making models follow instructions. Expert models benefit from instruction tuning more than their raw dense counterparts.
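For intuition, instruction tuning just means training on examples cast into an instruction/response format. A minimal sketch of what such a formatted example might look like (the template and field names are hypothetical, not the exact FLAN templates):

```python
# Hypothetical instruction template; FLAN-style mixtures use many templates per task.
def format_instruction(instruction: str, input_text: str, target: str) -> dict:
    """Turn a raw (input, target) pair into an instruction-following training example."""
    prompt = (
        f"{instruction}\n\n"
        f"Input: {input_text}\n"
        f"Answer:"
    )
    return {"prompt": prompt, "target": f" {target}"}

example = format_instruction(
    instruction="Classify the sentiment of the review as positive or negative.",
    input_text="The claims process was quick and painless.",
    target="positive",
)
```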
Overall, three experiments:
- Direct fine-tuning on a downstream task without instruction tuning (task-specific fine-tuned mixture-of-experts models)
- Instruction fine-tuning followed by in-context evaluation (zero- and few-shot generalization on downstream tasks)
- Instruction tuning supplemented by further fine-tuning on downstream tasks.
Experiment outcome: once both go through instruction tuning, MoE models outperform their dense counterparts.
Instruction tuning produces much more realistic completions, closer to how a human would complete the tokens.
Ultimately, the goal is a scalable technique that reuses existing models without much computational overhead: finding better ways to bridge general pretraining and task-specific fine-tuning, measuring the three experiments above and their impact on MoE models, improving task-specific performance (instruction tuning), and using compute efficiently (sparse mixture of experts).
Models considered: FLAN-MoE vs FLAN-PaLM.
Model architecture: Transformer feed-forward layers are replaced with MoE layers. Each expert is itself a small feed-forward network, but only a small subset of 'activated' experts is chosen per token. A softmax-style gating function models a probability distribution over the experts, and a routing strategy efficiently distributes the input tokens across specialized experts. The interesting part: either let tokens select their top-k experts, or let experts select their top-k tokens (differentiating routing at the token vs expert level). Both keep the computation efficient in terms of GFLOPs per token prediction; a minimal routing sketch follows below.
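A minimal, self-contained sketch of token-choice top-k routing (module names and the simple dispatch loop are my own; this is not the FLAN-MoE implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sketch of a sparse MoE feed-forward layer with token-choice top-k routing."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Router: a linear layer whose softmax output is a distribution over experts.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- batch and sequence dims flattened for simplicity.
        gate_probs = F.softmax(self.router(x), dim=-1)                   # (tokens, experts)
        topk_probs, topk_idx = gate_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)   # renormalize gates

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                                      # chosen expert per token
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the selected expert's parameters touch these tokens.
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

In expert-choice routing the selection is transposed: each expert picks its top-k tokens from the batch, which naturally balances expert load at the cost of some tokens receiving no expert.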
Points to note:
- Sparsity limits computation, but the densely activated Transformer layers still have an edge in holding broader contextual awareness. Instruction fine-tuning gives MoE models a clear boost in performance, and they scale better to more tasks without requiring more experts.
- Long-range dependencies in sparse expert models.
- (Fine-tuning) Auxiliary losses such as the load-balancing loss and the router z-loss diversify the expert knowledge and help prevent overfitting (a sketch of both follows after this list). Other knobs: freezing the experts or the gating function, and experiments on hyper-parameter sensitivity.
- MoE architectures are more prone to overfitting than their dense counterparts.
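A sketch of the two regularizers mentioned above, following the standard formulations from the Switch Transformer / ST-MoE line of work (tensor names are mine, and the balancing loss is shown for top-1 dispatch for simplicity):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss that pushes the router to spread tokens evenly across experts.

    router_logits: (num_tokens, num_experts) raw router outputs
    top1_idx:      (num_tokens,) index of the expert each token was dispatched to
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert (hard assignment) ...
    dispatch_frac = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
    # ... and mean router probability per expert (soft assignment).
    prob_frac = probs.mean(dim=0)
    # Minimized when both are uniform (1 / num_experts per expert).
    return num_experts * torch.sum(dispatch_frac * prob_frac)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Penalizes large router logits to keep the gating softmax numerically stable."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()
```

These terms are added to the task loss with small coefficients, so they nudge routing behavior without dominating training.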
Evaluation is based on the different instruction tasks, and on thinking about what it means for a model to saturate as experts are added.
Philip has an amazing extended guide for instruction tuning LLaMA 2. Instruction tuning seems to play a critical role in downstream tasks, and MoE models seem to have a more positive learning curve from instruction tuning than dense models. Instruction tuning appears to make the model better at goal-directed text generation, improving the dialog nature of the raw model.