Meta's Llama 3.2 - Edge AI & Vision with Open, Customizable Models
Aditi Khare
AWS & AI Research [LLMs & Vision]-Principal Machine Learning Scientist & AI Architect | IIM-A | Author | Inference Optimization | Hyperspectral Imaging | Open-Source Dev | Build Production-Grade AI Products from Scratch
#ai #airesearch #meta #llm #genai #vision
Model evaluations -
Meta's evaluations suggest that the Llama 3.2 vision models are competitive with leading foundation models, Claude 3 Haiku and GPT-4o mini, on image recognition and a range of visual understanding tasks. The 3B model outperforms the Gemma 2 2.6B and Phi 3.5-mini models on tasks such as instruction following, summarization, prompt rewriting, and tool use, while the 1B model is competitive with Gemma.
Performance was evaluated on over 150 benchmark datasets spanning a wide range of languages. The vision LLMs were evaluated on benchmarks for image understanding and visual reasoning.
Vision Models -
As the first Llama models to support vision tasks, the 11B and 90B models required an entirely new model architecture that supports image reasoning.
To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers with a drop-in replacement for Llama 3.1 models.
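To make the adapter idea concrete, here is a minimal PyTorch sketch of a cross-attention layer that feeds image-encoder features into a frozen language model. The class, dimensions, and zero-initialized gate are illustrative assumptions, not Meta's actual implementation.

```python
# Minimal sketch of a cross-attention adapter that feeds image-encoder features
# into a frozen language model layer. Names, dimensions, and the gating scheme
# are illustrative assumptions, not Meta's actual implementation.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, lm_dim: int, vision_dim: int, num_heads: int = 8):
        super().__init__()
        # Project image-encoder outputs into the language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, lm_dim)
        # Text hidden states attend over the projected image tokens.
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(lm_dim)
        # Learnable gate initialized at zero so the adapter starts as a no-op,
        # preserving the pre-trained text-only behaviour.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        img = self.vision_proj(image_feats)
        attn_out, _ = self.cross_attn(query=self.norm(text_hidden), key=img, value=img)
        return text_hidden + torch.tanh(self.gate) * attn_out

# During adapter training the language model stays frozen, while the adapter
# (and, per the post, the image encoder) receives gradients:
# for p in language_model.parameters():
#     p.requires_grad = False
```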
The training pipeline consists of multiple stages, starting from pretrained Llama 3.1 text models. First, we add image adapters and encoders, then pretrain on large-scale noisy (image, text) pair data. Next, we train on medium-scale high quality in-domain and knowledge-enhanced (image, text) pair data.
In post-training, we use a similar recipe as the text models, doing several rounds of alignment with supervised fine-tuning, rejection sampling, and direct preference optimization. We leverage synthetic data generation by using the Llama 3.1 model to filter and augment questions and answers on top of in-domain images, and use a reward model to rank all the candidate answers to provide high-quality fine-tuning data. We also add safety mitigation data to produce a model with a high level of safety while retaining the helpfulness of the model.
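As a rough illustration of the rejection-sampling step, the sketch below scores candidate answers with a reward model and keeps only the best one; `generate_candidates`, `reward_model`, and the quality threshold are hypothetical placeholders, not real APIs from the release.

```python
# Hypothetical sketch of rejection sampling with a reward model: generate several
# candidate answers for an in-domain image + question, score them with a reward
# model, and keep only the top-ranked answer as fine-tuning data.
from typing import Callable, List, Tuple

def select_best_answer(
    question: str,
    candidates: List[str],
    reward_model: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Score each candidate answer and return the highest-reward one."""
    scored = [(answer, reward_model(question, answer)) for answer in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[0]

# Usage sketch (all names below are placeholders):
# candidates = generate_candidates(llama_3_1, image, question, n=8)
# best_answer, score = select_best_answer(question, candidates, reward_model)
# if score > QUALITY_THRESHOLD:  # keep only high-quality pairs for fine-tuning
#     sft_dataset.append({"image": image, "question": question, "answer": best_answer})
```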
The end result is a set of models that can take in both image and text prompts, and deeply understand and reason on the combination. This is another step toward Llama models having even richer agentic capabilities.
Lightweight Models -
2 Methods - Pruning & Distillation on the 1B and 3B models, making them the first highly capable lightweight Llama models that can fit on devices efficiently.
Pruning enables us to reduce the size of existing models in the Llama herd while recovering as much knowledge and performance as possible. For the 1B and 3B models, we took the approach of using structured pruning in a single-shot manner from the Llama 3.1 8B model. This involved systematically removing parts of the network and adjusting the magnitude of the weights and gradients to create a smaller, more efficient model that retains the performance of the original network.
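A toy sketch of one-shot structured pruning on a single linear layer is shown below; it ranks whole output channels by weight magnitude. Meta's actual criterion (which also uses gradient information) is not published in detail, so treat this purely as an illustration.

```python
# Illustrative sketch of one-shot structured pruning on a single feed-forward
# projection: rank whole output channels by weight magnitude and keep only the
# strongest ones. This is a toy magnitude criterion, not Meta's scoring rule.
import torch
import torch.nn as nn

def prune_linear_channels(layer: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    """Return a smaller Linear layer keeping the highest-magnitude output channels."""
    num_keep = max(1, int(layer.out_features * keep_ratio))
    # Score each output channel by the L2 norm of its weight row.
    scores = layer.weight.detach().norm(p=2, dim=1)
    keep_idx = torch.topk(scores, num_keep).indices.sort().values
    pruned = nn.Linear(layer.in_features, num_keep, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep_idx])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep_idx])
    # Note: in a real pipeline, the layer that consumes this output must have its
    # input dimension pruned consistently.
    return pruned

# Example: shrink an 8B-scale feed-forward projection to half its width.
ffn = nn.Linear(4096, 14336)
smaller_ffn = prune_linear_channels(ffn, keep_ratio=0.5)
print(smaller_ffn)  # Linear(in_features=4096, out_features=7168, bias=True)
```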
Knowledge distillation uses a larger network to impart knowledge to a smaller network, with the idea that a smaller model can achieve better performance using a teacher than it could from scratch. For the 1B and 3B models in Llama 3.2, we incorporated logits from the Llama 3.1 8B and 70B models into the pre-training stage of model development, where outputs (logits) from these larger models were used as token-level targets. Knowledge distillation was used after pruning to recover performance.
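The sketch below shows a standard token-level distillation loss of the kind described: the student matches the teacher's softened logits while still training on the ground-truth tokens. The temperature and mixing weight are illustrative, not values from the Llama 3.2 recipe.

```python
# Minimal sketch of token-level knowledge distillation: the student is trained to
# match the teacher's softened output distribution (KL term) alongside the usual
# next-token cross-entropy. Temperature and mixing weight are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,   # (batch, seq_len, vocab)
    teacher_logits: torch.Tensor,   # (batch, seq_len, vocab), e.g. from the 8B/70B teacher
    target_ids: torch.Tensor,       # (batch, seq_len) ground-truth next tokens
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    # Soft targets: KL divergence between teacher and student distributions per token.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target_ids.view(-1),
    )
    return alpha * kl + (1.0 - alpha) * ce
```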
In post-training, we use a similar recipe as Llama 3.1 and produce final chat models by doing several rounds of alignment on top of the pre-trained model. Each round involves supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO).
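For reference, the DPO objective used in these alignment rounds can be written compactly. The sketch below is the standard formulation with an illustrative beta, not Meta's exact training code.

```python
# Standard direct preference optimization (DPO) objective: push the policy's
# log-probability margin on a chosen vs. rejected response beyond the frozen
# reference model's margin. Variable names and beta are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_policy(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_policy(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```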
In post-training, we scale context-length support to 128K tokens while maintaining the same quality as the pre-trained model. We also use synthetic data generation that goes through careful data processing and filtering to ensure high quality. We carefully blend the data to optimize for high quality across multiple capabilities such as summarization, rewriting, instruction following, language reasoning, and tool use.
Llama Stack distributions -
System Level Safety -
Try Meta's multimodal vision and lightweight models in Amazon Bedrock:
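Below is a minimal boto3 sketch using the Bedrock Converse API; the model ID and region are assumptions, so confirm the exact identifier and availability in your own account before running it.

```python
# Minimal boto3 sketch for calling a Llama 3.2 vision model through the Amazon
# Bedrock Converse API. The model ID below is an assumption -- confirm the exact
# identifier (and region availability) in the Bedrock console before use.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("chart.jpg", "rb") as f:
    image_bytes = f.read()

response = bedrock.converse(
    modelId="us.meta.llama3-2-11b-instruct-v1:0",  # assumed ID; verify in your account
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": "Describe what this image shows."},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```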
References -
Reference Reading Link - https://www.llama.com/
Hugging Face Link - https://huggingface.co/meta-llama
For more information on AI research papers, you can visit my GitHub profile -
To receive the latest updates on advancements in AI research (Gen-AI, Quantum AI & Computer Vision), you can subscribe to my AI Research Papers Summaries Newsletter using the link below -
Thank you & Happy Reading !!