Mixed-Modal FM - Chances of Llama 4 | Aya 23 - Successor of Aya 101?
Raghul Gopal
Data Science at Logitech | AWS Community Builder (ML & GenAI) | Talks and Writes about AI, AGI & Cloud Deployments of AI & AGI | Public Speaker | Blogger | Unlocking Data Secrets with Math & AI
1. Chameleon – Mixed-Modal Early-Fusion Foundation Model
You may have heard the news that Chameleon could be the next Llama, say Llama 4. I would say yes, because of its mixed-modal nature of interleaving images and text. Chameleon is an early-fusion, token-based mixed-modal model capable of understanding and generating images and text in any arbitrary sequence. It handles visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon outperforms Llama 2 on text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini Pro, and it is capable of non-trivial image generation, all in a single model.
Recent multimodal foundation models are widely adopted, but they still model different modalities separately, often using modality-specific encoders and decoders. This can limit their ability to integrate information across modalities and to generate multimodal documents that contain both images and text. Mixed-modal models such as Chameleon were released to mitigate this.
Now, let’s have a look at what Chameleon offers:
1. Fully token-based representations for both the image and text modalities.
2. By quantizing images into discrete tokens, analogous to words in text, the same transformer architecture can be applied to sequences of both image and text tokens, without separate image/text encoders or domain-specific decoders.
3. This approach also presents significant technical challenges, particularly around optimization stability and scaling. The researchers solved these challenges through a combination of architectural innovations and training techniques, e.g., query-key normalization and a revised placement of layer norms.
Chameleon 34B achieves state-of-the-art performance, outperforming models like Flamingo, IDEFICS, and LLaVA-1.5. It also achieves competitive performance on text-only benchmarks, matching models like Mixtral 8x7B and Gemini Pro. Chameleon unlocks entirely new capabilities in mixed-modal reasoning and generation: in pairwise comparisons, Chameleon 34B achieves a 60.4% preference rate against Gemini Pro and a 51.4% preference rate against GPT-4V.
Let’s see some architectural features of Chameleon 34B below.
Tokenization – For image tokenization, a new image tokenizer is used that encodes a 512 x 512 image into 1024 discrete tokens drawn from a codebook of size 8192. Only licensed images were used to train it. A weakness of this method is reconstructing images that contain a large amount of text, which matters for heavy OCR-related tasks.
BPE Tokenizer – A BPE tokenizer is trained over a subset of the training data outlined below, with a vocabulary size of 65,536 that includes the 8192 image codebook tokens, using the SentencePiece library.
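To make the shared-vocabulary idea concrete, here is a minimal sketch in plain PyTorch (not the authors' code) of how the 1024 image codes per 512 x 512 image could be folded into the 65,536-token vocabulary alongside the text tokens. The exact offset split and the helper name are assumptions for illustration only.

import torch

IMAGE_CODEBOOK_SIZE = 8_192                          # image codebook size from the paper
VOCAB_SIZE = 65_536                                  # full vocabulary size from the paper
TEXT_VOCAB_SIZE = VOCAB_SIZE - IMAGE_CODEBOOK_SIZE   # assumed split for the text sub-vocabulary

def to_mixed_sequence(text_ids: list[int], image_codes: list[int]) -> torch.Tensor:
    # Shift image codebook indices into their own slice of the shared vocabulary,
    # then concatenate with text ids so a single transformer sees one flat sequence.
    image_ids = [TEXT_VOCAB_SIZE + code for code in image_codes]   # 1024 codes per image
    return torch.tensor(text_ids + image_ids, dtype=torch.long)

With a BPE-tokenized caption and a quantized image, the result is one token sequence that the same decoder models end to end.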
Data Pre-training – Pretraining is divided into two stages, with roughly 80% of the training data going to stage 1 and 20% to stage 2. For all text-image pairs, the researchers rotate the ordering so that the image comes before the text 50% of the time. In stage 1, a combination of the pretraining data used for Llama 2 and CodeLlama is used for text; for images, publicly available and licensed text-image data sources are used, with images resized and cropped to 512 x 512 for tokenization.
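As a rough illustration of that ordering rule (my own sketch, not the released data pipeline), the 50/50 rotation of each text-image pair can be as simple as:

import random

def order_pair(text_tokens: list[int], image_tokens: list[int], p_image_first: float = 0.5) -> list[int]:
    # Put the image before the text half of the time; otherwise keep text first.
    if random.random() < p_image_first:
        return image_tokens + text_tokens
    return text_tokens + image_tokens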
Architecture of Chameleon – It largely follows the Llama 2 architecture: RMSNorm for normalization, SwiGLU as the activation function, and rotary positional embeddings (RoPE). However, the standard Llama architecture showed complex divergences due to slow norm growth in the mid-to-late stages of training. The researchers narrowed the problem down to the softmax operation.
This happens because all weights are shared across modalities, so each modality effectively competes with the others by slightly increasing its norm. This is not a big problem at first, but training diverges once activations move outside the effective range of bfloat16. The softmax operation appears in two places in a transformer: in the core attention mechanism, and in the softmax over the final logits.
Chameleon first deviates from the Llama architecture by using QK-Norm (query-key normalization). QK-Norm controls the norm growth of the input to the softmax by applying layer norm to the query and key vectors inside the attention mechanism. To stabilize Chameleon 7B by controlling norm growth, it is also necessary to introduce dropout after the attention and feed-forward layers, in addition to QK-Norm. This recipe was not used to stabilize the 34B model; instead, a Swin-Transformer-style normalization re-ordering is used, whose benefit is that it bounds the norm growth of the feed-forward block, which can otherwise become problematic.
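For intuition, here is a minimal sketch of QK-Norm inside a causal attention layer, written in plain PyTorch. The module names, the per-head LayerNorm, and the omission of RoPE are simplifications of mine, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        # Normalizing queries and keys per head bounds the input to the attention
        # softmax, which is what keeps norms inside the effective bfloat16 range.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim)
        k = self.wk(x).view(b, s, self.n_heads, self.head_dim)
        v = self.wv(x).view(b, s, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)              # QK-Norm (RoPE omitted here)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))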
In equations, the two block orderings are:
H = x + attention_norm(attention(x))
Output = H + ffn_norm(feed_forward(H))
H = x + attention(attention_norm(x))
Output = H + feed_forward(ffn_norm(H))
The first two equations correspond to Chameleon 34B (with norm re-ordering), and the last two correspond to the Llama 2 model.
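Rewriting those two orderings as forward passes makes the difference easier to see. This is only a sketch: attention, feed_forward, and the two norms stand in for the actual sub-modules.

# Llama 2 style pre-norm block: normalize the *input* of each sub-block.
def llama2_block(x, attention, feed_forward, attention_norm, ffn_norm):
    h = x + attention(attention_norm(x))
    return h + feed_forward(ffn_norm(h))

# Chameleon 34B norm re-ordering: normalize the *output* of each sub-block,
# which bounds the norm of what gets added back to the residual stream.
def chameleon_34b_block(x, attention, feed_forward, attention_norm, ffn_norm):
    h = x + attention_norm(attention(x))
    return h + ffn_norm(feed_forward(h))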
Note: For Chameleon 34B, QK-Norm is needed without dropout. For Chameleon 7B, QK-Norm is needed together with dropout; the model can also be trained without dropout when norm re-ordering is used.
AdamW is used as the optimizer with β1 = 0.9, β2 = 0.95, and ε = 10^-5.
Note: While QK-Norm helps the inner softmax within the transformer, it does not solve the problem of logit drift in the final softmax over the vocabulary. To mitigate this, z-loss regularization is applied. It is important to note that Chameleon 7B uses both dropout and z-loss for stability, whereas Chameleon 34B uses only z-loss regularization.
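A minimal sketch of z-loss on top of the usual cross-entropy (the coefficient here is a placeholder of mine; the idea is to penalize the squared log of the partition function of the final softmax so the logits cannot drift):

import torch
import torch.nn.functional as F

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor, z_coef: float = 1e-5) -> torch.Tensor:
    # Standard next-token cross-entropy over the vocabulary.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # z-loss: log of the partition function, squared, keeps the logit scale bounded.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coef * (log_z ** 2).mean()

Paired with the AdamW settings above, this is the kind of recipe the paper describes for keeping the final softmax stable.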
Supervised fine-tuning covers the categories Text, Code, Visual Chat, Image Generation, Interleaved Text/Image Generation, and Safety. The safety data pairs prompts that could provoke the model into producing unsafe content with a refusal response such as "I can't help with that," covering sensitive topics such as violence, controlled substances, privacy, and sexual content.
Access the paper using this link: https://arxiv.org/abs/2405.09818
2. Aya 23 – A Family of Multilingual Language Models by Cohere For AI
Aya 23 covers 23 languages. Aya 101 covered 101 languages, whereas Aya 23 is an experiment in depth versus breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101, for the languages it covers, and widely used models like Gemma, Mistral, and Mixtral.
Aya 23 pairs the Cohere Command family of pretrained models with the recently released Aya collection of multilingual instruction-style data. Two major hurdles in the development of powerful multilingual models are:
· The lack of robust multilingual pretrained models
· The scarcity of instruction-style training data covering a diverse set of languages
Aya 101 was released to mitigate these hurdles, but it has limitations of its own, namely outdated knowledge and inadequate performance. Furthermore, Aya 101 is a 13-billion-parameter model designed for breadth, covering 101 languages, nearly double the coverage achieved by previous models.
Aya 23 is available in two model sizes, 8 billion and 35 billion parameters. Note that Aya 23 35B is based on the Cohere Command R model.
Let’s take a look at the architecture.
· It is a standard decoder-only transformer.
· It has parallel attention and FFN layers. The parallel block is similar to the PaLM 2 parallel block architecture, which leads to a significant improvement in training efficiency without hurting model quality, especially in tensor-parallel (TP) settings (see the sketch after this list).
· Uses the SwiGLU activation function.
· It has no biases: similar to PaLM 2, all biases are removed from the dense layers to improve training stability.
· Uses rotary positional embeddings (RoPE).
· Uses a BPE tokenizer with a vocabulary size of 256K, with NFC normalization and digits split into individual tokens.
· Uses grouped-query attention (GQA).
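As referenced in the parallel-attention bullet above, here is a minimal sketch of a parallel transformer block, where the attention and FFN branches read the same normalized input and both outputs are added to the residual. The sub-modules and the choice of LayerNorm here are placeholders of mine, not Cohere's implementation.

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)     # normalization placeholder (Aya 23 drops biases from dense layers)
        self.attention = attention        # placeholder, e.g. a GQA attention module
        self.feed_forward = feed_forward  # placeholder, e.g. a SwiGLU FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # A sequential block chains attention then FFN; the parallel block computes
        # both branches from the same normalized input, which shards better under TP.
        return x + self.attention(h) + self.feed_forward(h)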
All base models are trained using FAX, a JAX-based distributed training framework, on TPU v4 chips.
Let’s take a look at the Data Mixtures in Aya 23.
· Multilingual templates – the xP3x dataset, the Data Provenance Collection, and the Aya Collection
· Human annotations
· Translated data – translations of HotpotQA and the Flan-CoT submix
· Synthetic data – from ShareGPT and Dolly 15K
For instruction finetuning, the base models are finetuned for 13,200 update steps with an 8192-token context length. The Adam optimizer with a cosine learning-rate schedule is used, with a peak LR of 6 x 10^-4, an end LR of 6 x 10^-5, and a batch size of 64. All training runs use TPU v4 with up to 128 pod slices.
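For reference, a cosine schedule with that peak and end learning rate could be sketched as follows (my own helper, not their training code; the optional warmup is an assumption):

import math

def cosine_lr(step: int, total_steps: int = 13_200, peak_lr: float = 6e-4,
              end_lr: float = 6e-5, warmup_steps: int = 0) -> float:
    # Linear warmup (if any), then cosine decay from peak_lr down to end_lr.
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))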
Let’s take a look at how the multilingual evaluation was done. The multilingual evaluation techniques are the same as those used for Aya 101, with the eval harness used as an additional evaluation framework.
1. Completely unseen discriminative tasks – XWinograd, XCOPA, and XStoryCloze
2. General-purpose language understanding – multilingual MMLU
3. Multilingual mathematical reasoning – Multilingual Grade School Math (MGSM)
4. Generative tasks – machine translation and summarization on FLORES-200 and XLSum
5. Preference evaluation – open-ended generation capabilities of the model, assessed through human and LLM-simulated evaluation on the Dolly machine-translated test set and the Dolly human-edited test set
6. Safety, toxicity, and bias – evaluating the safety of model generations under adversarial prompts from multilingual AdvBench
GPT-4 is used as an automatic evaluator of harmfulness on 120 test prompts, while toxicity and bias are measured using the Perspective API.
Access the paper using this link: https://arxiv.org/abs/2405.15032