Mixed-Modal FM - Chances of Llama 4 | Aya 23 - Successor of Aya 101?
Raghul Gopal
Data Science at Logitech | AWS Community Builder (ML & GenAI) | Talks and Writes about AI, AGI & Cloud Deployments of AI & AGI | Public Speaker | Blogger | Unlocking Data Secrets with Math & AI
1. Chameleon – Mixed-Modal Early-Fusion Foundation Model
You may have heard the news that Chameleon could be the next Llama, say Llama 4. I would say yes, because of its mixed-modal nature of interleaving images and text. Chameleon is an early-fusion, token-based mixed-modal model capable of understanding and generating images and text in any arbitrary sequence. It handles visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon outperforms Llama 2 on text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini Pro, and it is capable of non-trivial image generation, all in a single model.
Recent multimodal foundation models are widely adopted, but they still model different modalities separately, often using modality-specific encoders and decoders. This can limit their ability to integrate information across modalities and to generate multimodal documents that contain both images and text. Mixed-modal models such as Chameleon were released to mitigate this.
Now, let’s have a look at what Chameleon offers:
1. Fully token-based representations for both the image and text modalities.
2. By quantizing images into discrete tokens, analogous to words in text, the same transformer architecture can be applied to sequences of both image and text tokens, without separate image/text encoders or domain-specific decoders.
3. This approach also presents significant technical challenges, particularly around optimization stability and scaling. The researchers solved these challenges through a combination of architectural innovations and training techniques, e.g., query-key normalization and a revised placement of layer norms.
Chameleon 34B achieves state-of-the-art performance, outperforming models like Flamingo, IDEFICS, and LLaVA-1.5. It also achieves competitive performance on text-only benchmarks, matching models like Mixtral 8x7B and Gemini Pro. Chameleon unlocks entirely new capabilities in mixed-modal reasoning and generation: in pairwise comparisons, Chameleon 34B achieves a 60.4% preference rate against Gemini Pro and a 51.4% preference rate against GPT-4V.
Let’s see some architectural features of Chameleon 34B below.
Tokenization – For image tokenization, a new image tokenizer is used that encodes a 512 x 512 image into 1024 discrete tokens drawn from a codebook of size 8192. Only licensed images were used to train it. A weakness of this method is reconstructing images that contain a large amount of text, which matters for heavy OCR-related tasks.
BPE Tokenizer – A BPE tokenizer is trained over a subset of the training data outlined below, with a vocabulary size of 65,536 that includes the 8192 image codebook tokens, using the SentencePiece library.
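To make the shared-vocabulary idea concrete, here is a minimal sketch in plain PyTorch (not the authors' code) of how the 1024 image codes per 512 x 512 image could be folded into the 65,536-token vocabulary alongside the text tokens. The exact offset split and the helper name are assumptions for illustration only.

import torch

IMAGE_CODEBOOK_SIZE = 8_192                          # image codebook size from the paper
VOCAB_SIZE = 65_536                                  # full vocabulary size from the paper
TEXT_VOCAB_SIZE = VOCAB_SIZE - IMAGE_CODEBOOK_SIZE   # assumed split for the text sub-vocabulary

def to_mixed_sequence(text_ids: list[int], image_codes: list[int]) -> torch.Tensor:
    # Shift image codebook indices into their own slice of the shared vocabulary,
    # then concatenate with text ids so a single transformer sees one flat sequence.
    image_ids = [TEXT_VOCAB_SIZE + code for code in image_codes]   # 1024 codes per image
    return torch.tensor(text_ids + image_ids, dtype=torch.long)

With a BPE-tokenized caption and a quantized image, the result is one token sequence that the same decoder models end to end.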
Data Pre-training – Pretraining is divided into two stages, with roughly 80% of the training data going to stage 1 and 20% to stage 2. For all text-image pairs, the researchers rotate the ordering so that the image comes before the text 50% of the time. In stage 1, a combination of the pretraining data used for Llama 2 and CodeLlama is used for text; for images, publicly available and licensed text-image data sources are used, with images resized and cropped to 512 x 512 for tokenization.
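As a rough illustration of that ordering rule (my own sketch, not the released data pipeline), the 50/50 rotation of each text-image pair can be as simple as:

import random

def order_pair(text_tokens: list[int], image_tokens: list[int], p_image_first: float = 0.5) -> list[int]:
    # Put the image before the text half of the time; otherwise keep text first.
    if random.random() < p_image_first:
        return image_tokens + text_tokens
    return text_tokens + image_tokens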
Architecture of Chameleon – It largely follows the Llama 2 architecture: RMSNorm for normalization, SwiGLU as the activation function, and rotary positional embeddings (RoPE). However, the standard Llama architecture showed complex divergences due to slow norm growth in the mid-to-late stages of training. The researchers narrowed the problem down to the softmax operation.
This happens because all weights are shared across modalities, so each modality effectively competes with the others by slightly increasing its norm. This is not a big problem at first, but training diverges once activations move outside the effective range of bfloat16. The softmax operation appears in two places in a transformer: in the core attention mechanism, and in the softmax over the final logits.
Chameleon first deviates from the Llama architecture by using QK-Norm (query-key normalization). QK-Norm controls the norm growth of the input to the softmax by applying layer norm to the query and key vectors inside the attention mechanism. To stabilize Chameleon 7B by controlling norm growth, it is also necessary to introduce dropout after the attention and feed-forward layers, in addition to QK-Norm. This recipe was not used to stabilize the 34B model; instead, a Swin-Transformer-style normalization re-ordering is used, whose benefit is that it bounds the norm growth of the feed-forward block, which can otherwise become problematic.
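For intuition, here is a minimal sketch of QK-Norm inside a causal attention layer, written in plain PyTorch. The module names, the per-head LayerNorm, and the omission of RoPE are simplifications of mine, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        # Normalizing queries and keys per head bounds the input to the attention
        # softmax, which is what keeps norms inside the effective bfloat16 range.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim)
        k = self.wk(x).view(b, s, self.n_heads, self.head_dim)
        v = self.wv(x).view(b, s, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)              # QK-Norm (RoPE omitted here)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))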
In equations, the two block orderings are:
H = x + attention_norm(attention(x))
Output = H + ffn_norm(feed_forward(H))
H = x + attention(attention_norm(x))
Output = H + feed_forward(ffn_norm(H))
The first two equations correspond to Chameleon 34B (with norm re-ordering), and the last two correspond to the Llama 2 model.
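Rewriting those two orderings as forward passes makes the difference easier to see. This is only a sketch: attention, feed_forward, and the two norms stand in for the actual sub-modules.

# Llama 2 style pre-norm block: normalize the *input* of each sub-block.
def llama2_block(x, attention, feed_forward, attention_norm, ffn_norm):
    h = x + attention(attention_norm(x))
    return h + feed_forward(ffn_norm(h))

# Chameleon 34B norm re-ordering: normalize the *output* of each sub-block,
# which bounds the norm of what gets added back to the residual stream.
def chameleon_34b_block(x, attention, feed_forward, attention_norm, ffn_norm):
    h = x + attention_norm(attention(x))
    return h + ffn_norm(feed_forward(h))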
Note: For Chameleon 34B, QK-Norm is needed without dropout. For Chameleon 7B, QK-Norm is needed together with dropout; the model can also be trained without dropout when norm re-ordering is used.
AdamW is used as the optimizer with β1 = 0.9, β2 = 0.95, and ε = 10^-5.
Note: While QK-Norm helps the inner softmax within the transformer, it does not solve the problem of logit drift in the final softmax over the vocabulary. To mitigate this, z-loss regularization is applied. It is important to note that Chameleon 7B uses both dropout and z-loss for stability, whereas Chameleon 34B uses only z-loss regularization.
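A minimal sketch of z-loss on top of the usual cross-entropy (the coefficient here is a placeholder of mine; the idea is to penalize the squared log of the partition function of the final softmax so the logits cannot drift):

import torch
import torch.nn.functional as F

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor, z_coef: float = 1e-5) -> torch.Tensor:
    # Standard next-token cross-entropy over the vocabulary.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # z-loss: log of the partition function, squared, keeps the logit scale bounded.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coef * (log_z ** 2).mean()

Paired with the AdamW settings above, this is the kind of recipe the paper describes for keeping the final softmax stable.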
Supervised fine-tuning covers the categories Text, Code, Visual Chat, Image Generation, Interleaved Text/Image Generation, and Safety. The safety data pairs prompts that could provoke the model into producing unsafe content with a refusal response such as "I can't help with that," covering sensitive topics such as violence, controlled substances, privacy, and sexual content.
Access the paper using this link: https://arxiv.org/abs/2405.09818
2. Aya 23 – A Family of Multilingual Language Models by Cohere For AI
Aya 23 covers 23 languages. Aya 101 covered 101 languages, whereas Aya 23 is an experiment in depth versus breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101, for the languages it covers, and widely used models like Gemma, Mistral, and Mixtral.
Aya 23 pairs the Cohere Command family of pretrained models with the recently released Aya collection of multilingual instruction-style data. Two major hurdles in the development of powerful multilingual models are:
· The lack of robust multilingual pretrained models
· The scarcity of instruction-style training data covering a diverse set of languages
Aya 101 was released to mitigate these hurdles, but it has limitations of its own, namely outdated knowledge and inadequate performance. Furthermore, Aya 101 is a 13-billion-parameter model designed for breadth, covering 101 languages, nearly double the coverage achieved by previous models.
Aya 23 is available in two model sizes, 8 billion and 35 billion parameters. Note that Aya 23 35B is based on the Cohere Command R model.
Let’s take a look at the architecture.
· It is a standard decoder-only transformer.
· It has parallel attention and FFN layers. The parallel block is similar to the PaLM 2 parallel block architecture, which leads to a significant improvement in training efficiency without hurting model quality, especially in tensor-parallel (TP) settings (see the sketch after this list).
· Uses the SwiGLU activation function.
· It has no biases: similar to PaLM 2, all biases are removed from the dense layers to improve training stability.
· Uses rotary positional embeddings (RoPE).
· Uses a BPE tokenizer with a vocabulary size of 256K, with NFC normalization and digits split into individual tokens.
· Uses grouped-query attention (GQA).
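As referenced in the parallel-attention bullet above, here is a minimal sketch of a parallel transformer block, where the attention and FFN branches read the same normalized input and both outputs are added to the residual. The sub-modules and the choice of LayerNorm here are placeholders of mine, not Cohere's implementation.

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)     # normalization placeholder (Aya 23 drops biases from dense layers)
        self.attention = attention        # placeholder, e.g. a GQA attention module
        self.feed_forward = feed_forward  # placeholder, e.g. a SwiGLU FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # A sequential block chains attention then FFN; the parallel block computes
        # both branches from the same normalized input, which shards better under TP.
        return x + self.attention(h) + self.feed_forward(h)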
All base models are trained using FAX, a JAX-based distributed training framework, on TPU v4 chips.
Let’s take a look at the Data Mixtures in Aya 23.
· Multilingual templates – the xP3x dataset, the Data Provenance Collection, and the Aya Collection
· Human annotations
· Translated data – translations of HotpotQA and the Flan-CoT submix
· Synthetic data – from ShareGPT and Dolly 15K
For instruction finetuning, the base models are finetuned for 13,200 update steps with an 8192-token context length. The Adam optimizer with a cosine learning-rate schedule is used, with a peak LR of 6 x 10^-4, an end LR of 6 x 10^-5, and a batch size of 64. All training runs use TPU v4 with up to 128 pod slices.
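For reference, a cosine schedule with that peak and end learning rate could be sketched as follows (my own helper, not their training code; the optional warmup is an assumption):

import math

def cosine_lr(step: int, total_steps: int = 13_200, peak_lr: float = 6e-4,
              end_lr: float = 6e-5, warmup_steps: int = 0) -> float:
    # Linear warmup (if any), then cosine decay from peak_lr down to end_lr.
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))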
Let’s take a look at how the multilingual evaluation was done. The multilingual evaluation techniques are the same as those used for Aya 101, with the eval harness used as an additional evaluation framework.
1. Completely unseen discriminative tasks – XWinograd, XCOPA, and XStoryCloze
2. General-purpose language understanding – multilingual MMLU
3. Multilingual mathematical reasoning – Multilingual Grade School Math (MGSM)
4. Generative tasks – machine translation and summarization on FLORES-200 and XLSum
5. Preference evaluation – open-ended generation capabilities of the model, assessed through human and LLM-simulated evaluation on the Dolly machine-translated test set and the Dolly human-edited test set
6. Safety, toxicity, and bias – evaluating the safety of model generations under adversarial prompts from multilingual AdvBench
GPT-4 is used as an automatic evaluator of harmfulness on 120 test prompts, while toxicity and bias are measured using the Perspective API.
Access the paper using this link: https://arxiv.org/abs/2405.15032