Learn the First Multimodal LLM Without Trouble, and a Perfect Medical LLM for Medicinal Research
Raghul Gopal
Data Science at Logitech | AWS Community Builder (ML & GenAI) | Talks and Writes about AI, AGI & Cloud Deployments of AI & AGI | Public Speaker | Blogger | Unlocking Data Secrets with Math & AI
Hello All,
This is Raghul Gopal, an AWS Community Builder (ML & GenAI) and an enthusiastic AI & AGI researcher. Welcome to the Learn with Me Newsletter, Week 1, where I focus on recent advances in Generative AI.
1. Mirasol 3B (Multimodal Autoregressive Model for Time-Aligned Video and Audio and Non-Time-Aligned Context)
In the current era of Large Language Models, a key challenge is combining multiple heterogeneous modalities such as audio, video, and text. Video and audio arrive at much higher rates than text and are roughly aligned with each other in time, whereas the text is not synchronized with them and usually serves as global context, e.g., a title or description.
Moreover, the volume of video and audio input grows with the length of the video, demanding more compute for these modalities. To resolve these issues, the researchers introduced Mirasol 3B (paper: https://arxiv.org/abs/2311.05698), which has two components: an autoregressive model for the time-aligned modalities (audio and video) and an autoregressive model for the non-time-aligned context (text).
To address long video/audio sequences, the audio/video input is partitioned into consecutive snippets whose representations are processed autoregressively. One of the finest pieces of Mirasol 3B is the Combiner mechanism, used in the autoregressive component for time-synchronized modalities: it fuses the audio and video information jointly into compact but expressive representations. These features let the model take 512 frames as input without an increase in the model's parameter count.
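To make the idea concrete, here is a minimal sketch (my own illustration, not the authors' code) of splitting a long audio/video feature stream into snippets and compressing each one with a learned combiner; the names SnippetCombiner and combine_snippets, and all dimensions, are hypothetical:

```python
import torch
import torch.nn as nn

class SnippetCombiner(nn.Module):
    """Toy combiner: compress one snippet's fused audio+video features
    into a small, fixed number of latent tokens."""
    def __init__(self, dim=512, num_latents=32, num_layers=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, snippet_feats):                 # (B, long_seq, dim)
        B = snippet_feats.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        # A few learned latents cross-attend to the much longer snippet sequence
        return self.decoder(latents, snippet_feats)   # (B, num_latents, dim)

def combine_snippets(features, snippet_len, combiner):
    """Split a long video/audio feature stream (B, T, dim) into consecutive
    snippets and compress each one independently."""
    chunks = features.split(snippet_len, dim=1)
    return torch.stack([combiner(c) for c in chunks], dim=1)  # (B, num_snippets, num_latents, dim)
```

Because each snippet is reduced to a fixed, small number of latents, the overall sequence length stays bounded even as the video gets longer.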
Let's first look at how typical video/audio transformers work. Architectures for video-language understanding commonly use a joint transformer, where the video input and text tokens are processed autoregressively together. Now let's see how Mirasol 3B uses the time-aligned and non-time-aligned modalities in conjunction.
For video, the basic form is a spatio-temporal representation. To extract it, the authors use sparse 3D tubes together with standard 2D patches, processed by a ViT encoder. The audio, in turn, is represented as spectrograms.
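For intuition, extracting a log-mel spectrogram as the audio input could look roughly like this with torchaudio; the file name and spectrogram settings are placeholders, not the paper's configuration:

```python
import torch
import torchaudio

# Load the audio track and turn it into a log-mel spectrogram image,
# which is then fed to an audio encoder (settings here are placeholders).
waveform, sr = torchaudio.load("clip.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=128)(waveform)
log_mel = torch.log(mel + 1e-6)   # (channels, n_mels, time_frames)
```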
The Combiner module plays two roles: it jointly fuses the video and audio features within each snippet, and it compresses them into a much smaller set of latent features, keeping the sequence compact but expressive.
Since these audio/video representations are modeled autoregressively, the representation of each time interval is predicted conditioned on the previous intervals; the value x_t is therefore fed sequentially into the autoregressive model.
The two autoregressive models, time-aligned and non-time-aligned, are then combined: the latent outputs ĥ of the time-aligned model are consumed via cross-attention by the text model, which generates w, the tokenized text sequence of length L. Note that the full model has 3B parameters (2.9B without audio). The paper reports experiments on various benchmarks along with ablations.
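Putting the two pieces together, here is a simplified sketch (again my own, with hypothetical names such as LatentCausalModel) of the time-aligned autoregressive model producing latents that the text decoder cross-attends to:

```python
import torch
import torch.nn as nn

class LatentCausalModel(nn.Module):
    """Toy time-aligned autoregressive component: the latent for snippet t
    only attends to snippets <= t (causal mask over snippets)."""
    def __init__(self, dim=512, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, snippet_latents):   # (B, num_snippets, dim), e.g. pooled combiner outputs
        S = snippet_latents.size(1)
        causal = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)
        return self.backbone(snippet_latents, mask=causal)   # h: (B, num_snippets, dim)

# The non-time-aligned text model then cross-attends to these latents h
# while generating the tokenized text sequence w:
text_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
text_decoder = nn.TransformerDecoder(text_layer, num_layers=4)
# logits come from text_decoder(text_embeddings, memory=h) followed by an output projection.
```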
2. MediTron 70B – Medical Pretraining for Large Language Models
Large Language Models (LLMs) can potentially democratize access to medical knowledge. However, existing medical LLMs are either closed-source (e.g., GPT-4 or PaLM) or limited in scale. MediTron (paper: https://arxiv.org/abs/2311.16079) is a pair of open-source LLMs (7B and 70B) built on Llama-2 by adapting Nvidia's Megatron-LM distributed trainer. MediTron was trained on a medical corpus including PubMed articles, abstracts, and more. The models were evaluated on four medical reasoning benchmarks using both:
· In-context learning – prompting within the context window
· Task-specific fine-tuning
Let's see the engineering behind MediTron 70B:
To harness large parameter counts and pre-training token counts, the authors built the Megatron-LLM distributed training library, extended from Nvidia's Megatron-LM to support three open-source LLMs: Llama, Falcon, and Llama-2. The library supports Data Parallelism (DP), Pipeline Parallelism (PP), and Tensor Parallelism (TP).
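Tensor parallelism, for example, shards individual weight matrices across GPUs. Below is a single-device simulation of a column-parallel linear layer, the core trick behind Megatron-style TP; the helper column_parallel_linear and all shapes are made up for illustration and are not the Megatron-LLM API:

```python
import torch

def column_parallel_linear(x, weight_shards, bias_shards):
    """Each 'GPU' holds a column slice of the weight matrix; the partial
    outputs are concatenated. Simulated here on a single device."""
    outs = [x @ w.t() + b for w, b in zip(weight_shards, bias_shards)]
    return torch.cat(outs, dim=-1)

# Split a (4096 -> 11008) projection across 4 shards and check it matches
# the unsharded layer.
W, b, x = torch.randn(11008, 4096), torch.randn(11008), torch.randn(2, 16, 4096)
sharded = column_parallel_linear(x, W.chunk(4, dim=0), b.chunk(4, dim=0))
assert torch.allclose(sharded, x @ W.t() + b, rtol=1e-4, atol=1e-4)
```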
Megatron-LM natively supports GPT-like architectures, but the MediTron researchers extended it to support Llama, Falcon, and Llama-2. MediTron comes in two sizes: MediTron 7B with a context length of 2048 and MediTron 70B with a context length of 4096. They integrated the necessary newer architectural features, such as rotary position embeddings, grouped-query attention, the parallel attention/MLP in the transformer layer of Falcon-40B, and the untying of the word embedding and the next-token prediction classifier weights used in Llama. They also added support for FlashAttention and FlashAttention-2 for more efficient inference and long-context decoding.
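As a taste of one of these features: grouped-query attention stores only a few key/value heads and shares each one across several query heads. A common way to implement it is to repeat the KV heads before standard attention; the shapes below are illustrative, not MediTron's actual configuration:

```python
import torch
import torch.nn.functional as F

def expand_kv(kv, n_heads, n_kv_heads):
    # (batch, n_kv_heads, seq, head_dim) -> (batch, n_heads, seq, head_dim)
    return kv.repeat_interleave(n_heads // n_kv_heads, dim=1)

q = torch.randn(1, 64, 128, 128)   # 64 query heads
k = torch.randn(1, 8, 128, 128)    # only 8 KV heads are stored -- the GQA memory saving
v = torch.randn(1, 8, 128, 128)
out = F.scaled_dot_product_attention(q, expand_kv(k, 64, 8), expand_kv(v, 64, 8))
```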
For the model architecture of Llama-2, they inherited the standard transformer architecture, RMSNorm, the SwiGLU activation function, and rotary positional embeddings. They also used GQA (grouped-query attention), introduced with Llama-2.
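For reference, RMSNorm and SwiGLU are simple to write down; this is a generic sketch of the two building blocks (standard formulations, not MediTron's source code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Normalize by the root-mean-square of the features, no mean subtraction
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        # SwiGLU feed-forward block as used in Llama-style transformers
        return self.down(F.silu(self.gate(x)) * self.up(x))
```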
Don't worry about these new features; we will explore them soon in this newsletter.
They followed OpenAI's ChatML format for the instruction data. A ChatML document consists of a series of messages, each starting with the special token <|im_start|> followed by the role (e.g., user or assistant) and ending with <|im_end|>.
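A quick illustration of such a document, using a made-up medical example rather than MediTron's actual instruction data:

```python
def to_chatml(messages):
    """messages: list of {"role": "user"/"assistant"/..., "content": str}"""
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)

print(to_chatml([
    {"role": "user", "content": "What is the first-line drug for type 2 diabetes?"},
    {"role": "assistant", "content": "Metformin, alongside lifestyle changes."},
]))
```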
That’s it for Week 2. Happy Day, Happy AI.
Follow me here to learn more about new releases in AI and AGI, explained clearly.
Papers referenced: Mirasol 3B: https://arxiv.org/abs/2311.05698 · MediTron 70B: https://arxiv.org/abs/2311.16079