Learn First Multimodal LLM without Trouble and Perfect Medical LLM for Medicinal Research

Hello All,

This is Raghul Gopal, an AWS Community Builder (ML & GenAI) and a research enthusiast in AI & AGI. Welcome to Week 1 of the Learn with Me newsletter, where I will be focusing on advances in Generative AI.

1. Mirasol 3B (Multimodal Autoregressive Model for Time-Aligned Video and Audio, and Non-Time-Aligned Contexts)

In the current era of Large Language Models, a central challenge is combining multiple heterogeneous modalities such as audio, video, and text. Video and audio arrive at much higher rates than text and are roughly aligned with each other in time, whereas text typically comes as global, unsynchronized context, e.g., a title or description.

Moreover, the volume of video and audio input grows with the length of the video, so these modalities demand more compute. To address this, the researchers introduced Mirasol 3B (which can be accessed from here: <>), which has two components:

  • an autoregressive component for time-synchronized modalities (audio/video)
  • an autoregressive component for context modalities that are sequential but not necessarily aligned in time (e.g., text)

Mirasol 3B architecture: Left – aligned modalities (video and audio); Right – non-aligned modalities (text)

To handle long video/audio sequences, the audio and video are partitioned into consecutive snippets whose representations are processed autoregressively. One of the nicest parts of Mirasol 3B is the Combiner mechanism, used in the autoregressive component for time-synchronized modalities, where the joint information is compressed into compact but expressive representations. These features let the model take 512 frames as input without increasing the parameter count.

Let’s see how video/audio-based transformers work. Architectures for video-language understanding commonly use a joint transformer, in which video inputs and text tokens are processed autoregressively together. Now let’s look at how Mirasol 3B uses the time-aligned and non-time-aligned components in conjunction.

For video, the basic form is a spatio-temporal representation. To extract it, they use sparse 3D tubes together with standard 2D patches, processed by a ViT encoder. Audio, in this work, is represented as spectrograms.
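To make that concrete, here is a minimal Python sketch (not the paper’s code) of turning raw audio into a log-mel spectrogram and video frames into flat 2D patches; the patch size, frame count, and mel settings are my own illustrative choices, and the 3D-tube extraction from the paper is omitted.

```python
# Minimal sketch (illustrative assumptions, not Mirasol 3B's preprocessing):
# audio -> log-mel spectrogram, video frames -> flat 2D patches.
import numpy as np
import librosa

def audio_to_spectrogram(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Log-mel spectrogram with shape (time_frames, n_mels)."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80)
    return librosa.power_to_db(mel).T

def frames_to_patches(frames: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split video frames (T, H, W, C) into flat 2D patches per frame."""
    T, H, W, C = frames.shape
    frames = frames.reshape(T, H // patch, patch, W // patch, patch, C)
    patches = frames.transpose(0, 1, 3, 2, 4, 5).reshape(T, -1, patch * patch * C)
    return patches  # (T, num_patches, patch_dim)

video = np.random.rand(32, 224, 224, 3).astype(np.float32)   # 32 dummy frames
audio = np.random.randn(16000 * 2).astype(np.float32)        # 2 s of dummy audio
print(frames_to_patches(video).shape, audio_to_spectrogram(audio).shape)
```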

Autoregressive Modeling of Video and Audio in Time

The Combiner module has two roles (see the sketch after the figure below), namely:

  • combining video/audio features within a given snippet in time (joint representations)
  • effectively compressing the representation of each audio/video snippet, allowing the model to scale to longer videos

Combiners: Left – standard Transformer Combiner; Right – TTM (Token Turing Machine)
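Here is a minimal PyTorch sketch of a Transformer-style combiner in that spirit: it concatenates per-snippet video and audio tokens and compresses them into a small, fixed number of output tokens via learned queries. The dimensions, token counts, and use of learned query tokens are my assumptions for illustration, not Mirasol 3B’s exact design.

```python
# Illustrative combiner sketch (assumed dimensions, not Mirasol 3B's code):
# per-snippet audio + video tokens -> a small, fixed set of combined tokens.
import torch
import torch.nn as nn

class TransformerCombiner(nn.Module):
    def __init__(self, dim: int = 512, out_tokens: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(out_tokens, dim))  # learned output slots
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens):
        # Joint representation: concatenate both modalities for this snippet.
        joint = torch.cat([video_tokens, audio_tokens], dim=1)   # (B, Nv + Na, dim)
        q = self.queries.expand(joint.size(0), -1, -1)           # (B, out_tokens, dim)
        compressed, _ = self.attn(q, joint, joint)               # cross-attend into joint tokens
        return compressed + self.ff(compressed)                  # (B, out_tokens, dim)

combiner = TransformerCombiner()
v = torch.randn(2, 196, 512)   # video tokens for one snippet
a = torch.randn(2, 50, 512)    # audio tokens for one snippet
print(combiner(v, a).shape)    # torch.Size([2, 32, 512])
```

The compressed tokens for each snippet are what the autoregressive model then consumes, which is how the sequence length stays manageable as the video grows.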

Since these audio/video representations are modeled autoregressively, each snippet's representation is predicted conditioned on those of the previous time intervals; the value x_t is passed sequentially to the autoregressive model.

The two autoregressive models, for time-aligned and non-time-aligned modalities, are combined by feeding the latent model output ĥ as cross-attention input when producing the text output, where w is the tokenized text sequence of length L. Note that their full model has 3B parameters (2.9B without audio). The experiments below cover several benchmarks as well as ablations.
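Here is a rough sketch of what that cross-attention step could look like using a standard decoder layer; the shapes and the choice of nn.TransformerDecoderLayer are mine for illustration, not the paper’s actual module.

```python
# Illustrative sketch: a text decoder cross-attending into the latent output ĥ
# of the time-aligned audio/video component (assumed shapes).
import torch
import torch.nn as nn

dim = 512
decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)

text_embeddings = torch.randn(2, 64, dim)   # w: tokenized text sequence, L = 64
h_latent = torch.randn(2, 128, dim)         # ĥ: latent tokens from the video/audio model

# Causal mask so each text position only attends to earlier text positions.
causal = nn.Transformer.generate_square_subsequent_mask(text_embeddings.size(1))
out = decoder_layer(tgt=text_embeddings, memory=h_latent, tgt_mask=causal)
print(out.shape)  # torch.Size([2, 64, 512])
```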

Results of Mirasol 3B on the MSRVTT-QA benchmark, compared with other state-of-the-art models.
Results of Mirasol 3B (long video QA) on the ActivityNet benchmark, compared with other state-of-the-art models.
Results of Mirasol 3B (long video QA) on the NExT-QA benchmark, compared with other state-of-the-art models.
Audio-video results of Mirasol 3B on Kinetics-Sound, VGG-Sound, and Epic-Sound.

2. MediTron 70B – Medical Pretraining for Large Language Models

Large Language Models (LLMs) can potentially democratize access to medical knowledge, but the models that encode such knowledge are either closed source (e.g., GPT-4 or PaLM) or limited in scale. MediTron (access it from here <>) is a pair of open-source medical LLMs (7B and 70B) built on Llama-2, trained through an adaptation of Nvidia's Megatron-LM distributed trainer. MediTron is trained on a medical corpus including PubMed articles, abstracts, and more. The models are evaluated on four medical reasoning benchmarks using both:

  • In-context learning – prompting within the context window
  • Task-specific fine-tuning

Complete Pipeline of MediTron

Let’s see the engineering behind MediTron 70B.

To scale both parameter count and pretraining token count, they built the Megatron-LLM distributed training library, extended from Nvidia's Megatron-LM to support three open-source LLM families, namely Llama, Falcon, and Llama-2. The library supports Data Parallelism (DP), Pipeline Parallelism (PP), and Tensor Parallelism (TP).
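To make tensor parallelism concrete, here is a tiny NumPy sketch of the core idea: a linear layer's weight matrix is split column-wise across devices, each device computes its own slice, and the partial outputs are gathered back together. This is a conceptual illustration only, not Megatron-LLM's implementation.

```python
# Conceptual tensor-parallelism sketch (not Megatron-LLM code): split a linear
# layer's weight matrix column-wise across "devices" and combine the partial outputs.
import numpy as np

def tensor_parallel_linear(x, W, num_devices=2):
    # W: (d_in, d_out), split into num_devices column shards.
    shards = np.split(W, num_devices, axis=1)
    partial_outputs = [x @ shard for shard in shards]   # each device computes its slice
    return np.concatenate(partial_outputs, axis=-1)     # all-gather along the output dim

x = np.random.randn(4, 1024)        # batch of activations
W = np.random.randn(1024, 4096)     # full weight matrix
assert np.allclose(tensor_parallel_linear(x, W), x @ W)
```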

Megatron-LM natively supports GPT-like architectures, but the MediTron researchers extended it to support Llama, Falcon, and Llama-2. MediTron comes in two sizes: MediTron 7B with a context length of 2048, and MediTron 70B with a context length of 4096. They integrated the necessary new architectural features, such as rotary position embeddings, grouped-query attention, the parallel attention/MLP layout from Falcon-40B's transformer layer, and the untying of the word embedding and next-token-prediction classifier weights used in Llama. They also added support for FlashAttention and FlashAttention-2 for more efficient inference and long-context decoding.

From Llama-2, they inherited the standard transformer architecture, RMSNorm, the SwiGLU activation function, and rotary positional embeddings. They also used the grouped-query attention (GQA) introduced in Llama-2.
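Here is a quick sketch of the grouped-query attention idea, where several query heads share one key/value head; the head counts and dimensions are illustrative, not Llama-2's configuration.

```python
# Illustrative grouped-query attention sketch (assumed head counts, not Llama-2's config):
# many query heads share a smaller number of key/value heads.
import torch
import torch.nn.functional as F

B, T, n_q_heads, n_kv_heads, head_dim = 2, 16, 8, 2, 64

q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

# Each group of n_q_heads // n_kv_heads query heads reuses the same K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)   # (B, n_q_heads, T, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

Sharing key/value heads shrinks the KV cache, which is the main reason GQA helps at the 70B scale.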

Don’t worry about these new features; we will be exploring them soon in this newsletter.

They followed OpenAI’s ChatML format for the instruction data. A ChatML document consists of a series of messages, each starting with the special token <|im_start|> followed by the role (user/assistant) and ending with <|im_end|>.

Example view of OpenAI’s ChatML format
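As a textual sketch of what a ChatML-formatted example might look like (the helper function and the message contents below are placeholders I made up, not MediTron's actual data):

```python
# Hypothetical ChatML-style formatting sketch (placeholder content, not MediTron's data).
def format_chatml(messages):
    parts = []
    for role, content in messages:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    return "\n".join(parts)

example = format_chatml([
    ("system", "You are a helpful medical assistant."),
    ("user", "A placeholder medical question goes here."),
    ("assistant", "A placeholder answer goes here."),
])
print(example)
```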
MediTron performance on the MedQA benchmark
Few-shot learning results of MediTron on different benchmarks, namely MedQA, MedMCQA, and more, compared with other state-of-the-art models.
Comparison of MediTron with other open-source models, under different inference modes, namely top-token selection, chain-of-thought, and self-consistency chain-of-thought.
Comparison of MediTron 70B with commercial LLMs such as GPT-3.5, MedPaLM 540B, GPT-4, and MedPaLM-2-540B

That’s it for this week. Happy Day, Happy AI.

Follow me here to learn more about new releases in AI and AGI with a clear understanding.

