A large language model (LLM) for mRNAs
Did you know there is a large language model, known as CodonBERT, for mRNA analysis and prediction tasks? An overview of CodonBERT's architecture and results is provided below. mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts.
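As a rough illustration of what using codons as inputs means in practice, here is a minimal sketch (a hypothetical helper, not the authors' code) that splits a coding sequence into 3-nucleotide codon tokens:

```python
# Hypothetical sketch: splitting an mRNA coding sequence into codon tokens.
# Function name and details are illustrative, not CodonBERT's actual code.
def to_codons(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping 3-nucleotide codons."""
    cds = cds.upper().replace("T", "U")          # work in the RNA alphabet
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

print(to_codons("AUGGCUAAAUAA"))   # ['AUG', 'GCU', 'AAA', 'UAA']
```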
CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods including on a new flu vaccine dataset.
We pre-trained CodonBERT with two tasks: masked language model (MLM) learning and homologous sequence prediction (HSP).
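For intuition, the sketch below shows BERT-style codon masking for the MLM task; the masking probability and token names are assumptions for illustration, not values taken from the CodonBERT paper. HSP, by contrast, presents pairs of sequences and asks whether they are homologous, conceptually similar to BERT's next-sentence prediction.

```python
import random

# Illustrative sketch of masked language model (MLM) pre-training on codon
# tokens. The masking ratio and [MASK] token are assumptions for illustration.
def mask_codons(codons, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide a fraction of codons; return masked inputs and the labels to recover."""
    inputs, labels = [], []
    for codon in codons:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(codon)      # the model must predict the original codon
        else:
            inputs.append(codon)
            labels.append(None)       # position is ignored in the MLM loss
    return inputs, labels

inputs, labels = mask_codons(["AUG", "GCU", "AAA", "GGC", "UAA"])
```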
CodonBERT learns, on its own, the genetic code and evolutionary homology

In addition to the quantitative evaluation of model predictions, e.g., loss and accuracy, we also performed several qualitative analyses of the embeddings provided by CodonBERT. To decipher what kind of biological information has been learned by the model and encoded in the representation, we randomly sampled 500 sequences for each category from the held-out dataset and extracted high-dimensional codon and sequence embeddings from CodonBERT. These were projected onto a 2-dimensional space by UMAP.
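A minimal sketch of this kind of embedding analysis, assuming the embeddings have already been extracted as a NumPy array (array shapes and variable names are illustrative):

```python
import numpy as np
import umap  # pip install umap-learn

# Placeholder for sequence embeddings extracted from a CodonBERT-style encoder
# (500 sampled sequences, assumed hidden size of 768).
embeddings = np.random.rand(500, 768)

# Project the high-dimensional embeddings onto 2D for visualization.
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)  # shape (500, 2), ready for a scatter plot
```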
The model structure for codons
A codon is composed of three adjacent nucleotides. There are five different options for each of these three positions, {A, U, G, C, N}, leading to a total of 125 possible combinations. Additionally, five special tokens are added to the vocabulary: classifier token [CLS], separator token [SEP], unknown token [UNK], padding token [PAD], and masking token [MASK]. Thus, in total, there are 130 tokens in the vocabulary of CodonBERT.
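A minimal sketch of how such a 130-token vocabulary could be enumerated (the exact token ordering and indexing in CodonBERT may differ):

```python
from itertools import product

# Enumerate all 5^3 = 125 codons over the extended nucleotide alphabet,
# then prepend the five special tokens, giving 130 tokens in total.
nucleotides = ["A", "U", "G", "C", "N"]
codons = ["".join(p) for p in product(nucleotides, repeat=3)]
special_tokens = ["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"]
vocab = {token: idx for idx, token in enumerate(special_tokens + codons)}

print(len(vocab))  # 130
```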
To enable the analysis and prediction of mRNA properties, we utilized 10 million mRNA coding sequences (CDS) from several species to train a large language model (CodonBERT) and to establish a foundational model. The model optimizes two self-supervised tasks: codon completion and homology detection. Like other unsupervised LLMs, we expected that such a foundational model would learn to capture aspects of natural selection that favor mRNA sequences with high expression and stable structure. Analysis of the resulting model indicates that it indeed learns several relevant biological properties of codons and sequences.

Projection of the codon embeddings obtained from CodonBERT produces distinct clusters that adhere to the amino acid types. In-depth analysis of CodonBERT's representation of a set of genes from different organisms revealed that CodonBERT autonomously learns the genetic code and principles of evolutionary homology. The projection of the sequence clusters (Figure 2(c–d)) separates organisms as well as genes based on their functions. This may indicate that CodonBERT not only learns the sequence of evolutionary occurrence but can learn a pseudo-evolutionary tree as part of its embeddings.
We further extended CodonBERT to perform several supervised prediction tasks for mRNA properties. These include datasets testing for recombinant protein expression, mRNA degradation, mRNA stability, and more. Our results indicate that CodonBERT is the top performing method overall and ranks first or second in performance for six of the seven tasks. All other methods we compared against performed poorly on some or all of the tasks.
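As a sketch of how a pre-trained encoder like CodonBERT can be adapted to such supervised property-prediction tasks, one could attach a small regression head on top of the sequence-level [CLS] embedding. The encoder interface below is a placeholder assumption, not the authors' published API:

```python
import torch
import torch.nn as nn

# Hedged sketch: fine-tuning a pre-trained codon-level encoder for a scalar
# mRNA property (e.g. expression or degradation). `pretrained_encoder` is a
# placeholder for a CodonBERT-style model returning per-token hidden states.
class MRNAPropertyRegressor(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim=768):
        super().__init__()
        self.encoder = pretrained_encoder      # frozen or fine-tuned encoder
        self.head = nn.Linear(hidden_dim, 1)   # maps [CLS] embedding to one scalar

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)       # assumed shape: (batch, seq_len, hidden_dim)
        cls_embedding = hidden[:, 0, :]        # [CLS] token summarizes the whole sequence
        return self.head(cls_embedding).squeeze(-1)
```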
mRNA stability
Stability is known to be structure-dependent, and stable structures such as stem-loops or hairpins can impede degradation enzymes, protecting the mRNA from rapid decay. A possible reason for the reduced performance on the mRNA stability dataset is that structural properties are highly dependent on nucleotides, whereas CodonBERT is a codon-based model. One possible solution is a model that combines codon and nucleotide representations. Similarly, mRNA modification events, including capping at the 5′ end and polyadenylation at the 3′ end in eukaryotes, are not currently encoded in our model but can also impact mRNA stability.
To conclude, our findings suggest that CodonBERT could serve as a versatile and foundational model for the development of new mRNA-based vaccines and the engineering and recombinant production of industrial and therapeutic proteins.
Author: Sejyoti Chakraborty