A large language model (LLM) for mRNAs
Did you know there is a large language model, known as CodonBERT, for mRNA analysis and prediction tasks? An overview of CodonBERT's architecture and results is provided below. mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts.
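As a rough illustration of what using codons as inputs means in practice, here is a minimal sketch (a hypothetical helper, not the authors' code) that splits a coding sequence into 3-nucleotide codon tokens:

```python
# Hypothetical sketch: splitting an mRNA coding sequence into codon tokens.
# Function name and details are illustrative, not CodonBERT's actual code.
def to_codons(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping 3-nucleotide codons."""
    cds = cds.upper().replace("T", "U")          # work in the RNA alphabet
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

print(to_codons("AUGGCUAAAUAA"))   # ['AUG', 'GCU', 'AAA', 'UAA']
```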
CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods including on a new flu vaccine dataset.
We pre-trained CodonBERT with two tasks: masked language model (MLM) learning and homologous sequence prediction (HSP).
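For intuition, the sketch below shows BERT-style codon masking for the MLM task; the masking probability and token names are assumptions for illustration, not values taken from the CodonBERT paper. HSP, by contrast, presents pairs of sequences and asks whether they are homologous, conceptually similar to BERT's next-sentence prediction.

```python
import random

# Illustrative sketch of masked language model (MLM) pre-training on codon
# tokens. The masking ratio and [MASK] token are assumptions for illustration.
def mask_codons(codons, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide a fraction of codons; return masked inputs and the labels to recover."""
    inputs, labels = [], []
    for codon in codons:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(codon)      # the model must predict the original codon
        else:
            inputs.append(codon)
            labels.append(None)       # position is ignored in the MLM loss
    return inputs, labels

inputs, labels = mask_codons(["AUG", "GCU", "AAA", "GGC", "UAA"])
```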
CodonBERT learns, on its own, the genetic code and evolutionary homology

In addition to the quantitative evaluation of model predictions, e.g., loss and accuracy, we also performed several qualitative analyses of the embeddings provided by CodonBERT. To decipher what kind of biological information has been learned by the model and encoded in the representation, we randomly sampled 500 sequences for each category from the held-out dataset and extracted high-dimensional codon and sequence embeddings from CodonBERT. These were projected onto a 2-dimensional space by UMAP.
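A minimal sketch of this kind of embedding analysis, assuming the embeddings have already been extracted as a NumPy array (array shapes and variable names are illustrative):

```python
import numpy as np
import umap  # pip install umap-learn

# Placeholder for sequence embeddings extracted from a CodonBERT-style encoder
# (500 sampled sequences, assumed hidden size of 768).
embeddings = np.random.rand(500, 768)

# Project the high-dimensional embeddings onto 2D for visualization.
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)  # shape (500, 2), ready for a scatter plot
```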
The model structure for codons
A codon is composed of three adjacent nucleotides. There are five different options for each of these three positions, {A, U, G, C, N}, leading to a total of 125 possible combinations. Additionally, five special tokens are added to the vocabulary: classifier token [CLS], separator token [SEP], unknown token [UNK], padding token [PAD], and masking token [MASK]. Thus, in total, there are 130 tokens in the vocabulary of CodonBERT.
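A minimal sketch of how such a 130-token vocabulary could be enumerated (the exact token ordering and indexing in CodonBERT may differ):

```python
from itertools import product

# Enumerate all 5^3 = 125 codons over the extended nucleotide alphabet,
# then prepend the five special tokens, giving 130 tokens in total.
nucleotides = ["A", "U", "G", "C", "N"]
codons = ["".join(p) for p in product(nucleotides, repeat=3)]
special_tokens = ["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"]
vocab = {token: idx for idx, token in enumerate(special_tokens + codons)}

print(len(vocab))  # 130
```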
To enable the analysis and prediction of mRNA properties, we utilized 10 million mRNA coding sequences (CDS) from several species to train a large language model (CodonBERT) and to establish a foundational model. The model optimizes two self-supervised tasks: codon completion and homology detection. Like other unsupervised LLMs, we expected that such a foundational model would learn to capture aspects of natural selection that favor mRNA sequences with high expression and stable structure. Analysis of the resulting model indicates that it indeed learns several relevant biological properties of codons and sequences.

Projection of the codon embeddings obtained from CodonBERT produces distinct clusters that adhere to the amino acid types. In-depth analysis of CodonBERT's representation of a set of genes from different organisms revealed that CodonBERT autonomously learns the genetic code and principles of evolutionary homology. The projection of the sequence clusters (Figure 2(c–d)) separates organisms as well as genes based on their functions. This may indicate that CodonBERT not only learns the sequence of evolutionary occurrence but can learn a pseudo-evolutionary tree as part of its embeddings.
We further extended CodonBERT to perform several supervised prediction tasks for mRNA properties. These include datasets testing for recombinant protein expression, mRNA degradation, mRNA stability, and more. Our results indicate that CodonBERT is the top performing method overall and ranks first or second in performance for six of the seven tasks. All other methods we compared against performed poorly on some or all of the tasks.
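As a sketch of how a pre-trained encoder like CodonBERT can be adapted to such supervised property-prediction tasks, one could attach a small regression head on top of the sequence-level [CLS] embedding. The encoder interface below is a placeholder assumption, not the authors' published API:

```python
import torch
import torch.nn as nn

# Hedged sketch: fine-tuning a pre-trained codon-level encoder for a scalar
# mRNA property (e.g. expression or degradation). `pretrained_encoder` is a
# placeholder for a CodonBERT-style model returning per-token hidden states.
class MRNAPropertyRegressor(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim=768):
        super().__init__()
        self.encoder = pretrained_encoder      # frozen or fine-tuned encoder
        self.head = nn.Linear(hidden_dim, 1)   # maps [CLS] embedding to one scalar

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)       # assumed shape: (batch, seq_len, hidden_dim)
        cls_embedding = hidden[:, 0, :]        # [CLS] token summarizes the whole sequence
        return self.head(cls_embedding).squeeze(-1)
```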
mRNA stability
Stability is known to be structure-dependent, and stable structures such as stem-loops or hairpins can impede degradation enzymes, protecting the mRNA from rapid decay. A possible reason for the reduced performance on the mRNA stability dataset is that structural properties are highly dependent on nucleotides, whereas CodonBERT is a codon-based model. One possible solution is a model that combines codon and nucleotide representations. Similarly, mRNA modification events, including capping at the 5′ end and polyadenylation at the 3′ end in eukaryotes, are not currently encoded in our model but can also impact mRNA stability.
To conclude, our findings suggest that CodonBERT could serve as a versatile and foundational model for the development of new mRNA-based vaccines and the engineering and recombinant production of industrial and therapeutic proteins.
Author: Sejyoti Chakraborty