GenAI: Synthesizing DNA Sequences with LLM Techniques

This autoregressive LLM methodology is not limited to genome data. The purpose is to design a generic solution that may also work in other contexts, such as synthesizing molecules. The problem involves processing a large amount of “text”. Indeed, the sequences discussed here are strings of letters drawn from a five-symbol alphabet: A, C, G, T, and N. The first four symbols stand for the types of bases found in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T). The last one (N) represents missing data. No prior knowledge of genome sequencing is required.

Summary

The data consists of DNA sequences from a number of individuals, categorized according to the type of genetic patterns found in each sequence. The goal is to synthesize realistic DNA sequences, evaluate the quality of the synthetic sequences, and compare the results with random sequences. The idea is to look at a DNA string S_1 consisting of n_1 consecutive symbols and identify potential candidates for the next string S_2 consisting of n_2 symbols. Each candidate S_2 is assigned a probability conditional on S_1, and these transition probabilities are used to sample S_2 given S_1. The window then moves to the right by n_2 symbols and the process repeats, eventually building a synthetic sequence of arbitrary length. There is some analogy to Markov chains.
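The sampling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the link in the next section: the function names (build_transitions, sample_next, synthesize), the toy sequences, the choice n_1 = 4 and n_2 = 2, and the fallback used when a context S_1 was never observed are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def build_transitions(sequences, n1, n2):
    """Count how often each block S2 (n2 symbols) follows each block S1 (n1 symbols)."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - n1 - n2 + 1):
            s1 = seq[i:i + n1]
            s2 = seq[i + n1:i + n1 + n2]
            table[s1][s2] += 1
    return table

def sample_next(table, s1, rng):
    """Sample S2 with probability proportional to its observed count after S1."""
    counts = table.get(s1)
    if not counts:
        # Assumption: for an unseen context, borrow the counts of a random known context.
        counts = table[rng.choice(sorted(table))]
    blocks = list(counts)
    weights = [counts[b] for b in blocks]
    return rng.choices(blocks, weights=weights, k=1)[0]

def synthesize(table, seed, length, n1, seed_value=0):
    """Grow a sequence by n2 symbols per step, conditioning on the last n1 symbols."""
    rng = random.Random(seed_value)
    out = seed
    while len(out) < length:
        out += sample_next(table, out[-n1:], rng)
    return out[:length]

# Toy "real" sequences (made up for illustration, not actual genome data).
real = ["ACGTACGTTAGCNACGTAACG", "TTAGCACGTACGGACGTTAGC"]
table = build_transitions(real, n1=4, n2=2)
synthetic = synthesize(table, seed="ACGT", length=40, n1=4)
print(synthetic)
```

Dividing each count by the total for its context S_1 gives the conditional probability of S_2 given S_1; random.choices accepts unnormalized weights, so the sketch skips that step. Larger n_1 captures longer patterns but makes unseen contexts more frequent, which is why some fallback rule is needed.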

Python code and documentation

To access the code and documentation, including how to evaluate the results, follow this link.

Open source, free, no sign-up required. To avoid missing future articles, including upcoming ones about LLMs, check out our newsletter, here.

Evaluation scatterplot: Random (orange) vs synthetic (blue) vs real DNA (red diagonal)


Joseph Pareti

AI Consultant @ Joseph Pareti's AI Consulting Services | AI in CAE, HPC, Health Science

11 months

And the result is what? Something that Illumina has been doing for a while?

Interesting. Will follow. Thanks.
