Generative AI in Biology: Exploring Mathematical Functions Behind the Methods
Author: Emmimal Alexander
#Newsletter-Issue 003
Welcome to the Newsletter "Mathematical Functions Describe the Biological World." This newsletter will publish a series of articles designed to emphasize the indispensable role of mathematics in biology, computational biology, and bioinformatics. These articles will show how mathematical functions form the backbone of many of the methods and algorithms that bioinformaticians rely on every day.
When I first started studying machine learning, the mathematics was hard and confusing. But as I continued to learn, I realised that these mathematical ideas are powerful tools to understand the concepts behind AI models. Here, let us explore how generative AI and its mathematical concepts help us understand biological data in new ways.
Generative AI models show us how mathematics can connect theory to real life. They allow us to simulate and predict biological events in ways we could not before. By using probability, optimization and neural networks, we have amazing tools that expand our knowledge of biology.
Join me as I explain the mathematical foundations of generative AI and show you why it is changing biology so much.
Exploring the Power of Generative AI: Creating Synthetic DNA for Innovative Research
Imagine if you could create completely new DNA sequences without having a real biological sample. It may sound like a science fiction movie, but it's becoming a reality. This advanced technology uses complex mathematical formulas to generate synthetic biological data, such as DNA sequences, which can then be used to train machine learning models.
Generative AI, particularly methods like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), has shown remarkable potential in this area. These techniques allow researchers to generate artificial DNA sequences that closely resemble real biological data, providing a valuable tool for research and model training.
Simply put, generative models help us create data that looks like real biological information without the need for real samples. This is extremely helpful for scientists who need huge amounts of data but struggle to get enough real samples. In this article, we will explore how these technologies work, their applications and the benefits they offer. Join me as we explore the exciting intersection of mathematics, technology and biology and discover how synthetic DNA is revolutionizing the way we conduct research and develop new models.
Generative AI Models in Biology
Historical Context and Milestones
Generative AI models have brought major improvements in many areas, including biology.
In 2014, Ian Goodfellow and his team introduced GANs (Generative Adversarial Networks), which can generate very realistic images. This method has also been used to generate biological data.
GANs have shown significant potential in generating synthetic DNA sequences and creating synthetic gene expression profiles that can help researchers understand gene regulation and variability without the need for extensive biological samples (Killoran et al., 2017; Lee, 2023). By training GANs on large datasets of DNA sequences, researchers can generate new sequences that maintain the structural and functional features of the original data.
In 2013, Kingma and Welling developed Variational Autoencoders (VAEs). These models help us understand complex data patterns and have been used in biology to generate synthetic gene expression profiles and predict protein structures. For example, VAEs have been used to generate novel protein variants, such as bacterial luciferase. They have been shown to be able to generate protein sequences with specific properties, such as improved solubility, which were then synthesized and tested in E. coli (Hawkins-Hooker et al., 2021). In addition, VAEs have been used to generate realistic tertiary protein structures by learning the latent space of protein contact maps, enabling the creation of novel and higher quality protein structures (Guo et al., 2021). Another application is the use of VAEs to generate 3D models of blood vessels, demonstrating their effectiveness in creating accurate vascular structures for medical imaging and anatomical studies (Hsu et al., 2023). These examples illustrate the versatility and powerful generative capabilities of VAEs in advancing biological research.
A 2017 study by Way, G. P., & Greene, C. S. showed how Variational Autoencoders (VAEs) can be used to analyze gene expression. They used VAEs to create a biologically meaningful latent space from cancer transcriptomes, illustrating the potential of these models for understanding gene expression and variability.
Variational autoencoders (VAEs) are commonly used to analyze single-cell RNA sequencing data, as shown by Erfanian et al. Their utility is expected to extend to various spatial transcriptome analyses, offering promising applications in this field. They used VAEs to create realistic gene expression profiles and showed how these models can help us understand gene regulation and variability.
These advances in generative models are helping us learn more about biology in exciting new ways.
Generative Adversarial Networks (GANs) for Synthetic DNA
Ian Goodfellow and his team introduced Generative Adversarial Networks (GANs) in 2014. This was a major step forward in the field of machine learning and led to new research and applications.
GANs are machine learning tools that generate new, synthetic data by learning from existing data. They have two main components:
1. Generator: This neural network creates synthetic data.
2. Discriminator: This neural network checks if the data is real or fake.
How GANs Work
Discriminator Loss (D_loss)
This tells us how well the discriminator can tell the difference between real and fake DNA sequences:
The discriminator loss is:
D_loss = -[log D(x) + log(1 - D(G(z)))]
Where D(x) is the discriminator's estimate of the probability that real data x is real, and D(G(z)) is the discriminator's estimate that generated data G(z) is real.
The main goal of this process: the discriminator tries to get better at distinguishing between real and fake DNA sequences, and it improves its accuracy by minimizing this loss.
Generator Loss (G_loss)
G_loss shows how well the generator can trick the discriminator into thinking synthetic DNA sequences are real.
Let us see how it works. The generator loss is:
G_loss = -log D(G(z))
Where G(z) is the data generated by the generator from random noise z.
The main aim of this generator is to make the synthetic DNA sequences so realistic that the discriminator will mistakenly think they are real. It adjusts based on this loss to improve its output.
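These two losses can be checked numerically with a small NumPy sketch. The discriminator scores below are invented for illustration only:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # -[log D(x) + log(1 - D(G(z)))], averaged over the batch
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    # -log D(G(z)): the generator wants D(G(z)) to approach 1
    return -np.mean(np.log(d_fake))

# Invented discriminator scores: confident on real, unconvinced by fakes
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])
print(f"D_loss = {d_loss(d_real, d_fake):.4f}")  # low: discriminator is doing well
print(f"G_loss = {g_loss(d_fake):.4f}")          # high: generator is not fooling it yet
```

As the generator improves, the scores on fake data rise toward 1, which pushes G_loss down and D_loss up.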
Here is a simple Python code snippet.
Discriminator Model
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
import numpy as np
# Function to build the Discriminator model
def build_discriminator():
    model = Sequential([
        Dense(16, input_dim=8),         # Input is a DNA sequence encoded as 8 features
        LeakyReLU(alpha=0.2),
        Dense(8),
        LeakyReLU(alpha=0.2),
        Dense(1, activation='sigmoid')  # Output is the probability of being real
    ])
    return model
GAN Model
# Function to build the Generator model
# (this architecture is a minimal placeholder; the article does not define one)
def build_generator():
    model = Sequential([
        Dense(16, input_dim=4),          # Input: 4-dimensional random noise vector
        LeakyReLU(alpha=0.2),
        Dense(8, activation='sigmoid')   # Output: 8-dimensional synthetic feature vector
    ])
    return model

# Function to build the GAN model
def build_gan(generator, discriminator):
    discriminator.trainable = False  # We only train the generator here
    model = Sequential([generator, discriminator])
    return model

# Build the generator and discriminator, then the combined GAN
generator = build_generator()
discriminator = build_discriminator()
gan = build_gan(generator, discriminator)
# Print the summary of the GAN to verify the model structure
gan.summary()
# Example input: DNA sequence into an 8-dimensional feature vector
real_dna_sequence = "ATCGTAGC"
# Hypothetical transformation to 8-dimensional feature
encoded_real_sequence = np.array([0.25, 0.25, 0.25, 0.25, 0.3, 0.2, 0.4, 0.1])
# Create synthetic data identical to real data for testing
encoded_synthetic_sequence = np.array([0.25, 0.25, 0.25, 0.25, 0.3, 0.2, 0.4, 0.1])
# Build and compile the discriminator model (compiling is required before evaluate())
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Print the summary of the discriminator to verify the model structure
discriminator.summary()
# Evaluate the real and synthetic sequences
real_output = discriminator.predict(encoded_real_sequence.reshape(1, -1))
synthetic_output = discriminator.predict(encoded_synthetic_sequence.reshape(1, -1))
print(f"Input Features (Real): {encoded_real_sequence}")
print(f"Output Probability (Real): {real_output[0][0]}")
print(f"Input Features (Synthetic): {encoded_synthetic_sequence}")
print(f"Output Probability (Synthetic): {synthetic_output[0][0]}")
# Measure the Discriminator Loss
real_label = np.array([1])
fake_label = np.array([0])
d_loss_real = discriminator.evaluate(encoded_real_sequence.reshape(1, -1), real_label)
d_loss_fake = discriminator.evaluate(encoded_synthetic_sequence.reshape(1, -1), fake_label)
print(f"Discriminator Loss (Real): {d_loss_real}")
print(f"Discriminator Loss (Synthetic): {d_loss_fake}")
Applying These Concepts to Generate Synthetic DNA
Train the Discriminator:
· Feed real and synthetic DNA sequences to the discriminator.
· Measure how well the discriminator distinguishes between them using the discriminator loss (D_loss).
· Adjust the discriminator's weights to improve its accuracy.
Train the Generator:
· The generator creates synthetic DNA sequences from random noise.
· The discriminator evaluates these sequences.
· Measure how well the generator fools the discriminator using the generator loss (G_loss).
· Adjust the generator's weights to make more convincing synthetic DNA sequences.
This process is repeated many times. As the discriminator gets better at spotting fakes, the generator gets better at creating sequences that look real. This results in high-quality synthetic DNA sequences.
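The alternating loop described above can be sketched with a toy NumPy example. Here the "DNA sequences" are replaced by one-dimensional numbers and both networks are reduced to single linear units, so the adversarial dynamics are visible without a deep learning library; everything below is an illustrative assumption, not the article's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# "Real" data: samples from N(2, 0.5); generator: G(z) = a*z + c on noise z
w, b = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + b)
a, c = 1.0, 0.0   # generator parameters
lr = 0.05

for step in range(2000):
    x = rng.normal(2.0, 0.5, 64)   # real batch
    z = rng.normal(0.0, 1.0, 64)   # random noise
    g = a * z + c                  # synthetic batch

    # Discriminator step: ascend log D(x) + log(1 - D(G(z)))
    dx, dg = sigmoid(w * x + b), sigmoid(w * g + b)
    w += lr * (np.mean((1 - dx) * x) - np.mean(dg * g))
    b += lr * (np.mean(1 - dx) - np.mean(dg))

    # Generator step: ascend log D(G(z)) (non-saturating generator loss)
    dg = sigmoid(w * (a * z + c) + b)
    a += lr * np.mean((1 - dg) * w * z)
    c += lr * np.mean((1 - dg) * w)

# After training, generated samples should drift toward the real data's mean of 2
samples = a * rng.normal(0.0, 1.0, 1000) + c
print(f"mean of generated samples: {samples.mean():.2f}")
```

The same two-step alternation, scaled up to deep networks and one-hot encoded sequences, is what GAN frameworks implement for synthetic DNA.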
Evaluation Metrics
To check the quality of the synthetic DNA, researchers typically compare statistical properties of the generated and real sequences, such as nucleotide composition, GC content, and k-mer frequency distributions, and test whether models trained on synthetic data still perform well on real data.
By following these steps, GANs can generate realistic and useful synthetic DNA sequences for research applications such as DNA vaccine studies, gene expression studies, drug development, and the discovery of genes with biotechnological potential. They can also be used to train deep learning models, reducing the number of biological samples (animals, plants, etc.) needed.
Example and Visualization
Generator Network (G): Starts with a random noise vector z and produces a DNA sequence G(z).
Discriminator Network (D): Takes a DNA sequence and outputs a probability score D(x) indicating whether the sequence is real or fake.
Visual Pattern
Imagine you have a sequence of DNA:
Real: ATCGTAGCTAGCT
Fake: ATCCGTAGCTAGG
Below is an image of a DNA structure with the real and fake patterns.
Variational Autoencoders (VAEs) for Synthetic DNA
Variational Autoencoders (VAEs) are another way to create new data by learning from existing data. Let us see how they work and how they can be used to make synthetic DNA sequences.
Structure and Function
VAEs have three main parts:
Encoder: Compresses the input data (a long DNA sequence, e.g., ATCGGATC...) into a shorter vector (e.g., [0.2, -1.3, 0.8]) in the latent space.
Latent Space: A compressed version of the input data, keeping only the most important features.
Decoder: Reconstructs the original data from the latent space.
How VAEs Work
Encoding
The encoder takes input data, like a DNA sequence, and maps it to the latent space. It creates two outputs for each data point: a mean (μ) and a variance (σ^2), which define a normal distribution:
z ~ N(μ, σ^2)
where z is a point in the latent space.
Sampling
A point z is sampled from this normal distribution, ensuring the latent space follows a predefined distribution, usually Gaussian.
Decoding
The decoder takes this point z from the latent space and reconstructs it back into the original data format (e.g., a DNA sequence).
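The encode-sample-decode pipeline can be sketched in a few lines of NumPy. The mean and variance values are the illustrative ones used later in this article, and z = μ + σ·ε is the "reparameterization trick" VAEs use to make the sampling step differentiable:

```python
import numpy as np

rng = np.random.default_rng(42)

# Encoder output for one DNA sequence (illustrative values)
mu = np.array([0.5, -0.1])     # mean of the latent distribution
var = np.array([0.01, 0.01])   # variance of the latent distribution

# Sampling via the reparameterization trick: z = mu + sigma * epsilon
eps = rng.standard_normal(mu.shape)
z = mu + np.sqrt(var) * eps

print(f"sampled latent point z = {z}")
# A trained decoder would now map z back to a DNA sequence.
```

Because the variance is small, sampled points stay close to the mean, so decoded sequences resemble the original while still varying slightly.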
Mathematical Functions
VAEs are trained to minimize two types of errors:
Reconstruction Error
Reconstruction Error measures how well a model can recreate, or "reconstruct," its input data after it has been processed by the model. In the context of VAEs:
· Encoder:?Compresses the input data into a smaller, latent representation.
· Decoder:?Takes this latent representation and attempts to reconstruct the original data.
The goal of minimizing the reconstruction error is to ensure that the reconstructed data is as close as possible to the original input data.
This is commonly measured by the Mean Squared Error (MSE):
MSE = (1/n) Σi (xi - x̂i)^2
Where:
· xi is the original data point: the data as it was input into the model. For example, if your data is a DNA sequence, xi would be a specific sequence from the original set.
· x̂i is the reconstructed data point: the output from the decoder, which aims to recreate the original data as closely as possible. For instance, if the original sequence was "ATCG", the decoder might output "ATGG".
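For the "ATCG" versus "ATGG" example, the MSE can be computed directly on one-hot encoded sequences (the A/T/C/G column convention below is just one fixed choice):

```python
import numpy as np

# One-hot convention used here: columns A, T, C, G
onehot = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0], 'C': [0, 0, 1, 0], 'G': [0, 0, 0, 1]}

x = np.array([onehot[b] for b in "ATCG"], dtype=float)      # original
x_hat = np.array([onehot[b] for b in "ATGG"], dtype=float)  # reconstruction

mse = np.mean((x - x_hat) ** 2)
print(mse)  # one wrong base flips 2 of the 16 entries: 2/16 = 0.125
```

A perfect reconstruction would give an MSE of exactly 0; each additional wrong base adds another 2/16 to the error.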
KL Divergence
KL Divergence measures how one probability distribution differs from another, which we use as a reference. In VAEs, KL Divergence is used to quantify how much the learned latent space distribution deviates from a known, ideal distribution (usually a Gaussian distribution).
The formula for the KL Divergence term in a VAE with a standard normal prior is:
D_KL(Q(z|X) || P(z)) = (1/2) Σi (σi^2 + μi^2 - 1 - log σi^2)
Where μi and σi^2 are the mean and variance from the encoder and N denotes the Gaussian distribution.
Here is what each term represents:
· Q(z|X): the learned distribution of the latent variables z given the input data X. It is typically a Gaussian distribution parameterized by the mean μi and variance σi^2.
· P(z): the prior distribution of the latent variables, which is usually a standard normal distribution (mean 0, variance 1).
· μi: mean of the latent variable zi for the i-th dimension.
· σi^2: variance of the latent variable zi for the i-th dimension.
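Plugging in the illustrative latent parameters used later in this article (μ = [0.5, -0.1], σ^2 = [0.01, 0.01]) gives:

```python
import numpy as np

mu = np.array([0.5, -0.1])
var = np.array([0.01, 0.01])

# KL divergence between N(mu, var) and the standard normal prior N(0, 1)
kl = 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))
print(f"KL divergence: {kl:.4f}")
```

The value is fairly large here mainly because the tiny variances sit far from the prior's variance of 1; during training this term pushes the encoder's σ^2 toward 1 and μ toward 0.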
Using VAEs to Generate Synthetic DNA
Learn Compressed Representations: The VAE learns a compressed version of DNA sequences in the latent space.
Generate New Sequences: By sampling new points from the latent space and passing them through the decoder, the VAE can generate new DNA sequences with equivalent properties to the original sequences.
Example
Encoder
The encoder compresses a DNA sequence into a point in the latent space, e.g.,
Latent: (μ = [0.5, -0.1], σ^2 = [0.01, 0.01])
Decoder
A new point is sampled from this distribution, and the decoder reconstructs it: Reconstructed: ATCGTAGCTAGCT
New Sequences
By sampling different points from the latent space, new but similar sequences can be generated:
New Sequence: ATCGTAGCTTGCT
One-hot encoding is used to transform DNA sequences into a numerical format suitable for machine learning models: each nucleotide (A, T, C, G) is encoded as a binary vector. The sequence ATCGTAGCTAGCT would be encoded as:
[
[1, 0, 0, 0], # A
[0, 1, 0, 0], # T
[0, 0, 1, 0], # C
[0, 0, 0, 1], # G
[1, 0, 0, 0], # A
[0, 1, 0, 0], # T
[0, 0, 1, 0], # C
[0, 0, 0, 1], # G
[1, 0, 0, 0], # A
[0, 0, 1, 0], # C
[0, 1, 0, 0], # T
[1, 0, 0, 0], # A
[0, 0, 1, 0] # C
]
Here is the simple Python code snippet:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
# Function to encode a DNA sequence into one-hot format
def encode_sequence(sequence):
    # Initialize the label encoder and fit it to the nucleotide letters
    # (LabelEncoder sorts labels alphabetically, so the column order is A, C, G, T)
    label_encoder = LabelEncoder()
    label_encoder.fit(list('ACGT'))
    # Transform the sequence into integer-encoded format
    integer_encoded = label_encoder.transform(list(sequence)).reshape(-1, 1)
    # One-hot encode; fixing the categories ensures all four bases get a column
    onehot_encoder = OneHotEncoder(sparse_output=False, categories=[list(range(4))])
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    return onehot_encoded
# Example usage
original_sequence = 'ATCGTAGCTAGCT'
encoded_sequence = encode_sequence(original_sequence)
# Define the latent space parameters (mean and variance)
latent_mean = np.array([0.5, -0.1])
latent_variance = np.array([0.01, 0.01])
# Output the original sequence and latent space parameters
print(f"Original Sequence: {original_sequence}")
print(f"Latent Mean: {latent_mean}")
print(f"Latent Variance: {latent_variance}")
# Dummy example for new sequences generated from latent space
# In a real scenario, these would be generated by the VAE's decoder
new_sequences = [
"ATCGTAGCTAGCT",
"ATCGTAGCTAGCT",
"ATCGTAGCTAGCT"
]
# Output the new sequences
print("New Sequences:")
for i, seq in enumerate(new_sequences, 1):
    print(f"Sample {i}: {seq}")
Original Sequence: ATCGTAGCTAGCT
Latent Mean: [ 0.5 -0.1]
Latent Variance: [0.01 0.01]
New Sequences:
Sample 1: ATCGTAGCTAGCT
Sample 2: ATCGTAGCTAGCT
Sample 3: ATCGTAGCTAGCT
In summary, VAEs are used to generate synthetic DNA sequences by learning a compressed representation of real DNA. They generate new sequences by sampling from this learned representation, ensuring the new sequences have similar properties to the original data. This involves minimizing reconstruction error and ensuring the latent space follows a normal distribution.
The Impact on Research and Development
Generating synthetic biological data is changing research in many exciting ways. In drug discovery, researchers can use synthetic DNA to see how different genetic variations might affect a drug's effectiveness or safety. This helps them understand what might happen before testing on real people.
In personalized medicine, synthetic data is a game-changer. It helps create models that predict how individual genetic profiles might respond to treatments. This allows doctors to customize treatments for each person based on their unique genes. It makes healthcare more accurate and works better for everyone.
This technology helps progress in computational biology and bioinformatics. It gives scientists better data to train models that predict gene functions, protein structures, and more. With synthetic data, researchers can create new algorithms and improve existing ones because they have access to diverse and abundant information.
Overall, synthetic data is opening up new possibilities in science and medicine. It's making research faster, safer, and more efficient, ultimately helping to improve our understanding and treatment of diseases.
Practical Applications in Biology
Generative AI is making a big impact in biology with its ability to handle complex questions using mathematical functions.
· Protein Structure Prediction: AI models predict protein structures by learning from existing data. This helps researchers identify new drug targets and design proteins with specific functions.
· Synthetic Biology: AI is used to create synthetic gene sequences and metabolic pathways. This innovation leads to the engineering of organisms with new traits, advancing fields like biotechnology and creating new biofuels or medicines.
· Genomic Data Simulation: Generating realistic genomic sequences helps researchers test new methods without relying on hard-to-obtain real data. This is crucial for advancing research when actual data is limited.
Overall, generative AI is speeding up research, making it safer, and enhancing our ability to understand and treat diseases. It’s a powerful tool that opens up new possibilities in science and medicine.
Conclusion
Generative AI is changing how we do biological research and applications. By using mathematical tools like probability and optimization, these models are uncovering new insights in biology. The progress in these models shows how important math is for scientific discovery.
As we continue to explore how math and biology connect, remember that these mathematical ideas are not just abstract concepts. They are powerful tools that help us discover and create new things.
Thank you for joining me in exploring how generative AI and math are shaping biology.
References
1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems, 27. https://dl.acm.org/doi/10.5555/2969033.2969125
2. Killoran, N., Lee, L. J., Delong, A., Duvenaud, D., & Frey, B. J. (2017). Generating and designing DNA with deep generative models. arXiv, arXiv:1712.06148. https://arxiv.org/abs/1712.06148
3. Lee, M. (2023). Recent Advances in Generative Adversarial Networks for Gene Expression Data: A Comprehensive Review. Mathematics, 11(14), 3055. https://doi.org/10.3390/math11143055
4. Kingma, D. P., & Welling, M. (2019). An Introduction to Variational Autoencoders. Foundations and Trends® in Machine Learning, 12(4), 307-392. Available at: https://arxiv.org/abs/1906.02691
5. Hawkins-Hooker, A., Depardieu, F., Baur, S., Couairon, G., Chen, A., & Bikard, D. (2021). Generating functional protein variants with variational autoencoders. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1008736
6. Guo, X., Du, Y., Tadepalli, S., Zhao, L., & Shehu, A. (2021). Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder. arXiv, arXiv:2004.07119. https://arxiv.org/abs/2004.07119
7. Hsu, C., Fannjiang, C., & Listgarten, J. (2024). Generative models for protein structures and sequences. Nature Biotechnology, 42, 196-199. https://doi.org/10.1038/s41587-023-02115-w
8. Way, G. P., & Greene, C. S. (2017). Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with Variational Autoencoders. bioRxiv. https://doi.org/10.1101/174474
9. Erfanian, N., Heydari, A. A., Iañez, P., Derakhshani, A., Ghasemigol, M., Farahpour, M., Nasseri, S., Safarpour, H., & Sahebkar, A. (2021). Deep learning applications in single-cell omics data analysis. bioRxiv. https://doi.org/10.1101/2022.02.28.482392
10. Angermueller, C., Pärnamaa, T., Parts, L., & Stegle, O. (2016). Deep learning for computational biology. Molecular Systems Biology, 12(7), 878. https://doi.org/10.15252/msb.20156651
11. Alipanahi, B., et al. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8), 831-838.
12. Qiu, Y. L., Zheng, H., & Gevaert, O. (2020). Genomic data imputation with variational auto-encoders. GigaScience, 9(8), giaa082. https://doi.org/10.1093/gigascience/giaa082
13. Huang, L., Song, M., Shen, H., Hong, H., Gong, P., Deng, H.-W., & Zhang, C. (2023). Deep Learning Methods for Omics Data Imputation. Biology, 12(10), 1313. https://doi.org/10.3390/biology12101313
14. Creswell, A., & Bharath, A. A. (2018). Generative Adversarial Networks: An Overview. Available at: https://arxiv.org/abs/1710.07035
15. Image Source 1 & 2: Generated with Python; libraries used: matplotlib and NumPy.
About the author:
Emmimal Alexander is an experienced web application development professional with a profound interest in artificial intelligence and machine learning. She has over seven years of experience in stock analysis and holds a Master's degree in Business Administration. Emmimal is currently working on AI-based projects and has expertise in Python, HTML, CSS, and JavaScript. You can learn more about her work on her online platform, emitechlogic.com, which is designed to help individuals learn and master AI and machine learning using Python, HTML, and JavaScript.
The article "Generative AI in Biology: Exploring Mathematical Functions Behind the Methods", published by Emmimal Alexander in the newsletter Mathematical Functions Describe the Biological World, issue 003, on July 15, 2024, was reviewed and edited by Leandro de Mattos Pereira and minor improvements were made.
Disclaimer
The content provided in the "Mathematical Functions Describe the Biological World" newsletter is for scientific dissemination. While the technologies discussed, such as Generative AI and synthetic DNA generation, hold significant promise, they must be used responsibly and ethically. It is crucial to ensure that these advancements benefit humanity and adhere to ethical guidelines in research, application, and both the national laws of individual countries and international law.
Additional Disclaimer: While we strive to ensure accuracy, the information may not be exhaustive, and we do not guarantee its completeness. We encourage readers to verify the information independently.
The opinions expressed are those of the authors and do not necessarily reflect the views of our organization. Readers are encouraged to conduct their own research, seek professional assistance, and not make decisions based on the information provided in this newsletter. We are not responsible for any errors, omissions, or any losses or damages arising from the use of this information. Additionally, we are not responsible for any unethical or illegal use of this information that contravenes national and international research standards and laws in any country.
Important Note: The potential of Artificial Intelligence (AI) to transform various fields, including biology, must be approached with caution and foresight. In 2023, over 1,000 AI researchers and technologists, including notable figures like Elon Musk and Steve Wozniak, signed an open letter calling for a pause on AI development due to ethical and safety concerns. This highlights the necessity of ongoing dialogue and regulation to ensure AI serves humanity's best interests. You can read the information about the letter at https://futureoflife.org/open-letter/pause-giant-ai-experiments/.
Sponsorship Opportunity
Sponsor our biweekly newsletter and help cover the substantial video production costs for Synthesia. Professionals will be invited to review articles ($30/hour), and contributors will be compensated for creating free AI and Bioinformatics courses for our YouTube channel, Databiomics. Sponsors will be acknowledged at the end of each article. Your sponsorship will be received as a donation to the author(s), enabling us to make knowledge free and accessible, positively impacting the world. Please contact us to sponsor and support our mission of Transforming Data into Knowledge.
For more information, please contact: [email protected]