Nobel Prize, Generative AI & Lifesciences: Decoding how they come together and why!

It's time for the next blog to dig into Lifesciences & AI, and what better timing than now to unravel a key topic! In a first of its kind, the 2024 Nobel Prize in Chemistry was awarded to an outcome grounded in Generative AI: the path-breaking work of computationally predicting the 3D structures of proteins from their sequences and designing entirely new ones.

Demis Hassabis and John Jumper from DeepMind were co-awarded the prize for the creation of AlphaFold in 2020, with the third winner being David Baker, founder of the Baker Lab, for his seminal work in de novo protein design using computational techniques back in 2003.

With all the buzz around AI and Generative AI and seeing that it is beyond hype with awards of the highest caliber, how much do we know about the different subtypes of models associated with it and the life sciences use cases that are a good fit for each type?

Read on to find out!


What is Generative AI?

What we already know: Re-establishing the context

Generative AI is a deep learning technique encompassing various models that enable the creation/generation of new objects (text, images, code) based on patterns previously learned from existing data. Deep learning techniques are based on neural networks, so named because they loosely mimic the thinking process inside our brain, where neurons are wired together and activated. While text generation and NLP techniques have been around for years, the ability to do it at a complex scale arrived with the transformer architecture and BERT (we talked about this in the last newsletter). The strength of a generative AI system therefore depends not only on the corpus it was trained on but also on the number of parameters it can handle for "remembering" complex relationships within that corpus (e.g. co-occurrence of words across sentences and paragraphs). This is why you often hear the size of a model described in "billions of parameters".

Most of us already know this much, as these have been the buzzwords of the last few years.

In general, a generative AI system is one that learns to generate more objects that look like the data it was trained on, and the higher the parameter count, the more its outputs resemble objects from real life (like a well-written abstract).
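To make "learning patterns and generating new objects" concrete, here is a toy sketch (my own illustration, not from any production system): a bigram model that counts which word follows which in a tiny corpus and then samples new text. Real generative models replace these counts with billions of learned parameters, but the principle of generating from learned co-occurrences is the same.

```python
import random
from collections import defaultdict

# A tiny "training corpus"; every sentence ends with a "." token.
corpus = ("the protein folds into a structure . "
          "the model predicts a structure . "
          "the model folds a protein .").split()

# "Training": remember which words were observed to follow which.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

# "Generation": start from a seed word and repeatedly sample a next word
# from the observed co-occurrences.
random.seed(0)
word, out = "the", ["the"]
for _ in range(6):
    word = random.choice(follows[word])
    out.append(word)
print(" ".join(out))  # new text that locally resembles the corpus
```

Every adjacent word pair in the output was seen during "training", which is exactly the sense in which the generated object "looks like" the data.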

Beyond the basics: Diving deep into deep learning techniques

However, there are four subtypes within this category that define how the training and generation is done internally. They are:

  1. Transformer-based models (e.g. Generative Pre-trained Transformers, or GPTs)
  2. Diffusion models (e.g. DALL-E)
  3. Generative Adversarial networks or GANs
  4. Variational Auto Encoders or VAEs

Each model varies based on the framework in which it was trained and excels at certain tasks.

Generative AI in the context of Machine learning

This article gives a quick Level 1 overview of these techniques and shows where in lifesciences each can be of benefit. For details on what deep learning and neural networks mean, refer to the Appendix at the bottom.


1. Transformer based models

What are they?

Transformers, such as GPT-3 and GPT-4, are designed primarily for natural language processing. They excel at understanding context and generating coherent text, making them ideal for applications like chatbots, content creation, and translation. Transformers also popularized a notion called attention, which enables models to track the connections between words across pages, chapters and books rather than just within individual sentences. (Read: the original seminal paper that introduced transformers was interestingly titled "Attention is all you need".)

Large Language Models (LLMs) fall under this category, and the name comes precisely from the billions of parameters they handle during training. While predominantly popular for language generation, transformers can also be used to generate proteins, chemicals and code (as in text2sql).

Transformers are unique because of the self-attention mechanism, which processes all parts of the input simultaneously, allowing them to capture relationships between distant elements in the data. This architecture is particularly effective for sequential data like text.
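As a rough sketch of what self-attention computes, here is a toy single-head version in NumPy (random weights purely for illustration; real transformers learn these matrices and add multiple heads, masking and positional encodings):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (toy sketch).

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices.
    Every position attends to every other position simultaneously, which is
    how distant elements in the sequence can influence each other.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # all pairwise similarities at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # weighted mix of all value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per input position
```

The key point is that `scores` is computed for every pair of positions in one matrix product, rather than step by step as in older recurrent models.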

Popular Examples in general

  • In general, transformer architectures excel at tasks that involve remembering and tracking correlations in the data and connections they have seen during training. The most popular models in this space are the ones we are most familiar with: GPT-4, Llama 2, Mistral, Gemini, etc.
  • The most common use cases include chatbots, code generation and content authoring across many scenarios (emails, blogs, summaries). Most of these models can now also handle multimodal data and can be used in conjunction with other types of Gen AI techniques to bolster performance.

In Lifesciences: generation of protein structures and beyond- what is different?


Image Copyright: https://www.nobelprize.org/uploads/2024/10/fig1_ke_en_24_A.pdf

  1. With the recent Nobel Prize in Chemistry awarded to the developers of AlphaFold for their work in accelerating protein structure prediction, the importance of transformer-based models and their impact on the industry stands paramount. AlphaFold, a transformer-based model, was developed by DeepMind and first presented to the world in 2020. Prior to AlphaFold, determining protein structures was an intense and time-consuming process that took weeks in the lab, leveraging techniques such as X-ray crystallography, nuclear magnetic resonance and electron cryomicroscopy (cryo-EM). For a deeper view into life before and after AlphaFold, here is a great review article on the topic and its transformative impact.
  2. Since AlphaFold is designed for a specialized task in computational biology and not for natural language tasks, it isn't considered an LLM in the traditional sense like BERT or GPT-4. However, it does share some similarities with LLMs, because both rely on transformer architectures and attention mechanisms to process sequences—in AlphaFold's case, to predict the 3D structure of proteins from their amino acid sequences instead of words or sentences. Developing such a model is no ordinary feat: the training involved about 170,000 protein structures from the Protein Data Bank (PDB) and hundreds of millions of protein sequences from large sequence databases such as UniProt, which contains sequences without known structures.

In a simplistic sense, within AlphaFold's architecture transformers capture both sequence-based and structure-based relationships, utilizing the attention mechanism to learn how residues in the protein sequence interact spatially and structurally. This approach, combined with other innovations, has resulted in AlphaFold's remarkable accuracy in protein structure prediction.

  3. Besides protein structures, traditional text-based generative AI models have been used in a number of other content generation use cases, including regulatory content authoring to help with submissions: clinical study report summaries, clinical protocols, MLR-compliant documents, etc. In most cases, creating such nuanced documents requires fine-tuning general-purpose models for higher accuracy, but we will cover that in a follow-up blog. Figure 2 in the Appendix lists the models and their implementation patterns.


2. Diffusion models

What are they?

Diffusion models generate new data by reversing a diffusion process, i.e., the information loss caused by added noise. The main idea is to add random noise to data and then undo the process to recover the original data distribution from the noisy data. This approach generally works well for generating multimodal data like images, video and audio.

In very simplistic terms, a diffusion model learns through a two-step training process: step 1 adds noise to the data, and step 2 removes the added noise (denoises it).

This way, the model learns how to construct the data, and it is extremely powerful, especially in image generation use cases. Diffusion models are employed in various applications, including text-to-image generation (as seen in models like DALL-E 2 and Midjourney) and other complex generative tasks.
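The two-step idea can be illustrated with the forward-noising formula used in DDPM-style diffusion (a sketch, not a full implementation: here the true noise is reused for the reverse step, whereas a real model would have to learn to predict it from the noisy input):

```python
import numpy as np

# Toy forward/reverse diffusion on a 1-D "image".
rng = np.random.default_rng(42)
x0 = np.linspace(-1.0, 1.0, 8)           # clean data

betas = np.linspace(1e-4, 0.2, 50)       # noise schedule over 50 timesteps
alpha_bar = np.cumprod(1.0 - betas)      # cumulative fraction of signal retained

# Step 1 (forward): corrupt x0 with Gaussian noise at timestep t.
t = 49
eps = rng.normal(size=x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Step 2 (reverse): a trained network would predict eps from (xt, t);
# here we plug in the true eps to show how denoising recovers the data.
x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])

print(np.allclose(x0_hat, x0))  # True: perfect noise prediction means perfect recovery
```

Training a real diffusion model amounts to making the network's noise prediction as close as possible to `eps`, so that this algebraic inversion works on unseen data.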

While traditional diffusion models are not inherently transformer-based, recent innovations have successfully combined them with transformer architectures to enhance their capabilities in generating high-quality images and other data types.

Popular examples

DALL-E, the popular image generation model from OpenAI, leverages the diffusion technique to generate images from text prompts, allowing users to create unique visual content based on their descriptions.

In Lifesciences: Beyond generating images and videos

Besides image generation, there are two examples of how this comes to life within lifesciences and both are within the field of drug discovery.

  1. Protein-molecule binding with DiffDock: DiffDock is a cutting-edge diffusion generative model specifically designed for molecular docking, a critical task in drug discovery that involves predicting how small-molecule ligands bind to proteins. In the past this was done with traditional computational approaches such as Schrödinger's FEP. Whether DiffDock will scale to real applications remains to be seen. Nvidia's BioNeMo framework offers DiffDock at enterprise scale, and it will be interesting to compare it with traditional methods.
  2. Predicting novel compounds with LDM: Recently, Terray Therapeutics published a paper about their proprietary algorithm for designing new chemicals using a diffusion framework called latent diffusion. As described in the paper, some critical properties of a drug, like the pharmacokinetic features determining bioavailability, are significantly more difficult to gather data for than others. This makes it difficult for pure transformer-based models to predict a drug candidate with chemical properties that are grounded in reality rather than a product of hallucinations. Terray has used a combination of an autoencoder and diffusion models to design de novo chemical compounds that fit specific needs. For the avid machine learning enthusiast, here is the detailed paper describing it.


3. Generative Adversarial networks (GANs)

What are they?

Generative Adversarial Networks, popularly called GANs, consist of two neural networks: a generator that creates new data and a discriminator that evaluates its authenticity. This adversarial process helps GANs produce high-quality images, videos, and other multimedia artifacts by learning a dataset's features. The generator aims to produce data indistinguishable from real samples, while the discriminator's goal is to differentiate between real and generated data. This adversarial training allows the generator to produce increasingly realistic data over time.
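As an illustrative sketch (not any production implementation), the adversarial loop can be shown on a 1-D toy problem where the generator learns to mimic samples from a Gaussian. Gradients are derived by hand to keep it dependency-light; real GANs use deep networks and an autograd framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Generator: x_fake = mu + exp(log_sigma) * z, starting far from the real data.
mu, log_sigma = 0.0, 0.0
# Discriminator: D(x) = sigmoid(w * x + c), a logistic classifier.
w, c = 0.1, 0.0
lr = 0.01

for step in range(3000):
    real = rng.normal(4.0, 1.0, size=64)          # real samples ~ N(4, 1)
    z = rng.normal(size=64)
    fake = mu + np.exp(log_sigma) * z

    # Discriminator ascent: push D(real) -> 1 and D(fake) -> 0.
    s_real, s_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * np.mean((1 - s_real) * real - s_fake * fake)
    c += lr * np.mean((1 - s_real) - s_fake)

    # Generator ascent: push D(fake) -> 1, i.e. fool the discriminator.
    s_fake = sigmoid(w * fake + c)
    mu += lr * np.mean((1 - s_fake) * w)
    log_sigma += lr * np.mean((1 - s_fake) * w * np.exp(log_sigma) * z)

print(round(float(mu), 1))  # should have drifted toward the real mean of 4
```

The two alternating updates are the whole idea: each network's improvement makes the other's task harder, which is what drives the generator toward realistic output.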

Popular uses

GANs are primarily used to generate non-textual content or to augment data when only a few data points are available, while transformers excel at text and sequential data processing.

Specific implementations of GANs, such as TimeGAN, focus on generating synthetic time-series data. These models account for time-related changes (temporal variations) and correlations in the data, making them useful for applications like financial modeling and supply chain planning.

In Lifesciences

  1. Synthetic control arms for real-world analytics: Personal health and behavior data are valuable for health research but are often inaccessible due to privacy concerns and legal restrictions. Early access to data samples is needed for secondary analysis, and one solution is to generate synthetic data that mirrors real data in structure and statistics, thus circumventing the need for privacy and legal bureaucracy. A key challenge in generating synthetic data sets is to ensure that the synthetic data maintains the real-world correlations found in real patient data sets, e.g. positive correlations between daily physical activity and mobility. Another key aspect is to ensure that imbalanced datasets, like rare disease populations where there is not enough original data, are also handled in a statistically relevant manner for augmentation. The research paper leverages a variation of GAN for tabular data synthesis called the DP-CGANS framework (Differentially Private Conditional Generative Adversarial Networks).
  2. Synthesizing medical images: Here is a brilliant video illustrating the use of GANs for synthesizing medical images. As mentioned before, GANs are very effective at augmenting high-dimensional data like images, which makes them an ideal candidate for generating images like chest X-rays and even for format conversions (e.g. from TIFF to PNG). GANs also come with hallucinations, and therefore, while images can be synthesized for exploratory training needs, using them for image conversion may come with its own challenges.
  3. Privacy preservation across the enterprise: Synthetic data in general is used for training machine learning models in lieu of real data, either when real data is sparse or when you do not want extended teams performing model training and inference on real, identifiable data, for the sake of privacy protection. These use cases extend across the pharma value chain, from supply chain to commercial next best actions. The aim is to train ML models on synthetic data before setting up inference on the actual production version, thus limiting access to production data.


4. Variational Autoencoders (VAEs)

What are they?

Variational Autoencoders (VAEs), like the original transformer, follow an encoder-decoder design: an encoder compresses input data into a latent space, and a decoder reconstructs the original data from this latent representation. While in GANs the generator and discriminator are trained simultaneously as adversaries, here the encoder and decoder are trained together to reconstruct the input. This architecture allows VAEs to generate new data that resembles the training data, making them a powerful tool for generative tasks. A lot of the embedding creation we talked about in our previous newsletter used some form of autoencoder technique. While the encoder can be used to create embeddings, the decoder can be leveraged for many tasks, like de novo sequence generation of a protein or a small molecule, or even a new image, depending on the nature of the training data.
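A minimal sketch of the encode, sample, decode pipeline (untrained random weights, my own toy illustration) shows the reparameterization trick and the two terms of the VAE training loss:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_latent = 16, 2

# Toy linear "networks"; a real VAE uses learned nonlinear layers.
W_enc = rng.normal(scale=0.1, size=(d_in, 2 * d_latent))   # outputs [mu, log_var]
W_dec = rng.normal(scale=0.1, size=(d_latent, d_in))

def encode(x):
    h = x @ W_enc
    return h[:d_latent], h[d_latent:]                      # mean and log-variance of q(z|x)

def reparameterize(mu, log_var):
    # Sampling as mu + sigma * eps keeps the sampling step differentiable
    # in real autograd frameworks.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    return z @ W_dec

x = rng.normal(size=d_in)                                  # one input object
mu, log_var = encode(x)                                    # compress into latent space
z = reparameterize(mu, log_var)                            # sample a latent code
x_hat = decode(z)                                          # reconstruct from the code

recon_loss = np.mean((x - x_hat) ** 2)                     # reconstruction term
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))  # KL(q(z|x) || N(0, I))
print(z.shape, x_hat.shape)  # (2,) (16,)
```

Training minimizes `recon_loss + kl`; once trained, feeding new latent codes to `decode` is what generates new objects, whether text embeddings, molecules or images.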

Popular uses

While autoencoders are used similarly to GANs for generating synthetic data or images where the underlying data is sparse or rare, it is easier to train a VAE than a GAN thanks to its less complex architecture. A significant challenge with GANs is mode collapse, where the generator produces a limited variety of outputs instead of capturing the full diversity of the training data. This occurs when the generator finds a shortcut to fool the discriminator, leading to repetitive or similar outputs. VAEs are therefore the more stable choice for such needs. As delineated in the previous newsletter, VAEs are also used for embedding creation, as they can collapse a complex object into a latent space where the underlying structure of the data becomes important.

In Lifesciences

  1. VAEs for drug discovery: Besides their use as encoders for embeddings, VAEs have been considered for drug discovery tasks, especially for modeling drug-protein interactions. While I have not found exhaustive articles or come across solutions implementing this in a full-fledged manner in life sciences, there seem to be many exploratory areas considering VAEs for drug discovery tasks.
  2. In this article, researchers recommend combining VAEs with attention mechanisms and convolutional neural networks (CNNs) to enhance predictive accuracy in drug-protein interaction (DPI) prediction. This integration allows VAEs to extract meaningful features from both drugs and proteins, improving overall model performance.


In conclusion, I hope this introduction to generative AI models and their life sciences applications has opened your eyes to the diverse nature of Generative AI and its real-world impact, and that this blog sparked your curiosity and expanded your understanding!

"Attention is all you need" was the title of the original paper that led to the evolution of Generative AI and transformers - so stay tuned & "attentive" as we will return with more insights in our next deep dive on AI's transformative role in life sciences!


Appendix

Figure 1: Defining key terms for Gen AI



Figure 2: Transformer models and use in Lifesciences


Interesting Reads

https://deepmind.google/discover/blog/demis-hassabis-john-jumper-awarded-nobel-prize-in-chemistry/

https://www.dhirubhai.net/pulse/harnessing-ai-embeddings-new-era-life-science-its-gopalakrishnan-pfaue/?trackingId=Bd6Jwd%2F%2BzETApeF8zxbADQ%3D%3D

https://arxiv.org/abs/1706.03762

https://pyimagesearch.com/2021/09/13/intro-to-generative-adversarial-networks-gans/


