A New One-stop LLM for Chemical and Biomedical Tasks
In a new paper, researchers from Insilico Medicine, a clinical-stage artificial intelligence (AI)-driven drug discovery company, in collaboration with NVIDIA, presented nach0, a new large language model (LLM) transformer for solving biological and chemical tasks.
The multi-domain, multi-task LLM was trained on a diverse set of tasks, including natural language understanding, synthetic route prediction, and molecular generation, and works across domains to answer biomedical questions and generate new molecules.
The findings were published in the journal Chemical Science.
While there are other LLMs designed for biomedical discovery, including BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) and SciFive, these models rely mainly on biomedical natural language texts, which mention drugs, genes, and cell lines by name but do not contain chemical structure descriptions.
Models that do combine text with chemical structure descriptions, such as Galactica, have not yet been trained on diverse chemical tasks.
Nach0 seeks to bridge this gap for the first time. It draws on a dataset that includes abstracts extracted from PubMed and chemistry-related patent descriptions from the U.S. Patent and Trademark Office (100 million documents that yielded 355 million tokens of abstract text and 2.9 billion tokens of patent text), as well as molecular structures written in the simplified molecular-input line-entry system (SMILES).
To train the system, researchers turned this chemical information into tokens as well (4.7 billion of them) and then annotated these tokens with special symbols.
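To make the tokenization step more concrete, here is a minimal Python sketch of splitting a SMILES string into tokens and tagging each one with a special marker. The regular expression is a commonly used SMILES tokenization pattern, and the `<sm_...>` wrapper and both function names are assumptions made for illustration; the article does not specify the exact symbols or code nach0 uses.

```python
import re

# A widely used regex for splitting SMILES into atom/bond/ring tokens.
# Bracket atoms such as [Na+] stay whole; two-letter halogens (Cl, Br)
# are matched before their one-letter prefixes.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def annotate(tokens, prefix="<sm_", suffix=">"):
    """Wrap each chemical token in a marker (hypothetical format) so it
    cannot be confused with ordinary text tokens such as the letter 'C'."""
    return [f"{prefix}{tok}{suffix}" for tok in tokens]


if __name__ == "__main__":
    # Ethanol: prints ['<sm_C>', '<sm_C>', '<sm_O>']
    print(annotate(tokenize_smiles("CCO")))
```

Marking chemical tokens this way keeps the two vocabularies apart, so a model sees the carbon atom "C" and the letter "C" in an English sentence as different symbols.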
Using this dataset, researchers trained nach0 to perform three key tasks: natural language processing, such as document classification and question answering; chemistry-related tasks, such as molecular property prediction, molecular generation, and reagent prediction; and cross-domain tasks, including description-guided molecule design and molecular description generation.
“Nach0 represents a step forward in automating drug discovery through natural language prompts,” says Alex Zhavoronkov, PhD, founder and CEO of Insilico Medicine. “In the future, we foresee the potential inclusion of protein sequences with their own special tokens as well as fine-tuning the model in order to accommodate new modalities and exploring the fusion of information from text and knowledge graphs.”
Nach0 is built on the NVIDIA BioNeMo generative AI platform, which enables training and scaling of drug discovery applications. Specifically, the training was performed using NVIDIA NeMo, an end-to-end platform for developing custom generative AI. The research team leveraged the platform's NLP capabilities to train and evaluate the new language models. NVIDIA's memory-mapped data loader modules allowed researchers to manage large datasets with a small memory footprint and fast read speeds.
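For readers unfamiliar with memory-mapped data loading, the general idea can be sketched with Python's standard library. This is not NeMo's API: the file layout, the `write_corpus` helper, and the `MemmapCorpus` class are all hypothetical, and the sketch only shows why memory mapping keeps the footprint small. The operating system pages in just the bytes that are actually read.

```python
import mmap
import os
import struct
import tempfile

BYTES_PER_TOKEN = 4  # each token id is stored as a 32-bit little-endian int


def write_corpus(docs, data_path):
    """Write token-id lists back to back; return cumulative token offsets."""
    offsets = [0]
    with open(data_path, "wb") as f:
        for doc in docs:
            f.write(struct.pack(f"<{len(doc)}i", *doc))
            offsets.append(offsets[-1] + len(doc))
    return offsets


class MemmapCorpus:
    """Random access to documents without loading the whole file into RAM."""

    def __init__(self, data_path, offsets):
        self._file = open(data_path, "rb")
        # Length 0 maps the entire file; pages are loaded lazily on access.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        start, end = self._offsets[i], self._offsets[i + 1]
        fmt = f"<{end - start}i"
        return list(struct.unpack_from(fmt, self._map, start * BYTES_PER_TOKEN))

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
    offsets = write_corpus([[1, 2, 3], [4, 5], [6]], path)
    corpus = MemmapCorpus(path, offsets)
    print(corpus[1])  # prints [4, 5]
    corpus.close()
```

Only the small offsets index lives in memory; the token data stays on disk until a specific document is requested, which is what makes multi-billion-token corpora manageable on ordinary hardware.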
Measured against other LLMs used for biomedical understanding, such as FLAN, SciFive, and MolT5, nach0 was found to have distinct advantages when performing molecular tasks using molecular data, and it significantly outperformed ChatGPT.
Researchers tested nach0’s capabilities in two case studies.
The first was to generate molecules that could be effective against diabetes mellitus. Researchers entered the prompt:
“discover biological targets with potential therapeutic activity, analyze the mechanism of action, generate molecular structure, propose one-step synthesis, and predict molecular properties.”
They generated 200 SMILES strings from the molecule generation prompt and, drawing on expert chemical knowledge, selected one structure as the most promising.
They also applied nach0 to a case study used as a demo for Insilico's Chemistry42 generative AI drug design platform. The model returned eight molecules satisfying the prompt, taking just 15 minutes for generation and 30 minutes for scoring in Chemistry42.
“We anticipate that as nach0 evolves, it will require less supervision, and it will be able to simply generate and validate promising therapeutic options for medicinal chemists,” says Maksim Kuznetsov, a senior research scientist at Insilico and one of the paper’s lead authors.
---
Welcome to my newsletter, "Where Technology Meets Biology"!
Here, I am sharing noteworthy news, trends, biotech startup picks, industry analyses, and interviews with pharma KOLs. Contact me for consulting or sponsorship opportunities at www.BiopharmaTrend.com.
Enjoying the newsletter? Subscribe to become part of the 15K+ readers here on LinkedIn. Please help us spread the word by sharing it with your colleagues and friends.
Also, consider joining my Substack community, where we are exploring a lot more (5.5K+ industry professionals are reading it via weekly email).
-- Andrii