Using Chat-GPT to Generate Structured Biological Knowledge
After my previous post on using Chat-GPT to explain biological findings, I was interested in digging in a bit more. Specifically, I wanted to explore the reverse idea: rather than using Chat-GPT to interpret structured data, could I flip things around and use the tool to generate core lists for further analysis?
Anyone who has spent time in informatics is familiar with vendor-provided tools like Qiagen’s Ingenuity IPA or Clarivate's MetaCore. These are great tools for pathway analysis and interpretation, but they rely on a core database of discoveries curated from the literature: protein A binds protein B, protein B phosphorylates protein C, and so on. Text-mining approaches are sometimes used to supplement these databases (or, in some tools, replace them entirely), but I was wondering how Chat-GPT, a fairly general tool, would perform at these tasks.
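At their core, those curated databases are just large collections of subject–relation–object triples that can be queried in either direction. A minimal sketch of that idea (the triples below are illustrative placeholders, not real curated content):

```python
# (subject, relation, object) triples of the kind a curation team extracts
# from the literature. These examples are placeholders for illustration.
TRIPLES = [
    ("ProteinA", "binds", "ProteinB"),
    ("ProteinB", "phosphorylates", "ProteinC"),
    ("ProteinD", "binds", "ProteinB"),
]

def partners(protein, relation, triples=TRIPLES):
    """Return every protein linked to `protein` by `relation`, in either direction."""
    found = set()
    for subj, rel, obj in triples:
        if rel != relation:
            continue
        if subj == protein:
            found.add(obj)
        elif obj == protein:
            found.add(subj)
    return sorted(found)

print(partners("ProteinB", "binds"))  # -> ['ProteinA', 'ProteinD']
```

The question explored below is essentially whether Chat-GPT can stand in for the curated `TRIPLES` list.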
The results were… mixed. This was a case where the OpenAI API models did not do nearly as well as Chat-GPT itself. Sometimes they would recover ligands for a receptor; sometimes they would just dump out lists of similar receptors. Chat-GPT's own answers were more convincing (note that the original output was a table, which is not supported in this post format):
What are proteins that bind to the protein CCR1?
That is a nice result (leaving aside that the "entrez gene and gene symbol" headers are not 100% accurate)… but without an API, I’m not going to build my own interactome for all the protein-coding genes by hand. Worth revisiting once API access to Chat-GPT becomes available.
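If that API access does arrive, the interactome-building loop itself would be simple. A minimal sketch, where `ask_model` is a hypothetical stand-in for whatever chat endpoint becomes available, stubbed here with a canned reply so the surrounding logic can be shown:

```python
def ask_model(prompt):
    # Hypothetical placeholder: a real implementation would call the chat
    # API and return its text response. The canned reply below is fixed
    # purely so the parsing logic can be exercised.
    return "CCL3, CCL5, CCL7"

def binders_for(receptor):
    """Ask the model for binding partners and parse a comma-separated reply."""
    reply = ask_model(f"What are proteins that bind to the protein {receptor}?")
    return [token.strip() for token in reply.split(",") if token.strip()]

def build_interactome(receptors):
    """Map each receptor symbol to its model-reported binding partners."""
    return {receptor: binders_for(receptor) for receptor in receptors}

print(build_interactome(["CCR1"]))  # -> {'CCR1': ['CCL3', 'CCL5', 'CCL7']}
```

The hard part, of course, is not the loop but validating what the model returns for each gene.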
Emboldened by this, I attempted another query: what if I could build a directory of disease-relevant animal models?
What are specific mouse models for multiple sclerosis, displayed as a table with columns for the model name, phenotype, and the reference?
Incredible! A list of models, some entirely new to me (although I don’t work on MS as an indication, so maybe that isn’t a surprise…), along with a rough description of each phenotype that could guide model selection, and a reference.
But wait… what if we actually have a look at one of these references? Peschon et al. did write a paper in 1994 about IL-7Rα-deficient mice, but it is cataloged under PMID 7964471. The PMID given in the table points to a totally different 1994 paper, one with some connection to neurons, but not the reference that was presented. The ICOS model’s reference is even farther off: its PMID points to a paper analyzing the size of U.S. businesses.
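This kind of mismatch is mechanically checkable. NCBI's E-utilities `esummary` endpoint (`db=pubmed`, `retmode=json`) returns metadata for a PMID, and the record's title can be compared against what the citation claims to be. A sketch of the comparison step only, operating on an already-fetched payload (the sample below mimics the `esummary` JSON shape with a placeholder title, it is not a real PubMed record):

```python
def title_matches(esummary_json, pmid, expected_phrases):
    """Crude sanity check: does the PubMed record's title contain the expected phrases?"""
    record = esummary_json["result"][str(pmid)]
    title = record["title"].lower()
    return all(phrase.lower() in title for phrase in expected_phrases)

# Fabricated payload in the esummary JSON shape, for illustration only.
sample = {
    "result": {
        "7964471": {
            "title": "Placeholder: interleukin 7 receptor-deficient mice study"
        }
    }
}

print(title_matches(sample, 7964471, ["interleukin 7 receptor"]))  # -> True
print(title_matches(sample, 7964471, ["multiple sclerosis"]))      # -> False
```

A check this simple would have flagged both of the bad PMIDs above.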
Maybe it is not surprising that models trained largely on narrative text do better at extracting descriptions than at reproducing references, but it is definitely cause for caution: the errors that creep into Chat-GPT output look very convincing. For the immediate future, I think Chat-GPT will work more effectively for scientific work when primed directly with the relevant facts (which it can then reformat) rather than used for direct question answering, unless the asker is prepared to carefully confirm the results.
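The "prime with facts, then reformat" pattern is easy to sketch: the known facts travel inside the prompt, so the model only restructures them rather than recalling them. The prompt wording and the example facts below are illustrative:

```python
def priming_prompt(facts, instruction):
    """Build a prompt that supplies the facts up front and asks only for reformatting."""
    bullet_list = "\n".join(f"- {fact}" for fact in facts)
    return f"Using ONLY the facts below, {instruction}\n\nFacts:\n{bullet_list}"

facts = [
    "EAE is an induced mouse model of multiple sclerosis.",
    "The cuprizone model produces demyelination via toxin feeding.",
]
prompt = priming_prompt(
    facts, "produce a two-column table of model name and mechanism."
)
print(prompt)
```

Hallucinated references are much less likely this way, since the model is never asked to supply anything it was not given.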
Comments

AI and GenAI Evangelist | Startup Advisor: Helpful. Thanks!

Experienced Data Scientist Focused on Computational Biology: Addendum: trying the same mouse-model query with Perplexity.ai returned well-referenced hits, just fewer in number and without the option of tabular formatting.

Scientific Software - Drug Discovery, Machine Learning, Bioinformatics: I much agree. Using GPT-3 models directly for answering biomedical questions will likely be somewhat inaccurate and will lack traceability to the source. But notice that Microsoft is taking a slightly different approach with their Bing Chat integration: they retrieve external content, which is then used as context when priming GPT-3 (the Microsoft "Prometheus" integration model). This means their content is always up to date, and they can provide real links to the sources used when answering the question, lowering the risk of providing false information. See the example below, where I repeated your query using Bing Chat: the references are correct and based on the latest published papers, not a frozen GPT model.

Program Director @ Boehringer Ingelheim | Healthcare Data and Analytics: Fascinating, thanks for sharing this.