Using Chat-GPT to Generate Structured Biological Knowledge
Generated with DALL-E2

After my previous post on using Chat-GPT to explain biological findings, I was interested in digging in a bit more. Specifically, I wanted to explore the reverse idea - rather than using Chat-GPT to provide interpretation of structured data, could I flip things around and use the tool to generate core lists for further analysis?


Anyone who has spent time in informatics is familiar with vendor-provided tools like Qiagen's Ingenuity IPA or Clarivate's Metacore. These are great tools for pathway analysis and interpretation, but they rely on a core database of discoveries curated from the literature: protein A binds protein B, protein B phosphorylates protein C, and so on. Text mining approaches are sometimes used to supplement these curated relationships (or, in some tools, replace them entirely), but I was wondering how Chat-GPT, a fairly general tool, would perform at these tasks.


The results were…mixed. This was a case where I found that the OpenAI API models did not do nearly as well as Chat-GPT. Sometimes they would recover ligands for a receptor; sometimes they would just dump out lists of similar receptors. I found that Chat-GPT itself was more convincing (note that the original output was a table, which is not supported in this post format):
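
As a rough illustration - not the exact prompts I used - a query against the pre-chat completions endpoint looked something like the sketch below. The model name, prompt wording, and temperature are illustrative choices:

```python
# Rough sketch of a completions-endpoint query for receptor binders.
# Model name, prompt wording, and temperature are illustrative choices.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "List proteins that bind to the human receptor CCR1. "
    "Return one interaction per line as: gene symbol; relation type."
)

response = openai.Completion.create(
    model="text-davinci-003",  # one of the API models available at the time
    prompt=prompt,
    max_tokens=256,
    temperature=0,             # keep the output as deterministic as possible
)

print(response["choices"][0]["text"])
```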

What are proteins that bind to the protein CCR1?

  • Entrez Gene; Gene Symbol; Relation Type
  • CXCL12; CXCL12; Ligand
  • CCL5; CCL5; Ligand
  • CCL3; CCL3; Ligand
  • CCL4; CCL4; Ligand
  • CXCL4; PF4; Ligand
  • CXCL8; IL8; Ligand
  • CCL15; CCL15; Ligand
  • CCR1; CCR1; Homodimerization
  • G protein subunits; GNAI2, GNAI3, GNAQ; Signal transduction
  • β-Arrestin; ARRB1, ARRB2; Signal transduction

That is a nice result (leaving aside the headers of "entrez gene" and "gene symbol" not being 100% accurate)…but without an API, I'm not exactly going to build my own interactome for all the protein-coding genes. Worth revisiting once API access to Chat-GPT becomes available.
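
If that access does arrive, the loop itself would be simple enough. The sketch below assumes a chat-style endpoint and uses placeholder model, prompt, and rate-limiting choices; it is only meant to show the shape of the "build your own interactome" idea:

```python
# Hypothetical loop over protein-coding genes, assuming a chat-style endpoint
# becomes available. Model name, prompt, and rate limiting are placeholders.
import time
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

genes = ["CCR1", "CCR2", "CXCR4"]  # in practice: all ~20,000 protein-coding genes
interactome = {}

for gene in genes:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # stand-in for whatever Chat-GPT model gets exposed
        messages=[{
            "role": "user",
            "content": (
                f"What proteins bind to {gene}? "
                "Answer as 'gene symbol; relation type', one per line."
            ),
        }],
        temperature=0,
    )
    interactome[gene] = response["choices"][0]["message"]["content"]
    time.sleep(1)  # crude rate limiting to stay within API quotas

print(interactome["CCR1"])
```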


Emboldened by this, I attempted another query - what if I could build a directory of disease-relevant animal models?


What are specific mouse models for multiple sclerosis, displayed as a table with columns for the model name, phenotype, and the reference?

  • Model Name; Phenotype; Reference
  • Experimental autoimmune encephalomyelitis (EAE); Paralysis, demyelination, and inflammation in the central nervous system; Stromnes and Goverman, 2006 (PMID: 16815317)
  • Theiler's murine encephalomyelitis virus-induced demyelinating disease (TMEV-IDD); Demyelination and inflammation in the central nervous system; Lipton and Miller, 2005 (PMID: 15872086)
  • Cuprizone-induced demyelination model; Demyelination in the central nervous system; Matsushima and Morellini, 2001 (PMID: 11676928)
  • Chronic relapsing experimental autoimmune encephalomyelitis (CREAE); Relapsing-remitting paralysis, demyelination, and inflammation in the central nervous system; Lyons et al., 1999 (PMID: 10207173)
  • Spontaneous autoimmune encephalomyelitis (S-AE); Paralysis, demyelination, and inflammation in the central nervous system; Huseby et al., 2001 (PMID: 11514607)
  • Myelin oligodendrocyte glycoprotein (MOG)-induced experimental autoimmune encephalomyelitis (EAE); Paralysis, demyelination, and inflammation in the central nervous system; Bettelli et al., 2003 (PMID: 12914783)
  • Proteolipid protein (PLP)-induced experimental autoimmune encephalomyelitis (EAE); Paralysis, demyelination, and inflammation in the central nervous system; Sun et al., 2001 (PMID: 11356892)
  • T cell receptor transgenic (TCR-Tg) mice; Paralysis, demyelination, and inflammation in the central nervous system; Anderson et al., 2003 (PMID: 12829604)
  • Inducible costimulator (ICOS)-deficient mice; Reduced severity of paralysis, demyelination, and inflammation in the central nervous system; McAdam et al., 2001 (PMID: 11546870)
  • Interleukin-7 receptor alpha (IL-7Rα)-deficient mice; Reduced number of T cells and less severe paralysis, demyelination, and inflammation in the central nervous system; Peschon et al., 1994 (PMID: 8001160)


Incredible! A list of models, some entirely new to me (although I don't work on MS as an indication, so maybe that isn't a surprise…), along with a rough description of each phenotype that could guide model selection, and a reference.


But wait…what if we actually have a look at one of these references? Peschon et al. did actually write a paper in 1994 about IL-7Rα-deficient mice, but it is cataloged under PMID 7964471. The PMID cited in the table points to a totally different 1994 paper that has some connection with neurons, but is not the reference that was presented. The ICOS model's reference is even farther off - that PMID points to a paper analyzing the size of U.S. businesses.
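
Spot-checking these references is easy enough to script. The sketch below pulls record summaries from PubMed via the NCBI E-utilities esummary endpoint so that the cited PMID and the likely correct one can be compared side by side:

```python
# Sanity-check the PMIDs Chat-GPT produced by pulling record summaries
# from PubMed via the NCBI E-utilities esummary endpoint (JSON format).
import requests

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pubmed_summary(pmid: str) -> str:
    resp = requests.get(
        ESUMMARY,
        params={"db": "pubmed", "id": pmid, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    rec = resp.json()["result"][pmid]
    first_author = rec["authors"][0]["name"] if rec["authors"] else "unknown"
    return f"{pmid}: {first_author} et al., {rec['pubdate']} - {rec['title']}"

# The PMID cited in the table vs. the one that actually matches Peschon et al. 1994
for pmid in ["8001160", "7964471"]:
    print(pubmed_summary(pmid))
```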


Maybe it is not surprising that models trained largely on narrative text do a bit better at extracting descriptions than at recalling precise references, but it is definitely a cause for caution. The kinds of errors that creep in with Chat-GPT look very convincing. I think that for the immediate future, for scientific work, Chat-GPT might work more effectively when primed directly with the relevant facts (which it can then reformat), rather than for direct question answering - unless the asker is prepared to carefully confirm the results.
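
As a toy illustration of that "prime with facts, then reformat" pattern, the sketch below supplies facts drawn from the table above in the prompt itself and only asks the model to restructure them; the wording is an assumption, not a tested recipe:

```python
# Toy "prime with facts, then reformat" prompt: the facts are supplied in the
# prompt (drawn from the table above), and the model is only asked to
# restructure them, not to recall models or references on its own.
facts = """
Cuprizone-induced demyelination model: demyelination in the central nervous system
(Matsushima and Morellini, 2001).
IL-7 receptor alpha-deficient mice: reduced T cell numbers and less severe disease
(Peschon et al., 1994, PMID 7964471).
"""

prompt = (
    "Using only the facts below, produce a table with columns "
    "'Model', 'Phenotype', and 'Reference'. Do not add models or references "
    "that are not listed.\n" + facts
)
# ...then send `prompt` to whichever model is being evaluated, as in the sketches above.
```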

Dr Raminderpal Singh

AI and GenAI Evangelist | Startup Advisor

1y

Helpful. Thanks!

Jon Hill

Experienced Data Scientist Focused on Computational Biology

1y

Addendum: Trying the same query with Perplexity.ai on mouse models returned well-referenced hits; just fewer in number and without the possibility of tabular formatting.

Mikael Hvidtfeldt Christensen

Scientific Software - Drug Discovery, Machine Learning, Bioinformatics

1y

I much agree. Using GPT-3 models directly for answering biomedical questions will likely be somewhat inaccurate and will lack traceability to the source. But notice that Microsoft is taking a slightly different approach with their Bing Chat integration. Here they are retrieving external content, which is then used as context when priming GPT-3 (the Microsoft "Prometheus" integration model). This means their content is always up to date - and they can provide real links to the sources used when answering the question - lowering the risk of providing false information. See the example below, where I repeated your query using Bing Chat - the references are correct and based on the latest published papers rather than a frozen GPT model.

  • [Image attachment: Bing Chat example, no alt text provided]
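
A generic sketch of the retrieve-then-prime pattern described in the comment above; the search and completion functions here are placeholders rather than Bing's actual Prometheus integration:

```python
# Generic retrieve-then-prime sketch: fetch relevant passages first, then hand
# them to the model as grounding context. `search` and `complete` are
# placeholder callables, not Bing's actual "Prometheus" integration.
def answer_with_sources(question: str, search, complete) -> str:
    passages = search(question, top_k=3)  # e.g. a web or literature search
    context = "\n\n".join(p["text"] for p in passages)
    sources = "\n".join(p["url"] for p in passages)
    prompt = (
        "Answer the question using only the context below, and do not state "
        f"anything that is not supported by it.\n\nContext:\n{context}\n\n"
        f"Question: {question}"
    )
    return complete(prompt) + "\n\nSources:\n" + sources
```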
George Okafo, Ph.D., BSc., ARCS

Program Director @ Boehringer Ingelheim | Healthcare Data and Analytics

1y

Fascinating, thanks for sharing this.
