Gene-associated Disease Discovery using LLM

The paper Gene-associated Disease Discovery Powered by Large Language Models ([2401.09490], arxiv.org), published in January 2024, describes an LLM-based framework for discovering diseases associated with gene alterations.

The framework employs Large Language Models (LLMs), in this case GPT-4, to discover diseases associated with specific genes. It aims to automate the labor-intensive process of sifting through medical literature for evidence linking genetic variations to diseases, making disease identification more efficient. The approach uses LLMs to conduct literature searches, summarize relevant findings, and pinpoint diseases related to specific genes.

Current process

The physician usually searches the medical literature for evidence relevant to the genetic variations of interest, analyzes the evidence for each variation, and then identifies the disease the patient may have.


Depiction of the disease discovery process in clinical practice. It begins with the patient (A) visiting a clinic and undergoing genetic sequencing (B). The physician (C) then analyzes the sequencing results to pinpoint suspicious genetic variations. Subsequently, the physician searches databases or medical literature (D) for records pertinent to these specific genes (E). Finally, the potential disease related to these genes is identified (F). The paper's framework is designed to automate the labor-intensive steps from (D) to (F).

Current challenges

The task of sifting through literature for evidence is exceedingly laborious, given the potential existence of thousands of papers concerning a specific gene. The researcher is tasked with the meticulous job of pinpointing those documents that specifically contain insights demonstrating the association of the gene with a particular disease. This process demands significant time and attention to detail, as it involves discerning the most relevant and informative studies from a vast sea of academic research.

Proposed solution in this paper

The paper proposes a framework powered by LLMs for discovering diseases associated with specific genes. The framework conducts a literature search for the specified genes, summarizes the retrieved literature, and identifies diseases related to the input genes. With it, the extensive and complex process of retrieving and summarizing literature to identify potential diseases from specific genes can be significantly streamlined and automated.
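As a rough illustration of the literature search step, here is a minimal sketch that builds a PubMed search request using NCBI's public E-utilities ESearch endpoint (the paper's framework figure mentions the PubMed API, but does not show its exact queries). The function name, the `[Title/Abstract]` field restriction, and the example gene BRCA1 are my own assumptions for illustration, not details from the paper.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (public PubMed search API).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(gene: str, top_k: int = 10, sort: str = "relevance") -> str:
    """Build an ESearch URL asking PubMed for up to top_k papers about a gene.

    Hypothetical helper: the paper does not publish its query format.
    """
    if sort not in ("relevance", "pub_date"):
        raise ValueError("sort must be 'relevance' or 'pub_date'")
    params = {
        "db": "pubmed",
        # Assumption: restrict the search to titles and abstracts.
        "term": f"{gene}[Title/Abstract]",
        "retmax": top_k,    # number of paper IDs to return (the "top K")
        "sort": sort,       # rank by relevance or by publication date
        "retmode": "json",  # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative gene symbol only.
    print(build_pubmed_search_url("BRCA1", top_k=5))
```

The returned URL can be fetched with any HTTP client; the response lists PubMed IDs, which would then need a second call to retrieve abstracts before anything is passed to the LLM.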

Framework of the proposed method. The framework starts from specific genes suspected of causing the patient's disease. The PubMed API is then used to search the literature on these genes by criteria such as relevance or time. The top K papers are selected and queried by LLMs (e.g., GPT-4) using crafted prompts. During this phase, the LLMs analyze the content of the literature. Relevant diseases are identified and ranked through the in-context learning capabilities of the LLMs. This process is iterated several times, with diseases being re-ranked based on the frequency of their occurrence in the outputs.
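The final re-ranking step, where diseases are ordered by how often they occur across repeated LLM runs, can be sketched as follows. This is a minimal sketch under my own assumptions: the function name is hypothetical, each disease is counted at most once per run, names are matched case-insensitively, and ties are broken alphabetically. The paper does not spell out these details, and the disease names below are placeholders, not results.

```python
from collections import Counter


def rerank_by_frequency(runs):
    """Rank diseases by the number of runs in which they were mentioned.

    runs: list of lists, one list of disease names per LLM iteration.
    Returns (disease, count) pairs, most frequent first.
    """
    counts = Counter()
    for diseases in runs:
        # Assumption: normalize case and whitespace, count once per run.
        unique = {d.strip().lower() for d in diseases if d.strip()}
        counts.update(unique)
    # Sort by descending count, then alphabetically for stable ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder outputs from three hypothetical iterations.
    runs = [
        ["Disease A", "Disease B"],
        ["disease a", "Disease C"],
        ["Disease A ", "disease b"],
    ]
    for disease, count in rerank_by_frequency(runs):
        print(disease, count)
```

Counting per run rather than per mention keeps one verbose response from dominating the ranking, though either choice would fit the description in the caption.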

Paper review

  1. Gen AI could be used to reduce the time and burden of performing literature reviews in various research-related fields. This is one such example where Gen AI has proven useful.
  2. The paper mentions RAG (Retrieval-Augmented Generation) techniques but does not go into much detail on how retrieval was performed. Document parsing (cracking), chunking, vectorization, and vector store updates are important retrieval steps in RAG over unstructured data, and they deserve attention because they directly affect the accuracy of the results.
  3. LangChain was used for orchestration, but the authors have not specified which version. Care is needed when using LangChain, because a code injection vulnerability was reported for versions up to 0.0.131: CVE-2023-29374: LangChain Code Injection Vulnerability (vulert.com)
  4. Grounding is an important concept that is especially relevant to healthcare. It refers to the ability to connect model output to verifiable sources of information. By giving models access to specific data sources, grounding tethers their output to concrete data and reduces the chances of invented content. This matters most where accuracy and reliability are paramount, such as financial reporting and health reporting. The main benefits of grounding are reducing model hallucinations (instances where the model generates content that isn't factual), anchoring model responses (tying them to specific, verifiable data), and enhancing trustworthiness (ensuring the generated content aligns with reliable sources). A good article on the importance of grounding is Why Grounding of Generative AI Matters - by Hadas Bitran (substack.com). Credits: Hadas Bitran
  5. The experiment used GPT-4, one of the most capable gen AI models. It would be worth testing whether similar results can be achieved with an advanced SLM (Small Language Model) such as Phi-2 (Phi-2: The surprising power of small language models - Microsoft Research), a 2.7 billion-parameter language model that, according to Microsoft Research, demonstrates outstanding reasoning and language understanding capabilities, with state-of-the-art performance among base language models with fewer than 13 billion parameters.
  6. Another LLM that could be used is BioGPT (BioGPT: generative pre-trained transformer for biomedical text generation and mining | Briefings in Bioinformatics | Oxford Academic (oup.com)), a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining.
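To make the chunking step from point 2 concrete, here is a minimal sketch of fixed-size chunking with overlap, one common way to split a parsed document before vectorization. The paper does not describe its chunking strategy, so the function name, the word-based splitting, and the default sizes are all assumptions for illustration.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to `size` words.

    Consecutive chunks share `overlap` words, so a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, size=4, overlap=1):
        print(chunk)
```

Each chunk would then be embedded and written to the vector store. In practice, chunk size and overlap are tuning knobs, and splitting on sentence or section boundaries may work better for scientific abstracts than a fixed word count.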

Conclusion

The framework in the paper automates the labor-intensive process of discovering diseases associated with specific genes. The retrieval techniques deserve a second look, and there are plug-and-play tools available for orchestrating gen AI solutions. Any proposed gen AI solution should also account for security, responsible AI measures, modern LLM evaluation techniques, and grounding. It is both encouraging and fascinating to see more and more research on solutions for genomics and drug discovery.
