AI’s Role in Precision Oncology - A 6-year Case Study
Oncologists often face difficult decisions when their patients progress beyond established treatment guidelines. When founding xCures in 2018, our mission was to develop AI-powered software that could support them in navigating these complex cases with data-driven, personalized insights. Our XCELSIOR protocol—an IRB-approved, patient-centric outcomes registry—was designed specifically to collect and structure detailed, actionable data from patients who move beyond standard guidelines and to learn from what oncologists were doing at the frontier of medicine. Six years in, we’re excited to share what we’ve learned about AI’s potential to help bridge the gap in precision oncology and enable more informed, individualized treatment strategies.
The Complexity of Cancer Treatment
Cancer is incredibly complex. To quantify this, we examined the growth of the National Comprehensive Cancer Network (NCCN) Guidelines. From 1996 to 2020, the number of guidelines increased from eight to seventy-three, representing a compound annual growth rate of 9.7%. Similarly, the average length of the guidelines expanded from 40 pages in 2004 to 200 pages in 2020. This amounts to approximately 3,000 pages of new material each year.
This growth reflects enormous success in understanding cancer and developing new treatments. However, it also means that a general medical oncologist must absorb a vast amount of new information annually and incorporate it into their daily practice.
In parallel, we witnessed tremendous growth in clinical genomic sequencing, introducing a large amount of potentially relevant new genomic information. We performed a similar analysis of genes and gene variants with diagnostic, prognostic, or therapeutic significance from the CIViC database, which showed even more rapid growth than the NCCN Guidelines. Both of these findings are shown in Table 1, below.
The Challenge of Information Overload
The sheer volume of new information presents a significant challenge for oncologists. Information retrieval—the process of finding relevant information—is one of the most critical yet least visible areas of artificial intelligence (AI) today. A canonical example of effective information retrieval is your search bar. Search technology transformed the early internet from a playground for computer scientists into a useful tool for everybody.
In 2020, and even more so today, search and recommendation engines play an outsized, but mostly invisible, role in our lives. It was from this domain that we initially set out to build an AI-powered clinical decision support tool for oncologists. In fact, we heard from patients and oncologists that online search tools were a key resource, and that there was (and remains) a gap in access to quality oncology information. Among oncologists, besides search tools, we heard that group chats and informal peer-to-peer "curbside consults" remain important to their practice when dealing with complex or unusual cases.
Developing an AI-Powered Recommender System
Recommender systems differ from traditional statistical and machine learning (ML) systems, although modern implementations often incorporate the latest technologies. Unlike systems that classify or predict based on existing data, recommender systems aim to interpolate missing data. For example, predicting what movies you might like involves understanding your preferences and comparing them to those of others with similar tastes.
Applying these concepts to cancer treatment requires understanding and encoding (or embedding) both patient features and treatment features. Our first step was to design a recommender system that could surface relevant treatment options for oncologists by accurately representing both patient and treatment characteristics.
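To make the ranking step concrete, here is a minimal sketch in Python, assuming patient and treatment features have already been reduced to fixed-length vectors; the function and variable names are illustrative rather than our production code:

```python
# Minimal sketch of embedding-based recommendation: rank treatments by
# cosine similarity between a patient vector and treatment vectors.
# All names here are illustrative, not a production schema.
import numpy as np

def rank_treatments(patient_vec: np.ndarray,
                    treatment_vecs: np.ndarray,
                    treatment_names: list[str],
                    top_k: int = 10) -> list[tuple[str, float]]:
    """Return the top_k treatments whose embeddings best match the patient."""
    # Normalize so that a dot product equals cosine similarity.
    p = patient_vec / np.linalg.norm(patient_vec)
    t = treatment_vecs / np.linalg.norm(treatment_vecs, axis=1, keepdims=True)
    scores = t @ p
    order = np.argsort(scores)[::-1][:top_k]
    return [(treatment_names[i], float(scores[i])) for i in order]
```

In practice, the hard part is not the similarity math but producing embeddings that faithfully capture the clinical features that matter.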
Leveraging Clinical Trials as a Knowledge Base
While much has been written about using AI/ML to match patients to clinical trials, one of our key insights was recognizing that clinical trials themselves encode a wealth of information about the current and changing treatment landscape and emerging therapies. Clinical trials outline detailed inclusion and exclusion criteria, standard and comparison treatment regimens (which, in the opinion of the trial designers, should be in approximate clinical equipoise), and study populations, making them a rich source of data that can be readily structured.
One major challenge with recommender systems is the "cold start problem": how to generate recommendations without prior user preference information. With movies, one can start with critics' reviews. Since expert opinion is a great place to start, we hypothesized that by embedding extensive information about past and current clinical trials (including population features, treatments under study, and treatment attributes), we could create a robust foundation for our system. After all, the existence of a clinical trial for a product reflects broad expert knowledge, from the investors funding the companies running trials, to the investigators conducting them, to the ethical reviewers and drug regulators overseeing them.
To this, we added data from published case studies and databases of known drug-gene interactions, filtering out any regimen without evidence of prior human use. This approach grounded our model in clinical evidence while incorporating genomic insights. We deliberately chose to exclude transcriptomic pathways, which involve the study of RNA transcripts produced by the genome, as we felt they were potentially too speculative for a tool intended for practicing medical oncologists.* Ultimately, we compiled a catalog of about 12,000 anti-cancer regimens published in clinical trials, case reports, or observed in our electronic health record (EHR) registry data.
*NB: We later worked with a team of transcriptomic experts from the Institute for Systems Biology on a separate transcriptomic model that did, in fact, produce good recommendations.
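For intuition on how trial records can seed a cold start, here is a hedged sketch using a generic sentence encoder; the library, model name, and record fields are assumptions for illustration, not a description of our pipeline:

```python
# Sketch: bootstrap a "cold start" knowledge base by embedding text drawn
# from clinical trial records (condition, arms, eligibility criteria).
# The NCT ID and field layout below are hypothetical.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # any general-purpose text encoder

trials = [
    {"nct_id": "NCT00000000",
     "condition": "glioblastoma, IDH-wildtype",
     "arms": "temozolomide + radiotherapy vs radiotherapy alone",
     "eligibility": "age >= 18, KPS >= 60, no prior chemotherapy"},
    # ... thousands more trial records
]

# One embedding per trial; in practice, population features and treatment
# features might be embedded separately so they can be matched independently.
texts = [f'{t["condition"]} | {t["arms"]} | {t["eligibility"]}' for t in trials]
trial_vecs = model.encode(texts, normalize_embeddings=True)
```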
Early Success and Unexpected Findings
To our great surprise, the first prototype, which embedded a brief patient case summary, produced a top-10 ranked list of treatment options that was remarkably sensible. Ranking is a challenging problem, especially when ordering 12,000 items for each case, but our scientists and experts agreed that the model was promising at first pass. This success indicated that our model could discern subtle patient differences and surface relevant treatment options, even with a vast option space and no data beyond what was used for the cold start.
However, patients are unique, and treatment decisions often hinge on subtle factors. How does one learn the nuanced boundaries that shift a treatment choice from Option A to Option B for two very similar patients?
Training with Expert Oncologists
To enhance the system's accuracy, we aimed to train it using insights from the best oncologists we could find. We assembled and conducted hundreds of molecular tumor boards: panels of 3–5 expert oncologists specializing in the patient's particular tumor type. Members of these tumor boards were highly published clinicians who had experience as investigators in clinical trials and were often involved in developing clinical guidelines. Our team of scientists, specializing in cancer biology and pharmacology, prepared detailed cases and initial option lists to facilitate discussions.
We collected consensus-ranked option lists for each patient, along with the tumor board's stated rationale for ranking one option over another. This process not only improved the learning rate of our AI system but also provided a foundation of explainability.
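One common way to learn from consensus-ranked lists is pairwise learning-to-rank: whenever a board ranked option A above option B for the same patient, the model is nudged to score A above B. The sketch below uses a linear scorer and a logistic pairwise loss purely to illustrate the technique; it is not our actual model:

```python
# Pairwise learning-to-rank sketch: each tumor-board consensus list yields
# ordered pairs (higher-ranked option, lower-ranked option) for one patient.
import numpy as np

rng = np.random.default_rng(0)
d = 64               # dimension of a joint patient+treatment feature vector
w = np.zeros(d)      # weights of a simple linear scorer

def score(x: np.ndarray) -> float:
    return float(x @ w)

def pairwise_update(x_hi: np.ndarray, x_lo: np.ndarray, lr: float = 0.1) -> None:
    """One SGD step on the loss -log sigmoid(score(x_hi) - score(x_lo))."""
    global w
    margin = score(x_hi) - score(x_lo)
    p = 1.0 / (1.0 + np.exp(-margin))    # model's probability of agreeing with the board
    w -= lr * (p - 1.0) * (x_hi - x_lo)  # gradient step pushes the margin wider

# Random vectors stand in for real (patient, option) features.
for _ in range(1000):
    pairwise_update(rng.normal(size=d), rng.normal(size=d))
```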
Navigating Regulatory Challenges with the FDA
Our objective was to bring this AI-powered system to the U.S. Food and Drug Administration (FDA) as Software as a Medical Device (SaMD). Between 2020 and 2021, we engaged in multiple submissions and meetings with the FDA. We learned many things, one of which was that an AI-powered system capable of reading data from a patient chart and outputting a ranked list of potential treatment options held great interest among the scientists and clinicians at CDRH. However, obtaining a label claiming the software would make patients live longer or feel better was not likely feasible under the current regulatory framework, since it was an open question whether an AI system that could output an "off-label" recommendation could be approved by the agency.
The FDA requires rigorous evidence demonstrating that a medical device directly improves patient outcomes. In the case of our AI system, we had a clear plan to collect outcomes, as we were already doing in the XCELSIOR registry, in order to prospectively validate that patients who received a recommended (or higher-ranked) option did better than those who did not (or who received a lower-ranked option). Yet that doesn't matter if the recommendation space lies largely in off-label uses of treatments for patients who have no standard-of-care options. Consequently, the system ended up under FDA's "Enforcement Discretion," allowing us to offer a free version, which has been used by roughly 250 oncologists. Oncologist feedback on the system was overwhelmingly positive for two main reasons. First, while it mostly suggested things they were already considering, every so often the mix included something they had overlooked but that made good intuitive sense; that mix of credibility and surprise is the goal for any developer of recommendation systems. Second, the rationales for why each option made sense for the individual patient proved useful to clinicians for insurance prior authorizations.
Validation in the Absence of a Gold Standard
One of the most interesting aspects of this research was our effort to validate and measure the system's performance in a situation where no gold standard exists. To date, FDA-approved AI-powered systems have typically been validated against gold-standard datasets produced by human experts. In advanced cancer care, even experts often disagree on the best course of action.
So, we began studying tumor boards themselves to understand inter-tumor-board reliability. In the FDA review of new cancer medications, independent reviewers assess changes in tumor size, and FDA requires drug developers to measure inter-rater reliability to ensure that experts are well calibrated. We applied those same concepts to the tumor boards (inter-tumor-board reliability) and to individual experts (inter-expert reliability) by running the same patient cases through separate boards or multiple experts to assess how much agreement existed among tumor boards and among experts. As you might imagine, some cases had high agreement while others did not, reflecting the difficulty of practicing medicine for patients beyond standard of care.

We found that while tumor boards and experts almost always agree on what not to do for a given patient (specificity: the percentage of time that two experts or two boards agreed on what not to recommend), they do not completely agree on what to do (sensitivity: of all the options recommended by one expert or board for a given case, the percentage that the other expert or board also recommended; and precision: the fraction of options recommended by one expert or board that the other agreed with). In the table below, F1 is a measure that balances sensitivity and precision, reflecting the ability to select true positive options while avoiding false ones. Interestingly, but perhaps not surprisingly, agreement between tumor boards was higher than agreement between individual expert oncologists. These results are described in Table 2, below, which includes mean scores and standard deviations of the mean scores. Scores can range from zero to one, with one reflecting perfect agreement.
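For readers who want the mechanics, here is a small sketch of how agreement statistics like these can be computed between two boards over a shared catalog of candidate options; it follows the definitions above, though the exact scoring in our study may have differed:

```python
# Agreement between two boards, treating board A as reference and board B
# as test, over a shared catalog of candidate options.
def agreement(catalog: set, board_a: set, board_b: set) -> dict:
    tp = len(board_a & board_b)              # both boards recommend
    fp = len(board_b - board_a)              # only B recommends
    fn = len(board_a - board_b)              # only A recommends
    tn = len(catalog - board_a - board_b)    # neither recommends
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# With ~12,000 mostly-unrecommended options, specificity sits near 1 even
# when the recommended sets barely overlap, which matches the observation
# that boards agree far more on what not to do than on what to do.
```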
Our goal was for the AI system to perform within the margin of variability that a human tumor board might exhibit, which it does, thereby serving as a helpful tool for clinicians by surfacing options they might overlook during a busy day. Indeed, the clinical utility of the system was evaluated by measuring how many patients actually received an option suggested by a tumor board. For example, within our brain tumor cohort of 114 tumor boards, patients started one of the top two options suggested by the tumor board within six weeks 44% of the time (https://doi.org/10.1093/neuonc/noab196.448). We strongly believe that the future of clinical decision support lies in hybrid systems that assist busy clinicians without challenging or replacing their judgment.
It's important to recognize that these experts and boards had access to complete medical records, but they weren't sitting in the exam room. Written records, even when comprehensive, may not capture the full range of information available to a clinician during a patient encounter.
Reflecting on Recent Developments
Very recently, I came across an article that resonated with me: "Expert-Guided Large Language Models for Clinical Decision Support in Precision Oncology" by Lammert and colleagues, published in JCO Precision Oncology. The study involved fine-tuning a large language model (LLM) with curated oncology case report data and measuring its performance against a human tumor board. This approach parallels our efforts and underscores the potential of LLMs in this domain. It is nice to see progress on this topic, but as we found, to benchmark a tumor board you need another tumor board: human molecular tumor boards reflect the consensus judgment of experts, as guidelines do more generally, but they are hard to assess.
Our approach relied on inter-tumor-board reliability statistics, but this only tells part of the story. Do good decisions lead to better outcomes? We characterized this as fast learning (tumor board recommendations) and slower learning (patient outcomes). Our plan was to evaluate whether patients who received higher-ranked options lived longer. The point here is that it will be difficult to train an AI model that is better than the best humans unless we have some way of reliably measuring the best human experts and separating the relative contribution of good decisions from good outcomes.
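That slower loop can be framed as a survival analysis. Here is a hedged sketch using the lifelines library, with toy data and illustrative column names rather than study results:

```python
# Did patients who received a top-ranked option live longer? A Kaplan-Meier
# curve plus a log-rank test is the simplest framing; real analyses would
# adjust for confounding. All values below are toy data.
import pandas as pd
from lifelines import KaplanMeierFitter   # pip install lifelines
from lifelines.statistics import logrank_test

df = pd.DataFrame({
    "months":         [14.2, 6.1, 22.0, 9.5, 18.3, 4.4, 11.0, 7.8],
    "died":           [1, 1, 0, 1, 0, 1, 1, 1],      # 0 = censored
    "got_top_option": [1, 0, 1, 0, 1, 0, 1, 0],      # received a top-ranked option
})

top, rest = df[df.got_top_option == 1], df[df.got_top_option == 0]

kmf = KaplanMeierFitter()
kmf.fit(top.months, event_observed=top.died, label="top-ranked option")
print(kmf.median_survival_time_)

result = logrank_test(top.months, rest.months,
                      event_observed_A=top.died, event_observed_B=rest.died)
print(result.p_value)   # evidence of a survival difference between groups
```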
As medicine becomes more complex, tools to help physicians become more important. This is already apparent in US healthcare. If you go to an academic medical center for cancer treatment, the odds are that your doctor has seen many similar cases, regularly treats people like you, and is probably up to speed on the latest clinical trials. In the community, however, your oncologist sees a bit of everything, and keeping up with the latest developments takes more time because of the wider range of cases. Technology is the way to bring the latest knowledge and expertise to any clinician anywhere.
The Evolution of AI and LLMs in Oncology
Much has changed since 2020–2021. Back then, with transformer models based on the BERT architecture, extracting features like stage, grade, histopathology, and morphology required hundreds to thousands of labeled positive and negative examples. Today, large language models possess a wealth of current information about cancer and can understand subtle distinctions that may elude many practitioners; for example, LLMs can differentiate between radiographic and clinical disease progression (unpublished observation). The flip side is that when interacting with these models, one must be precise and specific, or the answers may not make sense.
While concerns about "hallucinations" (instances where AI generates incorrect or nonsensical information) have been prevalent, the technology to avoid this issue has advanced rapidly. Structured Outputs and Retrieval-Augmented Generation, using both semantic embeddings and graph databases, offer a robust factual foundation on which to build responsive AI systems. The models themselves are improving faster than Moore's Law. In an informal comparison I conducted between GPT-3.5-turbo (an estimated 175-billion-parameter model that was state-of-the-art 18 months ago) and Llama 3.2-1B (an open-source 1-billion-parameter model released one month ago), the smaller model noticeably outperformed the larger one. This suggests to me, at least, that we can be optimistic about what is possible in the coming years.
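As one example of the pattern, here is a sketch of schema-constrained extraction followed by validation before anything downstream consumes the output; the model name, field set, and prompt are illustrative assumptions, not a recommendation:

```python
# Constrain an LLM to a fixed schema for chart abstraction, then validate.
# Assumes the OpenAI Python SDK and pydantic v2; fields are illustrative.
from openai import OpenAI
from pydantic import BaseModel, ValidationError

class Abstraction(BaseModel):
    stage: str              # e.g., "IV"
    histology: str          # e.g., "adenocarcinoma"
    progression_type: str   # "radiographic" | "clinical" | "none"

client = OpenAI()
note = "...free-text oncology progress note..."

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any JSON-mode-capable model
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Extract JSON with keys: stage, histology, progression_type."},
        {"role": "user", "content": note},
    ],
)

try:
    record = Abstraction.model_validate_json(resp.choices[0].message.content)
except ValidationError:
    record = None  # route to human review rather than guessing
```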
Perhaps in another post, I can delve into why LLM hallucinations are less concerning now than when they were first identified, and talk more about how the many new tools available provide more confidence in AI-based output. Finally, while I'm excited about the powerful sequence-learning capacity that LLMs bring to longitudinal medical data relative to traditional feature engineering and statistical models, the ideal system will likely utilize all of these tools. It should also be noted that, despite their advantages, LLMs may lack true reasoning, and providing the necessary medical data to train them to think more like a doctor remains complicated.
Ethical Considerations and Data Challenges
Developing AI systems in healthcare raises significant ethical considerations. We built our system using data collected for that purpose, with explicit patient consent, in an IRB-approved registry. Another critical aspect is the potential for bias in AI systems. Medical record data used for training often lacks information about which treatment options were considered and why a particular treatment was chosen. This lack of counterfactual information poses a blind spot when training AI/ML systems. By eliciting feedback from tumor boards and capturing their rationale, we aimed to mitigate this issue and enhance the system's explainability.

The representativeness of the data must also be considered. Our data came from patients across the socioeconomic continuum, from all 50 US states and the District of Columbia, and was generally reflective of published epidemiological data about cancer incidence, although there were areas where we deviated from population estimates. Of course, data about US-based patients may not generalize to other health systems around the world; indeed, the options available in the US are typically broader than in many countries, especially for investigational treatments. Moreover, AI/ML algorithms are notoriously tricky when it comes to dealing with low-frequency information. Special effort will be required for rare cancers, an area where these systems may also provide the most clinical benefit.
Shifting Focus: Data Interoperability
A few years ago, our focus at xCures shifted to address an even more challenging problem: obtaining the data that serves as input into these clinical decision support systems, and normalizing and validating that input data. Remarkably, using AI to improve cancer options for physicians was easier than efficiently acquiring and processing the necessary medical records as input for the system.
Efficient data interoperability in healthcare remains a significant hurdle, especially here in the USA. Standardizing data from diverse sources, ensuring its accuracy, and integrating it into AI systems is complex. Yet, as I mentioned in my previous post, we are entering an exciting time for healthcare data interoperability in the United States. I am very optimistic that advances in healthcare data interoperability, converging with improvements in AI, will power transformative changes in healthcare.
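To give a flavor of the plumbing involved, here is a minimal sketch of pulling structured diagnoses from a FHIR R4 endpoint; the base URL is a placeholder and authentication is elided:

```python
# Fetch a patient's Condition resources from a FHIR R4 server and flatten
# the fields a decision-support system might consume.
import requests

FHIR_BASE = "https://fhir.example.org/R4"  # hypothetical endpoint

def fetch_conditions(patient_id: str) -> list[dict]:
    """Return code/display/onset records for a patient's conditions."""
    resp = requests.get(f"{FHIR_BASE}/Condition",
                        params={"patient": patient_id},
                        headers={"Accept": "application/fhir+json"},
                        timeout=30)
    resp.raise_for_status()
    bundle = resp.json()
    records = []
    for entry in bundle.get("entry", []):
        cond = entry["resource"]
        coding = (cond.get("code", {}).get("coding") or [{}])[0]
        records.append({"code": coding.get("code"),
                        "display": coding.get("display"),
                        "onset": cond.get("onsetDateTime")})
    return records
```

Getting this step right, normalizing codes across vendors and validating completeness, turned out to be the harder engineering problem.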
Conclusion: The Future of AI in Precision Oncology
I am grateful to the more than 70 oncologists who participated in nearly 1,000 tumor boards, and to the team at xCures responsible for this work, including Drs. Asher Wasserman, Jeff Schrager, Glenn Kramer, Tim Stuhlmiller, Santosh Kesari, Eric Wong, Nicholas Blondin, Ekokobe Fonkem, Julie Friedman, Matt Warner, Zachary Osking, and Jameson Quinn, among many others. Developing an AI-powered oncology clinical decision support tool highlighted both an incredible opportunity and a very long list of validation requirements for bringing AI into precision oncology. We learned that while AI systems can learn quickly and, in some cases, outperform human experts, the path to regulatory approval and practical implementation is fraught. So, both optimism and skepticism are warranted.
The future of clinical decision support will likely involve hybrid systems that augment clinicians' capabilities without replacing their judgment. As AI technology continues to evolve, and as we overcome hurdles in data interoperability and ethical considerations, we can expect AI to play a growing role in improving patient outcomes in precision oncology. For those looking for further reading, some of this work was presented at the DCI workshop series on AI in Medicine, which led to an excellent overview of how to approach these systems from a technical perspective. That work, by Labkoff et al., can be found in the Journal of the American Medical Informatics Association, 2024, 31(11), 2730–2739 (https://doi.org/10.1093/jamia/ocae209).
By Mark Shapiro, COO, xCures, Inc.