Key to Building Trustworthy AI in Healthcare for Everyone, Everywhere
Sailesh Conjeti, PhD
AI Product Management | Gen AI | AI for Medical Imaging | Computer Vision | Med Tech | Regulatory | Global Product Lead | Department Manager
Disclaimer: This is an abbreviated version intended as a <10 minute read. For a more detailed view, readers are encouraged to read the full document at the link here . Some of the points here are copied verbatim from the document.
The World Health Organization (WHO) released a seminal, first-of-its-kind publication detailing a framework targeted at entities developing AI-based software as a medical device (AI-SaMD), with a particular focus on evidence generation requirements through the product lifecycle. Notably, most development of AI/ML models for healthcare has happened in High-Income Countries (HICs). WHO attempts to set guardrails and provide guidance to industry and developers who want to translate AI/ML models from HICs to Low- and Middle-Income Countries (LMICs).
Section I - AI Software Development
Need for validation
Deploying models developed in HICs without careful validation in LMICs can introduce and propagate bias (Panch et al., 2019). Economic incentives and regulatory action might be needed to build, or at least validate, AI solutions in LMIC contexts, using data appropriate to local populations (Kamulegeya et al., 2019). In a global context, manufacturers of AI tools should generate evidence that covers a wide demographic context, including diverse ethnic, racial, age, and sex groups (Liu et al., 2020). Testing the generalizability of AI models in this way could also alleviate the risk that AI aggravates disparities within populations and between HICs and LMICs (Alami et al., 2020). The Hippocratic oath, "do no harm", must be the guiding principle for all efforts to scale AI tools.
Intended Use
The intended use of an AI-SaMD should define, as clearly as possible, when, where and how it is to be used. This enables the evidence generated to be evaluated in the right context for safety and performance requirements. In this context, WHO lays out the Minimum Standard for Intended Use as follows:
Algorithm Training: Once the intended use is defined, the AI model needs to be trained using input data representative of the intended use population. Larson et al. (2020) describe four phases of development and evaluation for AI-SaMD diagnostic algorithms, as shown below:
Clinical Study Design
Methodologies for designing clinical studies to generate new evidence of clinical effectiveness should be planned immediately after the intended use is defined. The SPIRIT-AI extension includes 15 new items to be included in clinical trial protocols of AI interventions, as detailed in the figure below:
In the context of LMIC deployment, the study cohort and contextual information must be pre-specified. The ethics of deployment becomes particularly important when moving from the HIC to the LMIC domain, and implementers should answer four primordial questions prior to deploying their solutions:
The SPIRIT-AI extension details the different components of evaluating AI-based SaMDs.
Dataset Management
Clinical data used for model training and evaluation must be managed with strict controls to avoid any data leakage. Data selection considerations should include subgroups and confounding factors such as: patient demographics (including age and race); image acquisition system; image quality; risk group; clinical setting (community hospitals, primary care, specialists, screening centers, etc.); pre-curated data (e.g., a dataset from a previous study/investigation); geographic location, including rural, local and country (relative to the target population as defined in the intended use statement); clinical demographics; previous pathology; and interventions, to name a few.
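The leakage control mentioned above is commonly enforced by splitting at the patient level rather than the record level, so that no individual contributes data to both training and test sets. A minimal sketch, assuming records carry a hypothetical `patient_id` field (names are illustrative, not from the WHO text):

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=42):
    """Split records so that no patient appears in both train and test."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(patient_ids)
    n_test = max(1, int(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Toy data: 10 patients with 3 images each.
records = [{"patient_id": pid, "image": f"img_{pid}_{i}"}
           for pid in range(10) for i in range(3)]
train, test = split_by_patient(records)

# No patient contributes data to both sets: no patient-level leakage.
assert {r["patient_id"] for r in train}.isdisjoint({r["patient_id"] for r in test})
```

The same idea extends to other confounders listed above, e.g. splitting by acquisition device or clinical site when those are suspected sources of leakage.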
Faes et al. (2020) present the following schema for dataset management and detail the characteristics of the model evaluation datasets.
At a minimum, evidence for data management should be provided on: access to and handling of data; acquisition of data, including data sources; data selection, including inclusion and exclusion criteria; data de-identification (anonymization), pre-processing and augmentation; data analysis, including for missing and poor-quality data; and data labelling.
Willemink et al. (2020) detail the process of data governance for medical data and how to prepare it for machine learning, as illustrated in the following diagram.
Section II - AI Software Validation and Reporting
External Validation
The design, execution and reporting of external validation studies focus on how well the diagnostic accuracy of the model translates to clinical accuracy in a real-world setting. The following minimum standard should be met:
Dataset management should feature out-of-sample “unseen” (i.e., protected from developers/investigators) test sets of input data or images.
Faes et al. (2020) detail that external validation datasets can take different forms, including: independent but different in setting and population; independent but different in geographic location; independent in the same/new population over time, to test for degradation of algorithm performance as the population evolves; and independent with the use of different image capture devices.
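A simple way to probe for the degradation described above is to compute the same discrimination metric on the internal test set and on each external set, then compare. A minimal sketch using the pairwise (Mann-Whitney) formulation of the AUC; the site names, labels and scores are illustrative assumptions:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model outputs on an internal test set and one external site.
datasets = {
    "internal_test":  ([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]),
    "external_siteA": ([1, 0, 1, 0, 0, 1], [0.7, 0.5, 0.6, 0.4, 0.6, 0.8]),
}
for name, (y, s) in datasets.items():
    print(f"{name}: AUC = {auc(y, s):.2f}")
```

In practice a pre-specified acceptance threshold, subgroup breakdowns and confidence intervals would accompany such a comparison; this only illustrates the per-dataset reporting pattern.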
Data Management
When evaluating an AI-SaMD, a clear data management plan should be pre-specified. This will cover the following considerations:
Evidence Generation Standards
International Conference on Harmonization - Good Clinical Practice (ICH-GCP) principles must be followed to ensure safety and transparency of reporting for all clinical evaluations and investigations. There is, however, no consensus on how to evaluate and compare evidence generated from the development and implementation of AI-SaMDs.
FDA's seminal action plan lays down the Total Product Life Cycle (TPLC) approach for managing AI-based SaMDs, which is best illustrated in the figure on the right.
Evidence Reporting
The following minimum standards for reporting of technical evidence are detailed as follows:
Data sources for evaluation of clinical data in relation to its safety and performance include peer-reviewed data gathered via systematic literature search (e.g., via the PRISMA methodology of Moher et al., 2009), data from clinical experience, and data from clinical trials.
Section III - Deployment and Post-Market Surveillance
Evaluation of Usability
Education and training are required to “grow” an AI-literate workforce able to take full advantage of AI-SaMDs and other innovative interventions. During UI design, implementers must specify what the system output should be for corner cases such as system errors in reading input data, incomplete datasets, internal errors, warnings, alerts and output failure. Clinical usability risks, such as the output of an AI-SaMD being misunderstood, ignored, misused or over-relied upon, need to be evaluated. The publication lays out minimum standards for evidence in evaluating usability of AI-SaMD devices, as follows:
Evaluation of Clinical Impact
Large-scale adoption and scalability of AI-SaMDs require user trust, good user and patient experience, and integration into the actual clinical workflow, with appropriate safeguards for patient safety. Trust can be further anchored during prospective real-world use if the AI-SaMD ensures the following key parameters: clear instructions for use (including labelling), a well-designed user interface, training and experience in using the AI-SaMD, prospectively conducted studies, and completely reported validation studies.
Evaluation Metrics: Model evaluation comprises evaluating either the discriminability of AI/ML models or their calibration. Discrimination metrics include threshold-free metrics such as the area under the receiver operating characteristic curve (c-statistic). It must also be noted that the operating point plays a particularly important role in healthcare settings, as these commonly involve binary decisions. The choice depends on the clinical use case (e.g., high specificity for diagnosis, high sensitivity for screening) and resource constraints. If the AI-SaMD also intends to output risk scores, those outputs should be well calibrated. Metrics such as the Hosmer-Lemeshow statistic can reflect the quality of calibration of the underlying models.
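To make the calibration check above concrete, a simplified Hosmer-Lemeshow style statistic can be computed by grouping predictions into risk bands and comparing observed versus expected event counts per band. This is a sketch under simplifying assumptions (fixed equal-size groups, no p-value from the chi-square distribution), not a full implementation of the test:

```python
def hosmer_lemeshow(labels, probs, n_groups=5):
    """Sum of (observed - expected)^2 / expected over risk groups,
    for both events and non-events; lower means better calibrated."""
    paired = sorted(zip(probs, labels))  # order cases by predicted risk
    size = len(paired) // n_groups
    stat = 0.0
    for g in range(n_groups):
        chunk = paired[g * size:] if g == n_groups - 1 else paired[g * size:(g + 1) * size]
        observed = sum(y for _, y in chunk)  # events seen in this band
        expected = sum(p for p, _ in chunk)  # events predicted in this band
        n = len(chunk)
        if 0 < expected < n:  # guard against division by zero
            stat += (observed - expected) ** 2 / expected
            stat += ((n - observed) - (n - expected)) ** 2 / (n - expected)
    return stat

# A well-calibrated toy model scores near zero; a miscalibrated one much higher.
probs = [0.1] * 10 + [0.9] * 10
calibrated = hosmer_lemeshow([0] * 9 + [1] + [1] * 9 + [0], probs, n_groups=2)
miscalibrated = hosmer_lemeshow([1] * 9 + [0] + [0] * 9 + [1], probs, n_groups=2)
assert calibrated < miscalibrated
```

Reliability diagrams (calibration curves) are a common visual complement to such a summary statistic.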
The publication lays down minimum standards for clinical impact evaluation as follows:
Evidence on Implementation
Software Development: The design of an AI-SaMD must consider how the software deals with exceptions in the real world, including incomplete datasets, wrong data formats, data outside pre-specified values, and wrong temporal sequence of incoming data. Any changes to the software post-deployment, in the form of updates, should come with clearly documented changes, identifiable versioning and robust post-market surveillance.
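The exception handling described above can be implemented as defensive input validation before inference, so that malformed records are reported rather than silently scored. A minimal sketch; the field names and plausible ranges are illustrative assumptions, not from the WHO text:

```python
# Illustrative required fields with plausible physiological ranges (assumptions).
REQUIRED_FIELDS = {"patient_age": (0, 120), "systolic_bp": (50, 250)}

def validate_input(record):
    """Return a list of human-readable issues; an empty list means usable."""
    issues = []
    for field, (lo, hi) in REQUIRED_FIELDS.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], (int, float)):
            issues.append(f"wrong type for {field}: {type(record[field]).__name__}")
        elif not lo <= record[field] <= hi:
            issues.append(f"{field}={record[field]} outside [{lo}, {hi}]")
    return issues

print(validate_input({"patient_age": 40, "systolic_bp": 120}))  # []
print(validate_input({"patient_age": 200}))  # out-of-range age, missing BP
```

In a deployed system each issue class would map to a pre-specified output (warning, alert, or refusal to produce a result) rather than a print statement.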
Post-market Surveillance and Monitoring: In addition to PMS activities such as adverse event monitoring, continuous clinical evaluation report updates, safety updates, etc., PMS data generated by AI-based SaMDs should also be carefully examined using the following parameters:
The publication also lays down minimum standards for post-market clinical follow-up, noting that safety and performance monitoring post-implementation is required to show sustained clinical impact.
Evidence on Procurement
The document sets guidance for buyers procuring AI-SaMD solutions for clinical implementation, from the angle of the evidence required to demonstrate safety, performance within the clinical context, and clinical impact of the device with respect to its intended use. The procurement guidance from NHSX's Buyers Guide to AI for Health and Care details the following questions and evidence requirements:
Select bibliography recommended for further reading:
If you are interested in reading further, you can find the full publication here .