Key to Building Trustworthy AI in Healthcare for Everyone, Everywhere

Disclaimer: This is an abbreviated version intended as a <10-minute read. For a more detailed view, readers are encouraged to read the full document at the link here. Some of the points here are copied verbatim from the document.

The World Health Organization (WHO) released a seminal, first-of-its-kind publication detailing a framework targeted at entities developing AI-based software as a medical device (AI-SaMD), with a particular focus on evidence generation requirements through the product lifecycle. Notably, most development of AI/ML models for healthcare has happened in High-Income Countries (HICs). WHO attempts to set guardrails and provide guidance to industry and developers who want to translate AI/ML models from HICs to Low- and Middle-Income Countries (LMICs).

Section I - AI Software Development

Need for validation

Deploying models developed in HICs in LMICs without careful validation can introduce and propagate bias (Panch et al., 2019). Economic incentives and regulatory action might be needed to build, or at least validate, AI solutions in LMIC contexts using data appropriate to local populations (Kamulegeya et al., 2019). In a global context, manufacturers of AI tools should generate evidence across a wide demographic range, including diverse ethnic, racial, age, and sex groups (Liu et al., 2020). Testing the generalizability of AI models in this way could also alleviate the risk that AI aggravates disparities within populations and between HICs and LMICs (Alami et al., 2020). The Hippocratic principle of "do no harm" must guide all efforts to scale AI tools.

Intended Use

The intended use of an AI-SaMD should define, as clearly as possible, when, where and how it is to be used. This enables the evidence generated to be evaluated in the right context for safety and performance requirements. In this context, WHO lays out the Minimum Standard for Intended Use as follows:

[Figure: WHO minimum standard for intended use]

Algorithm Training: Upon defining the intended use, the AI model needs to be trained using input data that is representative of the intended use population. Larson et al. (2020) describe four phases of development and evaluation for AI-SaMD diagnostic algorithms, as shown below:

[Figure: Four phases of development and evaluation of AI-SaMD diagnostic algorithms (Larson et al., 2020)]

  1. Feasibility - studies in the innovation phase that aim to assess whether the AI-SaMD works as intended in a controlled setting, involving training and internal validation of the algorithm.
  2. Capability - testing the accuracy of the model in a controlled environment that simulates real-world conditions, applying it to a dataset independent of the one used for training. Accuracy, reliability and safety are the key questions in this phase.
  3. Effectiveness - these studies aim to confirm that the real-world performance of the algorithm, post clinical implementation, matches its performance in the test environment.
  4. Durability - this stage comprises the generation of clinical data to track ongoing performance, for use in evaluation and monitoring.

Clinical Study Design

Methodologies for designing clinical studies to generate new evidence of clinical effectiveness should be planned immediately after the intended use is defined. The SPIRIT-AI extension includes 15 new items to be included in clinical trial protocols for AI interventions, as detailed in the figure below:

[Figure: SPIRIT-AI extension items for clinical trial protocols of AI interventions]

In the context of LMIC deployment, the study cohort and contextual information must be pre-specified. The ethics of deployment becomes particularly important when moving from the HIC to the LMIC domain, and implementers should address four fundamental areas prior to deploying their solutions:

  • User studies and user experience research: Outcomes of user evaluations done in HICs may differ vastly for users in LMICs owing to workflow differences, knowledge and experience gaps, training, technology exposure, etc., and it is important to validate these differences to avoid potential misuse and unintended consequences.
  • Hazards and safety: Moving from HICs to LMICs, data drift owing to unseen populations, disease patterns, data quality, etc. is possible, and device calibration must be checked prior to deployment.
  • Technical infrastructure: Assumptions about connectivity, access to cloud services and types of modalities may differ vastly between HICs and LMICs.
  • Bias and fairness: Blind spots involving underrepresented and underperforming subgroups should be rigorously investigated and mitigated when deploying solutions from HICs in LMICs.
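The subgroup investigation in the last point can be sketched as a simple audit. The following is a minimal illustration (not from the WHO document): given per-case labels, predictions and a subgroup tag, it computes sensitivity per subgroup and flags any group that falls behind the best-performing one. The `subgroup_gaps` helper and the 0.05 gap tolerance are hypothetical choices.

```python
# Hypothetical pre-deployment subgroup audit: compare sensitivity across
# demographic subgroups and flag underperforming ones.
def sensitivity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

def subgroup_gaps(records, max_gap=0.05):
    """records: list of (subgroup, y_true, y_pred) tuples.
    Returns per-subgroup sensitivity and a list of subgroups whose gap
    to the best-performing subgroup exceeds max_gap."""
    groups = {}
    for g, t, p in records:
        groups.setdefault(g, ([], []))
        groups[g][0].append(t)
        groups[g][1].append(p)
    sens = {g: sensitivity(t, p) for g, (t, p) in groups.items()}
    best = max(sens.values())
    flagged = [g for g, s in sens.items() if best - s > max_gap]
    return sens, flagged
```

In practice the acceptable gap and the metric (sensitivity, specificity, PPV) would themselves be pre-specified in the evaluation plan.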

The SPIRIT-AI extension details the different components of the evaluation of AI-based SaMDs.

[Figure: SPIRIT-AI components for the evaluation of AI-based SaMDs]

Dataset Management

Clinical data used for model training and evaluation must be managed with strict controls to avoid any data leakage. Data selection considerations should include subgroups and confounding factors such as: patient demographics, including age and race; image acquisition system and image quality; risk group; clinical setting, including community hospitals, primary care, specialists and screening centers; pre-curated data (e.g., datasets from previous studies/investigations); geographic location, including rural, local and country (relative to the target population defined in the intended use statement); clinical demographics; previous pathology; and interventions, to name a few.

[Figure: Data selection considerations for subgroups and confounding factors]

Faes et al. (2020) present the following schema for dataset management and detail the characteristics of the model evaluation datasets.

  • Model Training: Evidence out of model training should include: an overview of the AI system, detailed description of AI system and architecture, description of the model training datasets, and technical requirements for software verification.
  • Model Evaluation: External validation should be performed with an independent test set from an external source; this is to demonstrate the generalizability of the model and should be carried out by independent evaluators.

At a minimum, evidence for data management should be provided on: access to and handling of data; acquisition of data, including data sources; data selection, including inclusion and exclusion criteria; data de-identification (anonymization), pre-processing and augmentation; data analysis, including for missing and poor-quality data; and data labelling.
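As one concrete illustration of the de-identification item above, the sketch below drops direct identifiers and replaces the patient ID with a salted one-way hash, so records remain linkable within a study without exposing identity. The field names (`patient_id`, `name`, etc.) and the helper are hypothetical examples, not part of the WHO guidance.

```python
import hashlib

# Field names treated as direct identifiers in this illustrative schema.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}

def deidentify(record, salt):
    """Return a copy of the record with direct identifiers removed and
    patient_id replaced by a salted one-way hash (pseudonymization)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    pid = out.pop("patient_id")
    out["pseudo_id"] = hashlib.sha256((salt + str(pid)).encode()).hexdigest()[:16]
    return out
```

A real pipeline would also cover quasi-identifiers (dates, rare diagnoses, free text) and keep the salt under separate access control.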

[Figure: Dataset management schema]

Willemink et al. (2020) detail the process of data governance for medical data, and how to prepare such data for machine learning, as illustrated in the following diagram.



Section II - AI Software Validation and Reporting

External Validation

The design, execution and reporting of external validation studies focus on how well the diagnostic accuracy of the model translates to clinical accuracy in a real-world setting. The following minimum standard should be met:

  • Dataset management should feature out-of-sample "unseen" (i.e., protected from developers/investigators) test sets of input data or images.

  1. Piloting and monitoring of data collection should be carried out to ensure diagnostic accuracy is maintained.
  2. Independent (peer) review should be carried out on output data.
  3. The algorithm should be retrained if the performance of the AI-SaMD does not meet the pre-specified performance target.
  4. The algorithm version should be updated and re-tested on a prospective independent test set.
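Steps 3 and 4 above amount to a release gate: evaluate, and if the pre-specified target is missed, retrain and bump the version before re-testing. A schematic sketch under stated assumptions, with `evaluate` and `retrain` as placeholders for a real evaluation and training pipeline:

```python
# Illustrative retrain-and-retest loop; names, the 0.90 target and the
# round limit are made-up examples, not from the WHO publication.
def release_gate(model, version, evaluate, retrain, target=0.90, max_rounds=3):
    for _ in range(max_rounds):
        score = evaluate(model)           # performance on protected test set
        if score >= target:
            return model, version, score  # meets pre-specified target
        model = retrain(model)            # algorithm retrained...
        version += 1                      # ...and version updated for re-test
    raise RuntimeError("performance target not met after retraining")
```

The key property is that the test set stays protected across rounds, so each re-test is still an honest external check.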

Faes et al. (2020) note that external validation datasets can have different characteristics, including: independent but different in setting and population; independent but different in geographic location; independent in the same or a new population over time (to test for degradation of algorithm performance as the population evolves); and independent with the use of different image capture devices.

Data Management

When evaluating an AI-SaMD, a clear data management plan should be pre-specified, covering the following considerations:

  • Data collection (retrospective/prospective); selection, inclusion and exclusion criteria (at the patient level and the input data (image) level). During selection, attention must be paid to diversity, spectrum bias and underrepresentation of certain population groups.
  • Digital image capture specifications, including image acquisition types
  • Data de-identification, including anonymization of other sensitive clinical data
  • Data storage format
  • Data quality assessment and machine learning model features. It must be noted that performance may be overestimated, or safety overlooked, in a highly curated/cleaned research dataset.
  • Reference standard determination, including the methodology for annotation / ground-truth labelling of images
  • Dataset split management and sample size determination
  • Details of an expanded training dataset and the addition of supplementary datasets
  • Data augmentation strategy (addition of independent training and test datasets; controlled access to both training and test datasets as additional data are included and the revised algorithm is tuned, retrained and tested)
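For the dataset split management point above, one common safeguard against leakage is to split at the patient level, so that all images from one patient land on the same side of the split. A minimal sketch; the function name and the 20% test fraction are illustrative:

```python
import random

def patient_level_split(samples, test_frac=0.2, seed=0):
    """samples: list of (patient_id, image) pairs. Splits by patient so a
    patient's images never appear in both train and test (no leakage)."""
    patients = sorted({pid for pid, _ in samples})
    rng = random.Random(seed)            # seeded for a reproducible split
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test
```

Stratification by site, device or disease prevalence would be layered on top in a real plan.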

Evidence Generation Standards

International Conference on Harmonization - Good Clinical Practice (ICH-GCP) principles must be followed to ensure safety and transparency of reporting for all clinical evaluations and investigations. There is, however, no consensus on how to evaluate and compare evidence generated from the development and implementation of AI-SaMDs.

[Figure: FDA Total Product Life Cycle (TPLC) approach for AI-based SaMDs]


The FDA's seminal action plan lays down the Total Product Life Cycle (TPLC) approach for managing AI-based SaMDs, best illustrated in the accompanying figure.


Evidence Reporting

The minimum standards for reporting of technical evidence are as follows:

  • Full details on the development of the AI algorithm, including intended use, subject populations, training and testing data, and public accessibility of the code
  • Technical information regarding on-site application of the AI technology
  • Details about human-AI interactions, including the required expertise of the user and how the AI output contributed to clinical decision making
  • Specificity with regard to which version of an AI algorithm was used, given that the performance of some algorithms can change iteratively or, in some cases, continuously

Data sources for the evaluation of clinical data in relation to safety and performance include: peer-reviewed data gathered via systematic literature search (e.g., via the PRISMA methodology; Moher et al., 2009), data from clinical experience, and data from clinical trials.

Section III - Deployment and Post-Market Surveillance

Evaluation of Usability

Education and training are required to "grow" an AI-literate workforce able to take full advantage of AI-SaMDs and other innovative interventions. During UI design, implementers must specify what the system output will be for corner cases such as errors in reading input data, incomplete datasets, internal errors, warnings, alerts and output failure. Clinical usability risks, such as the output of the AI-SaMD being misunderstood, ignored, misused or over-relied upon, need to be evaluated. The publication lays out minimum standards for evidence in evaluating the usability of AI-SaMD devices, as follows:

  • Evidence of integration into clinical workflow with sustained overall benefit
  • Infrastructure and conditions to allow for use of device as intended
  • Effects of adding AI-SaMD to current standard
  • Effects of disagreements between output of AI-SaMD and clinical decision of health care worker
  • Users’ interaction with output of AI-SaMD. Is the output interpretable? Error rates? Is the image readable?

Evaluation of Clinical Impact

Large-scale adoption and scalability of AI-SaMDs need user trust, good user and patient experience, and integration into the actual clinical workflow, with appropriate safeguards for patient safety. Trust can be further anchored during prospective real-world use if the AI-SaMD ensures the following key parameters: clear instructions for use (including labelling), a well-designed user interface, training and experience in using the AI-SaMD, prospectively conducted studies, and completely reported validation studies.

Evaluation Metrics: Model evaluation comprises evaluating either the discriminability of AI/ML models or their calibration. Discrimination metrics include threshold-free metrics such as the area under the receiver operating characteristic curve (the c-statistic). It must also be noted that the operating point plays a particularly important role in healthcare settings, as these commonly involve binary decisions. The choice depends on the clinical use case (e.g., high specificity for diagnosis, high sensitivity for screening) and resource constraints. If the AI-SaMD also intends to output risk scores, those outputs should be well calibrated. Metrics such as the Hosmer-Lemeshow statistic can reflect the quality of the calibration of the underlying models.
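The discrimination metrics and operating-point choice described above can be illustrated with a small sketch: AUROC computed as a rank statistic (the probability that a positive case scores above a negative one), and a screening-style threshold chosen to meet a target sensitivity. Both helpers are illustrative examples, not taken from the publication:

```python
import math

def auroc(y_true, scores):
    """AUROC as a rank statistic: fraction of (positive, negative) pairs
    where the positive scores higher (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_for_sensitivity(y_true, scores, target=0.9):
    """Highest threshold (predict positive when score >= threshold) whose
    sensitivity still meets the target -- a screening-style operating point."""
    pos = sorted((s for t, s in zip(y_true, scores) if t == 1), reverse=True)
    k = math.ceil(target * len(pos))  # positives that must be captured
    return pos[k - 1]
```

A diagnosis-oriented use case would instead pick the threshold from the specificity side, and calibration (e.g., Hosmer-Lemeshow) would be checked separately on the raw risk scores.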

The publication lays down minimum standards for clinical impact evaluation as follows:

  • Comparison to gold standard
  • Measures of improvement in patient outcomes, clinical process, or time efficiency
  • Measures of acceptable unintended consequences and absence of harm to patients
  • Changes in experience of patient or user (i.e., health care worker)

Evidence on Implementation

Software Development: The design of an AI-SaMD must consider how the software deals with exceptions in the real world, including incomplete datasets, wrong data formats, data outside pre-specified values, and wrong temporal sequence of incoming data. Any changes to software post-deployment, in the form of updates, should come with clearly documented changes, identifiable versioning and robust post-market surveillance.

Post-market Surveillance and Monitoring: In addition to PMS activities such as adverse events monitoring, continuous clinical evaluation report update, safety updates etc., PMS data generated by AI-based SaMDs should also be carefully examined using the following parameters:

  1. Quality of metrics: precision and accuracy, sensitivity/specificity
  2. Selection of operating points for thresholds: the validation set is usually employed to set operating points, as this better simulates prospective deployment; clinical outcome data should be monitored to ensure that expected performance metrics are being observed
  3. Variance of performance metrics over time
  4. Whether the data in the field (real-world performance data) are consistent with expected data
  5. Threshold values to trigger actions and modifications: re-evaluation of the benefit-risk analysis, re-training of the algorithm (unlock, re-train, version update) and product recall
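Parameters 3 to 5 above suggest a rolling monitor in deployment: track performance over a recent window of confirmed cases and map it to a pre-defined action band. In the sketch below, the class name, window size and thresholds are all made-up examples:

```python
from collections import deque

class PerformanceMonitor:
    """Illustrative post-market monitor: rolling accuracy over the most
    recent cases, with pre-defined target and action thresholds."""

    def __init__(self, window=100, target=0.9, trigger=0.85):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.target, self.trigger = target, trigger

    def record(self, correct):
        self.outcomes.append(1 if correct else 0)

    def status(self):
        if not self.outcomes:
            return "no data"
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate < self.trigger:
            return "action: re-evaluate benefit-risk / retrain"
        if rate < self.target:
            return "watch: below target"
        return "ok"
```

Real post-market surveillance would track several metrics (sensitivity, specificity, calibration) per site and subgroup, not a single rolling accuracy.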

The publication also lays down minimum standards for post-market clinical follow-up, noting that safety and performance monitoring post-implementation is required to show sustained clinical impact.

[Figure: Minimum standards for post-market clinical follow-up]

Evidence on Procurement

The document sets out guidance for buyers procuring AI-SaMD solutions for clinical implementation, from the angle of the evidence required to demonstrate safety, performance within the clinical context, and the clinical impact of the device with respect to its intended use. The procurement guidance from NHSX's Buyer's Guide to AI for Health and Care details the following questions and evidence requirements:

[Figure: NHSX Buyer's Guide questions and evidence requirements]

If you are interested in reading further, you can find the full publication here.


