Key to Building Trustworthy AI in Healthcare for Everyone, Everywhere

Disclaimer: This is an abbreviated version intended as a <10-minute read. For a more detailed view, readers are encouraged to read the full document at the link here. Some of the points here are copied verbatim from the document.

The World Health Organization (WHO) released a seminal, first-of-its-kind publication detailing a framework targeted at entities developing AI-based software as a medical device (AI-SaMD), with a particular focus on evidence generation requirements through the product lifecycle. Notably, most development of AI/ML models for healthcare has happened in High-Income Countries (HICs). WHO attempts to set guardrails and provide guidance to industry and developers who want to translate AI/ML models from HICs to Low- and Middle-Income Countries (LMICs).

Section I - AI Software Development

Need for validation

Deploying models developed in HICs in LMICs without careful validation can introduce and propagate bias (Panch et al., 2019). Economic incentives and regulatory action might be needed to build, or at least validate, AI solutions in LMIC contexts using data appropriate to local populations (Kamulegeya et al., 2019). In a global context, manufacturers of AI tools should generate evidence across a wide demographic range, including diverse ethnic, racial, age, and sex groups (Liu et al., 2020). Testing the generalizability of AI models in this way could also alleviate the risk that AI aggravates disparities within populations and between HICs and LMICs (Alami et al., 2020). The Hippocratic principle of "do no harm" must guide all efforts to scale AI tools.

Intended Use

The intended use of an AI-SaMD should define, as clearly as possible, when, where and how it is to be used. This enables the evidence generated to be evaluated in the right context for safety and performance requirements. In this context, WHO lays out the Minimum Standard for Intended Use as follows:

[Figure: WHO minimum standard for intended use]

Algorithm Training: Upon defining the intended use, the AI model needs to be trained using input data that is representative of the intended use population. Larson et al. (2020) describe four phases of development and evaluation for AI-SaMD diagnostic algorithms, as shown below:

[Figure: Four phases of development and evaluation of AI-SaMD diagnostic algorithms (Larson et al., 2020)]

  1. Feasibility - studies in the innovation phase that aim to assess whether the AI-SaMD works as intended in a controlled setting, involving training and internal validation of the algorithm.
  2. Capability - testing the accuracy of the model in a controlled environment that simulates real-world conditions, applying it to a dataset independent of the one used for training. Accuracy, reliability and safety are the key questions in this phase.
  3. Effectiveness - these studies aim to confirm that the real-world performance of the algorithm, post clinical implementation, matches its performance in the test environment.
  4. Durability - this stage comprises the generation of clinical data to track ongoing performance, for use in evaluation and monitoring.

Clinical Study Design

Methodologies for designing clinical studies to generate new evidence of clinical effectiveness should be planned immediately after the intended use is defined. The SPIRIT-AI extension includes 15 new items to be included in clinical trial protocols for AI interventions, as detailed in the figure below:

[Figure: SPIRIT-AI extension items for clinical trial protocols of AI interventions]

In the context of LMIC deployment, the study cohort and contextual information must be pre-specified. The ethics of deployment becomes particularly important when moving from the HIC to the LMIC domain, and implementers should address four fundamental areas prior to deploying their solutions:

  • User studies and user experience research: Outcomes of user evaluations done in HICs may differ vastly for users in LMICs owing to workflow differences, knowledge and experience gaps, training, technology exposure, etc., and it is important to validate these differences to avoid potential misuse and unintended consequences.
  • Hazards and safety: Moving from HICs to LMICs, data drift owing to unseen populations, disease patterns, data quality, etc. is possible, and device calibration must be checked prior to deployment.
  • Technical infrastructure: Assumptions about connectivity, access to cloud services and types of modalities may differ vastly between HICs and LMICs.
  • Bias and fairness: Blind spots involving underrepresented and underperforming subgroups should be rigorously investigated and mitigated when deploying solutions from HICs in LMICs.
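The subgroup investigation in the last point can be sketched as a simple audit. The following is a minimal illustration (not from the WHO document): given per-case labels, predictions and a subgroup tag, it computes sensitivity per subgroup and flags any group that falls behind the best-performing one. The `subgroup_gaps` helper and the 0.05 gap tolerance are hypothetical choices.

```python
# Hypothetical pre-deployment subgroup audit: compare sensitivity across
# demographic subgroups and flag underperforming ones.
def sensitivity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

def subgroup_gaps(records, max_gap=0.05):
    """records: list of (subgroup, y_true, y_pred) tuples.
    Returns per-subgroup sensitivity and a list of subgroups whose gap
    to the best-performing subgroup exceeds max_gap."""
    groups = {}
    for g, t, p in records:
        groups.setdefault(g, ([], []))
        groups[g][0].append(t)
        groups[g][1].append(p)
    sens = {g: sensitivity(t, p) for g, (t, p) in groups.items()}
    best = max(sens.values())
    flagged = [g for g, s in sens.items() if best - s > max_gap]
    return sens, flagged
```

In practice the acceptable gap and the metric (sensitivity, specificity, PPV) would themselves be pre-specified in the evaluation plan.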

The SPIRIT-AI extension details the different components of the evaluation of AI-based SaMDs.

[Figure: SPIRIT-AI components for the evaluation of AI-based SaMDs]

Dataset Management

Clinical data used for model training and evaluation must be managed with strict controls to avoid any data leakage. Data selection considerations should include subgroups and confounding factors such as: patient demographics, including age and race; image acquisition system and image quality; risk group; clinical setting, including community hospitals, primary care, specialists and screening centers; pre-curated data (e.g., datasets from previous studies/investigations); geographic location, including rural, local and country (relative to the target population defined in the intended use statement); clinical demographics; previous pathology; and interventions, to name a few.

[Figure: Data selection considerations for subgroups and confounding factors]

Faes et al. (2020) present the following schema for dataset management and detail the characteristics of the model evaluation datasets.

  • Model Training: Evidence out of model training should include: an overview of the AI system, detailed description of AI system and architecture, description of the model training datasets, and technical requirements for software verification.
  • Model Evaluation: External validation should be performed with an independent test set from an external source; this is to demonstrate the generalizability of the model and should be carried out by independent evaluators.

At a minimum, evidence for data management should be provided on: access to and handling of data; acquisition of data, including data sources; data selection, including inclusion and exclusion criteria; data de-identification (anonymization), pre-processing and augmentation; data analysis, including for missing and poor-quality data; and data labelling.
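As one concrete illustration of the de-identification item above, the sketch below drops direct identifiers and replaces the patient ID with a salted one-way hash, so records remain linkable within a study without exposing identity. The field names (`patient_id`, `name`, etc.) and the helper are hypothetical examples, not part of the WHO guidance.

```python
import hashlib

# Field names treated as direct identifiers in this illustrative schema.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}

def deidentify(record, salt):
    """Return a copy of the record with direct identifiers removed and
    patient_id replaced by a salted one-way hash (pseudonymization)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    pid = out.pop("patient_id")
    out["pseudo_id"] = hashlib.sha256((salt + str(pid)).encode()).hexdigest()[:16]
    return out
```

A real pipeline would also cover quasi-identifiers (dates, rare diagnoses, free text) and keep the salt under separate access control.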

[Figure: Dataset management schema]

Willemink et al. (2020) detail the process of data governance for medical data, and how to prepare such data for machine learning, as illustrated in the following diagram.



Section II - AI Software Validation and Reporting

External Validation

The design, execution and reporting of external validation studies focus on how well the diagnostic accuracy of the model translates to clinical accuracy in a real-world setting. The following minimum standard should be met:

  • Dataset management should feature out-of-sample "unseen" (i.e., protected from developers/investigators) test sets of input data or images.

  1. Piloting and monitoring of data collection should be carried out to ensure diagnostic accuracy is maintained.
  2. Independent (peer) review should be carried out on output data.
  3. The algorithm should be retrained if the performance of the AI-SaMD does not meet the pre-specified performance target.
  4. The algorithm version should be updated and re-tested on a prospective independent test set.
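Steps 3 and 4 above amount to a release gate: evaluate, and if the pre-specified target is missed, retrain and bump the version before re-testing. A schematic sketch under stated assumptions, with `evaluate` and `retrain` as placeholders for a real evaluation and training pipeline:

```python
# Illustrative retrain-and-retest loop; names, the 0.90 target and the
# round limit are made-up examples, not from the WHO publication.
def release_gate(model, version, evaluate, retrain, target=0.90, max_rounds=3):
    for _ in range(max_rounds):
        score = evaluate(model)           # performance on protected test set
        if score >= target:
            return model, version, score  # meets pre-specified target
        model = retrain(model)            # algorithm retrained...
        version += 1                      # ...and version updated for re-test
    raise RuntimeError("performance target not met after retraining")
```

The key property is that the test set stays protected across rounds, so each re-test is still an honest external check.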

Faes et al. (2020) note that external validation datasets can have different characteristics, including: independent but different in setting and population; independent but different in geographic location; independent in the same or a new population over time (to test for degradation of algorithm performance as the population evolves); and independent with the use of different image capture devices.

Data Management

When evaluating an AI-SaMD, a clear data management plan should be pre-specified, covering the following considerations:

  • Data collection (retrospective/prospective); selection, inclusion and exclusion criteria (at the patient level and the input data (image) level). During selection, attention must be paid to diversity, spectrum bias and underrepresentation of certain population groups.
  • Digital image capture specifications, including image acquisition types
  • Data de-identification, including anonymization of other sensitive clinical data
  • Data storage format
  • Data quality assessment and machine learning model features. It must be noted that performance may be overestimated, or safety overlooked, in a highly curated/cleaned research dataset.
  • Reference standard determination, including the methodology for annotation / ground-truth labelling of images
  • Dataset split management and sample size determination
  • Details of an expanded training dataset and the addition of supplementary datasets
  • Data augmentation strategy (addition of independent training and test datasets; controlled access to both training and test datasets as additional data are included and the revised algorithm is tuned, retrained and tested)
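For the dataset split management point above, one common safeguard against leakage is to split at the patient level, so that all images from one patient land on the same side of the split. A minimal sketch; the function name and the 20% test fraction are illustrative:

```python
import random

def patient_level_split(samples, test_frac=0.2, seed=0):
    """samples: list of (patient_id, image) pairs. Splits by patient so a
    patient's images never appear in both train and test (no leakage)."""
    patients = sorted({pid for pid, _ in samples})
    rng = random.Random(seed)            # seeded for a reproducible split
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test
```

Stratification by site, device or disease prevalence would be layered on top in a real plan.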

Evidence Generation Standards

International Conference on Harmonization - Good Clinical Practice (ICH-GCP) principles must be followed to ensure safety and transparency of reporting for all clinical evaluations and investigations. There is, however, no consensus on how to evaluate and compare evidence generated from the development and implementation of AI-SaMDs.

[Figure: FDA Total Product Life Cycle (TPLC) approach for AI-based SaMDs]


The FDA's seminal action plan lays down the Total Product Life Cycle (TPLC) approach for managing AI-based SaMDs, best illustrated in the accompanying figure.


Evidence Reporting

The minimum standards for reporting of technical evidence are as follows:

  • Full details on the development of the AI algorithm, including intended use, subject populations, training and testing data, and public accessibility of the code
  • Technical information regarding on-site application of the AI technology
  • Details about human-AI interactions, including the required expertise of the user and how the AI output contributed to clinical decision making
  • Specificity with regard to which version of an AI algorithm was used, given that the performance of some algorithms can change iteratively or, in some cases, continuously

Data sources for the evaluation of clinical data in relation to safety and performance include: peer-reviewed data gathered via systematic literature search (e.g., via the PRISMA methodology; Moher et al., 2009), data from clinical experience, and data from clinical trials.

Section III - Deployment and Post-Market Surveillance

Evaluation of Usability

Education and training are required to "grow" an AI-literate workforce able to take full advantage of AI-SaMDs and other innovative interventions. During UI design, implementers must specify what the system output will be for corner cases such as errors in reading input data, incomplete datasets, internal errors, warnings, alerts and output failure. Clinical usability risks, such as the output of the AI-SaMD being misunderstood, ignored, misused or over-relied upon, need to be evaluated. The publication lays out minimum standards for evidence in evaluating the usability of AI-SaMD devices, as follows:

  • Evidence of integration into clinical workflow with sustained overall benefit
  • Infrastructure and conditions to allow for use of device as intended
  • Effects of adding AI-SaMD to current standard
  • Effects of disagreements between output of AI-SaMD and clinical decision of health care worker
  • Users’ interaction with output of AI-SaMD. Is the output interpretable? Error rates? Is the image readable?

Evaluation of Clinical Impact

Large-scale adoption and scalability of AI-SaMDs need user trust, good user and patient experience, and integration into the actual clinical workflow, with appropriate safeguards for patient safety. Trust can be further anchored during prospective real-world use if the AI-SaMD ensures the following key parameters: clear instructions for use (including labelling), a well-designed user interface, training and experience in using the AI-SaMD, prospectively conducted studies, and completely reported validation studies.

Evaluation Metrics: Model evaluation comprises evaluating either the discriminability of AI/ML models or their calibration. Discrimination metrics include threshold-free metrics such as the area under the receiver operating characteristic curve (the c-statistic). It must also be noted that the operating point plays a particularly important role in healthcare settings, as these commonly involve binary decisions. The choice depends on the clinical use case (e.g., high specificity for diagnosis, high sensitivity for screening) and resource constraints. If the AI-SaMD also intends to output risk scores, those outputs should be well calibrated. Metrics such as the Hosmer-Lemeshow statistic can reflect the quality of the calibration of the underlying models.
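The discrimination metrics and operating-point choice described above can be illustrated with a small sketch: AUROC computed as a rank statistic (the probability that a positive case scores above a negative one), and a screening-style threshold chosen to meet a target sensitivity. Both helpers are illustrative examples, not taken from the publication:

```python
import math

def auroc(y_true, scores):
    """AUROC as a rank statistic: fraction of (positive, negative) pairs
    where the positive scores higher (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_for_sensitivity(y_true, scores, target=0.9):
    """Highest threshold (predict positive when score >= threshold) whose
    sensitivity still meets the target -- a screening-style operating point."""
    pos = sorted((s for t, s in zip(y_true, scores) if t == 1), reverse=True)
    k = math.ceil(target * len(pos))  # positives that must be captured
    return pos[k - 1]
```

A diagnosis-oriented use case would instead pick the threshold from the specificity side, and calibration (e.g., Hosmer-Lemeshow) would be checked separately on the raw risk scores.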

The publication lays down minimum standards for clinical impact evaluation as follows:

  • Comparison to gold standard
  • Measures of improvement in patient outcomes, clinical process, or time efficiency
  • Measures of acceptable unintended consequences and absence of harm to patients
  • Changes in experience of patient or user (i.e., health care worker)

Evidence on Implementation

Software Development: The design of an AI-SaMD must consider how the software deals with exceptions in the real world, including incomplete datasets, wrong data formats, data outside pre-specified values, and wrong temporal sequence of incoming data. Any changes to software post-deployment, in the form of updates, should come with clearly documented changes, identifiable versioning and robust post-market surveillance.

Post-market Surveillance and Monitoring: In addition to PMS activities such as adverse events monitoring, continuous clinical evaluation report update, safety updates etc., PMS data generated by AI-based SaMDs should also be carefully examined using the following parameters:

  1. Quality of metrics: precision and accuracy, sensitivity/specificity
  2. Selection of operating points for thresholds: the validation set is usually employed to set operating points, as this better simulates prospective deployment; clinical outcome data should be monitored to ensure that expected performance metrics are being observed
  3. Variance of performance metrics over time
  4. Whether the data in the field (real-world performance data) are consistent with expected data
  5. Threshold values to trigger actions and modifications: re-evaluation of the benefit-risk analysis, re-training of the algorithm (unlock, re-train, version update) and product recall
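Parameters 3 to 5 above suggest a rolling monitor in deployment: track performance over a recent window of confirmed cases and map it to a pre-defined action band. In the sketch below, the class name, window size and thresholds are all made-up examples:

```python
from collections import deque

class PerformanceMonitor:
    """Illustrative post-market monitor: rolling accuracy over the most
    recent cases, with pre-defined target and action thresholds."""

    def __init__(self, window=100, target=0.9, trigger=0.85):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.target, self.trigger = target, trigger

    def record(self, correct):
        self.outcomes.append(1 if correct else 0)

    def status(self):
        if not self.outcomes:
            return "no data"
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate < self.trigger:
            return "action: re-evaluate benefit-risk / retrain"
        if rate < self.target:
            return "watch: below target"
        return "ok"
```

Real post-market surveillance would track several metrics (sensitivity, specificity, calibration) per site and subgroup, not a single rolling accuracy.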

The publication also lays down minimum standards for post-market clinical follow-up, noting that safety and performance monitoring post-implementation is required to show sustained clinical impact.

[Figure: Minimum standards for post-market clinical follow-up]

Evidence on Procurement

The document sets out guidance for buyers procuring AI-SaMD solutions for clinical implementation, from the angle of the evidence required to demonstrate safety, performance within the clinical context, and the clinical impact of the device with respect to its intended use. The procurement guidance from NHSX's Buyer's Guide to AI for Health and Care details the following questions and evidence requirements:

[Figure: NHSX Buyer's Guide questions and evidence requirements]

If you are interested in reading further, you can find the full publication here.


