Scaling Pathology Foundation Models from the ROI to Gigapixel WSIs

Large pretrained models are set to have a transformative impact on the clinical translation of AI in pathology workflows. For challenging tasks like treatment response assessment using gigapixel whole-slide images (WSIs), how tractable is it to train end-to-end supervised models on clinical trial data with 100 patients? In the Hierarchical Image Pyramid Transformer (HIPT, published in CVPR 2022), we previously showed that simple linear classifiers trained on pre-extracted slide-level features can compete with MIL methods in small sample size regimes. For retrospective clinical trial data, methods like HIPT can be used to integrate deep features from routine pathology examination with other clinical variables, such as patient history and molecular panels, when developing Cox models for patient stratification.
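
To make that Cox-model integration concrete, here is a minimal sketch (not the HIPT pipeline itself) of fitting a penalized Cox model on a few principal components of pre-extracted slide embeddings alongside clinical covariates, using lifelines. All arrays and column names below are placeholders.

```python
# Toy illustration: combining pre-extracted slide-level features with
# clinical covariates in a penalized Cox model. Not the HIPT codebase.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
slide_features = rng.normal(size=(100, 192))        # placeholder slide embeddings
clinical_df = pd.DataFrame({
    "age": rng.integers(40, 85, size=100),
    "stage": rng.integers(1, 4, size=100),
    "time": rng.exponential(36, size=100),           # months to event/censoring
    "event": rng.integers(0, 2, size=100),           # 1 = event observed
})

# Reduce the high-dimensional embedding so the Cox model stays well-posed
# in a ~100-patient cohort.
pcs = PCA(n_components=4, random_state=0).fit_transform(slide_features)
df = pd.concat(
    [clinical_df, pd.DataFrame(pcs, columns=[f"pc{i}" for i in range(4)])],
    axis=1,
)

cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()
```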

Since publication, HIPT has found success in a variety of pathology applications, from panoptic cell segmentation and IBD severity assessment to multimodal integration with Spatial Transcriptomics. Compared to models today, though, the pretrained ROI encoder in HIPT is not nearly as powerful. To build better models that achieve slide-level self-supervised learning, we not only need to scale the data and model capacity in hierarchical pretraining, but also need a strong "foundation" of features to build on top of.

Current Progress on ROI-Level Foundation Models

Since 2022, many other awesome ROI encoders have had an important impact on computational pathology. CTransPath was one of the first to show consistent and meaningful performance gains over ResNet-50 with ImageNet transfer. REMEDIS demonstrated data efficiency in building pathology AI models that also generalize to independent cohorts. PLIP was the first ROI encoder developed using vision-language pretraining. Last month, our research group released our own powerful ROI encoders (UNI and CONCH), with new capabilities and extensive evaluation across diverse demographics, under-represented and rare diseases, and challenging machine learning regimes.

UNI

With Ming Yang (Max) Lu, Tong Ding, and Drew Williamson (published in Nature Medicine, March 2024), we developed UNI, a vision encoder trained on 100M ROIs across 100K WSIs, available on HuggingFace. Since our preprint, we have expanded the evaluation of UNI with comparisons against public leaderboards, and report new findings on the impact that using stronger ROI encoders can have.
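
For readers who want to try UNI as a frozen ROI encoder, a minimal sketch is below. It assumes the timm / Hugging Face Hub loading path and standard ImageNet preprocessing described on the public model card; the hub id, extra kwargs, and output shape should be verified against the current UNI release, which is gated and requires requesting access.

```python
# Illustrative only: loading UNI as a frozen ROI encoder via timm's
# Hugging Face Hub integration. Check the UNI model card for the
# authoritative usage; the hub id and kwargs below are assumptions.
import timm
import torch
from torchvision import transforms
from PIL import Image

model = timm.create_model(
    "hf-hub:MahmoodLab/UNI",   # assumed hub id; gated, requires HF login
    pretrained=True,
    init_values=1e-5,
    dynamic_img_size=True,
)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

roi = preprocess(Image.open("roi.png").convert("RGB")).unsqueeze(0)
with torch.inference_mode():
    feature = model(roi)   # expected (1, 1024) ViT-Large embedding if the
                           # hub config disables the classification head
```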

Many methods such as CLAM, SCL-WC, DTFD-MIL, and MHIM-MIL were originally motivated by the weakly-supervised "needle-in-a-haystack" problem in CAMELYON16 (C16), proposing additions such as clustering constraints, mitigation of patch redundancy, hard negative mining, and more. Though improving the MIL architecture has certainly advanced the SOTA on C16 among models that use ImageNet features (from 0.936 AUC with CLAM to 0.965 AUC with MHIM-MIL), a very simple MIL architecture (ABMIL) with UNI features can reach SOTA weakly-supervised performance on C16. With stronger MIL architectures, we may soon outperform the fully-supervised performance in the original challenge! The comparisons here are not to emphasize SOTA (which is relative to the ROI encoders used), but rather to emphasize the significance of representation quality in building pathology AI models. This has also been seen in other recent works using UNI, such as the 1st-place solution in the Nightingale 2024 Detecting Active Tuberculosis Bacilli Data Challenge and clinical validation in ovarian cancer subtype classification. We hope that there continues to be significant interest in investigating SSL in pathology (as much as there has been in developing MIL models)!
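
For context on just how simple the "simple MIL architecture" is, below is a minimal gated-attention MIL head in the spirit of ABMIL (Ilse et al., 2018), operating on a bag of pre-extracted patch features. Dimensions and the toy usage line are illustrative, not our exact training configuration.

```python
# Minimal gated-attention MIL head over pre-extracted patch features
# (e.g., 1024-dim UNI embeddings). Sketch only.
import torch
import torch.nn as nn

class GatedABMIL(nn.Module):
    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):                                   # bag: (n_patches, in_dim)
        scores = self.attn_w(self.attn_v(bag) * self.attn_u(bag))   # (n_patches, 1)
        attn = torch.softmax(scores, dim=0)                   # attention over patches
        slide_feat = (attn * bag).sum(dim=0)                  # (in_dim,) slide embedding
        return self.classifier(slide_feat), attn

# Toy usage: logits, attn = GatedABMIL()(torch.randn(5000, 1024))
```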

Demographic bias in misdiagnosis and self-supervised learning

With Anurag Vaidya and Drew Williamson (published in Nature Medicine, April 2024), we present an in-depth examination of AI biases in pathology and their disparate impact on diverse patient populations. How fair are pathology AI models in cancer subtyping and biomarker prediction when trained on TCGA and evaluated on self-reported White, Asian, and Black patient cohorts? Amongst the contributing factors investigated, we find that using UNI as an ROI encoder increases overall and subgroup-level performance, which can have a significant impact in mitigating performance disparities. However, persistent gaps remain in tasks such as IDH1 mutation prediction, and further investigation is required to understand the impact of cancer health disparities on model development. With foundation models having profound implications for accelerating the development of AI-SaMDs for individual disease models, equal due diligence must be applied toward scrutinizing their performance and failure modes across diverse demographics (following reporting guidelines such as CONSORT-AI and STARD-AI).

CONCH

Led by Ming Yang (Max) Lu, Bowen Chen, and Drew Williamson (published in Nature Medicine, March 2024), we developed CONCH, a vision-language ROI foundation model trained on 1.17M histopathology image-caption pairs, available on HuggingFace. Since our preprint, we have expanded the evaluation of CONCH (with updated comparisons) on challenging slide-level tasks, including rare disease classification (EBRAINS, OT-108) and biomarker screening with IHC slides (ER and PR assessment). We present two new findings.

First, we find CONCH to be more versatile when applied to IHC tasks. Though UNI and CONCH are close in performance across slide-level tasks, we observe greater performance gains when using CONCH on IHC tasks (+0.034 quad. weighted κ over UNI on ER/PR screening) than on H&E tasks (-0.12 quad. weighted κ relative to UNI on PANDA). This may be attributed to the pretraining data distribution of CONCH, which includes IHC images, as well as to CONCH learning representational invariances when aligned with language. Though further investigation is needed, our findings suggest that CONCH may offer better versatility than UNI and other pretrained encoders when extracting features from multimodal IHC data.

Second, we find vision-language pretraining to be a data-efficient paradigm for building ROI foundation models. Across all tasks, UNI and CONCH are consistently among the top-3 best-performing models, with UNI achieving the best overall performance. However, we note the difference in resources used: CONCH was first developed using 16M images via vision-only SSL (iBOT on Mass-22K, 66M parameters in ViT-Base) and then aligned with 1.17M image-caption pairs; UNI was developed using 100M images via vision-only SSL (DINOv2 on Mass-100K, 307M parameters in ViT-Large). Despite using significantly less data and model capacity than UNI, CONCH comes close to the same performance on certain tasks (< 0.02 performance gap), with additional capabilities from vision-language integration (e.g., zero-shot classification and text-to-image retrieval). Still, initializing with vision-only SSL before vision-language pretraining seems to be an important prerequisite: PLIP and Quilt-1M, which are pretrained with 200K and 1M image-caption pairs respectively (without vision-only SSL), lag behind CONCH and other vision encoders.
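
To illustrate what the zero-shot classification capability looks like with an aligned vision-language encoder, here is a schematic sketch. `encode_image` and `encode_text` are hypothetical stand-ins for whichever aligned encoders are used (e.g., a CONCH-like model); consult the model's own documentation for the real API.

```python
# Schematic zero-shot classification pattern: score an ROI embedding
# against text prompts describing each candidate class.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_embedding, class_prompts, encode_text, temperature=0.07):
    """image_embedding: (d,) aligned image feature; class_prompts: list of strings."""
    text_embeddings = torch.stack([encode_text(p) for p in class_prompts])  # (C, d)
    img = F.normalize(image_embedding, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    logits = img @ txt.T / temperature          # cosine similarity per class
    return logits.softmax(dim=-1)               # (C,) class probabilities

# prompts = ["an H&E image of invasive ductal carcinoma",
#            "an H&E image of invasive lobular carcinoma"]
# probs = zero_shot_classify(encode_image(roi), prompts, encode_text)
```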

Towards Slide-Level Foundation Models

At CVPR 2024 in Seattle, we are presenting two new frameworks that build on top of UNI/CONCH features to create slide-level foundation models: (1) TANGLE and (2) PANTHER. In the same way that UNI/CONCH are powerful pretrained ROI encoders that can be used out-of-the-box to extract features from ROIs, TANGLE and PANTHER are pretrained slide encoders that can be used out-of-the-box to extract features from the entire WSI. No MIL supervision required!
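
To make "out-of-the-box" concrete: once a pretrained slide encoder has produced one feature vector per WSI, a downstream task can be as simple as a linear probe. The sketch below assumes slide embeddings and labels have already been saved to disk (file names are placeholders).

```python
# Linear probing on pre-extracted slide features: no MIL training involved.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Placeholder files: one pre-extracted embedding per slide.
X_train = np.load("slide_features_train.npy")   # (n_slides, d)
y_train = np.load("labels_train.npy")
X_test = np.load("slide_features_test.npy")
y_test = np.load("labels_test.npy")

probe = LogisticRegression(max_iter=10_000, C=1.0)
probe.fit(X_train, y_train)
print("balanced acc:", balanced_accuracy_score(y_test, probe.predict(X_test)))
```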

TANGLE

Led by Guillaume Jaume and Lukas Oldenburg (published in CVPR 2024), we are excited to present TANGLE, a transcriptomics-guided slide representation learning framework based on multimodal contrastive learning. Building on the insights from CONCH in developing data-efficient vision encoders via multimodal pretraining, TANGLE uses: (1) inter-modal contrastive learning to align bulk transcriptomics features with the corresponding slide feature (from a trainable MIL module that aggregates ROI features), (2) intra-modal contrastive learning to align slide features (using sampled bags of patch features as views), and (3) transcriptomics expression regression as a reconstruction task. The main objective in TANGLE is to pretrain the slide encoder such that it can extract powerful representations of WSIs. In doing so, we can deliver the same range of capabilities demonstrated at the ROI level by UNI and CONCH for slide-level tasks.
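
A schematic of the three TANGLE objectives (not the released implementation) is sketched below: a symmetric InfoNCE term aligning slide and expression embeddings, an intra-modal InfoNCE term between two sampled views of the same slide, and an expression reconstruction term. Encoders, loss weights, and dimensions are placeholders.

```python
# Schematic of TANGLE's three training objectives. Sketch only.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def tangle_loss(slide_emb, slide_emb_view, expr_emb, expr_pred, expr_true,
                w_inter=1.0, w_intra=1.0, w_recon=1.0):
    inter = info_nce(slide_emb, expr_emb)         # slide <-> transcriptomics
    intra = info_nce(slide_emb, slide_emb_view)   # two patch subsets of one WSI
    recon = F.mse_loss(expr_pred, expr_true)      # expression reconstruction
    return w_inter * inter + w_intra * intra + w_recon * recon
```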

We develop TANGLE on several rich histology datasets with paired transcriptomics data (TCGA and TG-GATEs), with extensive evaluation on cancer subtyping tasks (internal BWH data, N=1265 breast slides and N=1946 lung slides) and lesion classification in rat liver studies (independent test fold in TG-GATEs, N=4584 slides). An overview of TANGLE performance and further insights:

1. TANGLE is a strong few-shot learner: With as few as 25 slides per class for training, TANGLE can reach 0.90 AUC on both breast and lung cancer subtyping (evaluated on very large cancer cohorts).

2. Inter-modal outperforms intra-modal contrastive learning: In carefully ablating TANGLE components, we find that multimodal contrastive learning with transcriptomics outperforms unimodal contrastive learning using only WSIs. This follows our previous insights from CONCH and other vision-language pretraining studies that multimodality is a data-efficient paradigm for pretraining strong vision encoders.

3. We release TANGLE weights pretrained on UNI features: On GitHub, we release the pretrained ABMIL weights used in TANGLE with (1) TCGA-BRCA pretraining and (2) TCGA pan-cancer pretraining. We think this is an exciting addition for current users who have experimented with UNI, and hope that the same prototypical and retrieval capabilities extend to the WSI level. More to come!

PANTHER

With Andrew H. Song (published in CVPR 2024), we are excited to present PANTHER, a Prototype AggregatioN-based framework for compacT HEterogeneous slide set Representations. PANTHER takes a radically different approach to slide-level SSL than HIPT and TANGLE. Motivated by the strong retrieval and prototypical learning capabilities of UNI, we ask: if the goal is to learn a "compact summary of the slide", why not first build a strong baseline from a prototypical set of visual concepts?

PANTHER works in two steps: (1) apply K-Means clustering globally across patch features from all training WSIs to obtain a set of prototypes, and (2) use these prototypes to initialize and fit a GMM via Expectation-Maximization (EM) for each WSI. To build a slide representation, we simply concatenate the GMM parameters (mixture probability, mean, covariance) of each component into a single feature vector that represents the slide. For C=16 components, the extracted slide representation from PANTHER is 32784-dimensional.
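
A simplified rendering of this recipe is sketched below using scikit-learn (the real PANTHER implementation differs in details such as the EM procedure and covariance handling); file names and hyperparameters are placeholders.

```python
# Simplified prototype-then-GMM slide embedding. Sketch only; see the
# PANTHER repository for the actual implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

C = 16  # number of prototypes / mixture components

# (1) Global prototypes from patch features pooled over all training WSIs.
train_patch_feats = np.load("all_training_patch_features.npy")   # (N_patches, 1024)
prototypes = KMeans(n_clusters=C, random_state=0).fit(train_patch_feats).cluster_centers_

# (2) Per-slide GMM initialized at the prototypes, flattened into one vector.
def panther_embed(slide_patch_feats):
    gmm = GaussianMixture(
        n_components=C,
        covariance_type="diag",
        means_init=prototypes,
        random_state=0,
    ).fit(slide_patch_feats)
    # Concatenate (mixture weight, mean, diagonal covariance) per component:
    # 16 * (1 + 1024 + 1024) = 32784 dimensions.
    return np.concatenate(
        [gmm.weights_.reshape(-1, 1), gmm.means_, gmm.covariances_], axis=1
    ).ravel()
```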

We evaluate PANTHER on several diverse classification tasks (EBRAINS, PANDA, and more) and survival tasks (pan-cancer TCGA with independent validation on CPTAC and NLST for certain cancer types). An overview of PANTHER performance and further insights:

1. PANTHER is competitive across task types: Compared against 9 MIL methods, PANTHER is competitive on classification tasks (best on 5/7 evaluated classification metrics) and is among the top-2 best-performing models on all 9 survival tasks (best on 6/9).

2. PANTHER trains stable survival models: With pre-extracted slide representations, one can train linear or MLP survival models with large batch sizes, and thus directly with the Cox loss, whose partial likelihood benefits from large risk sets (a minimal sketch of such a batched Cox loss appears after point 3 below). In our main results and all ablation experiments, the c-Index across all survival tasks remained above 0.6 (with an average c-Index of 0.686), even under external validation on CPTAC and NLST.

3. PANTHER performance depends on strong ROI features: PANTHER with ImageNet or CTransPath features underperforms MIL. With strong ROI features, it becomes worthwhile to revisit previous baselines such as H2T, as well as unexplored baselines such as Optimal Transport Kernel Embedding (OT) for learning set representations.
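
As referenced in point 2 above, here is a minimal batched Cox partial-likelihood loss of the kind that becomes practical when each WSI is a single pre-extracted feature vector and a large batch (or the whole cohort) fits in memory. This is a sketch with approximate tie handling, not our exact training code.

```python
# Negative Cox partial log-likelihood for a batch of risk scores.
import torch

def cox_partial_likelihood_loss(risk_scores, times, events):
    """risk_scores: (B,) predicted log-hazards; times: (B,) follow-up times;
    events: (B,) 1 if the event was observed, 0 if censored."""
    order = torch.argsort(times, descending=True)
    risk, ev = risk_scores[order], events[order]
    # Risk set for sample i = all samples with time >= times[i]; after sorting
    # by descending time this is the prefix [0..i], so use a running logsumexp.
    log_cum_hazard = torch.logcumsumexp(risk, dim=0)
    log_lik = (risk - log_cum_hazard) * ev
    return -log_lik.sum() / ev.sum().clamp(min=1)

# Typical use: scores = mlp(slide_features).squeeze(-1)   # B = hundreds of WSIs
# loss = cox_partial_likelihood_loss(scores, times, events); loss.backward()
```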

(Aside) Comparing MIL architectures across pretrained encoders:

  • Performance gains from stronger MIL models diminish with stronger ROI encoders: With ImageNet features, models such as TransMIL show a +0.319 balanced accuracy improvement over ABMIL on EBRAINS. With CTransPath features, TransMIL still shows a +0.124 improvement. With UNI features, the gap closes to +0.024. A similar closing of the performance gap is seen on PANDA.
  • Simple MIL models with strong ROI features outperform advanced MIL models with weak features: This follows the previous point on trading off a stronger MIL model against a stronger pretrained encoder. ABMIL with CTransPath features outperforms all MIL models with ResNet-50 features on classification tasks. Similarly, ABMIL with UNI features outperforms nearly all MIL models with CTransPath features on classification tasks. Survival tasks show less clear trends, which may be attributed to the limited expressivity of weakly-supervised MIL for this task type.

4. Mixture probabilities in PANTHER capture cardinality: For LUAD versus LUSC subtyping in TCGA-Lung (with independent evaluation on CPTAC-Lung), we plot the distribution of mixture probabilities for the C=16 prototypes. Based on the nearest ROIs for each prototype, we worked with Drew Williamson to understand the semantic meaning of each prototype. Though many prototypes correspond to tumor, we find that ROIs assigned to C2 and C15 almost always appear in LUAD, and ROIs assigned to C12 almost always appear in LUSC. Prototypical assignment maps of each ROI to its nearest prototype are shown in the figure below.

5. Limitations: As unsupervised slide representations in PANTHER are created using classical clustering-based techniques such as K-Means and GMMs (which rely on Euclidean distance or dot products to compare embeddings), we note the following limitations:

  • Depending on the degree of dataset shift between the train and test distributions (e.g., H&E stain variability, known as image acquisition shift), prototype assignment for certain WSIs may collapse, with all patches assigned to a single prototype. This is exemplified in TCGA, which has site-specific biases, and is thus an important consideration when using PANTHER (or any clustering-based approach) for histopathologic biomarker discovery.

  • When clustering over a WSI dataset composed of millions to billions of patches, using only C=16 clusters will likely underfit the dataset and can also lead to collapse, with all patches in a WSI falling under a single prototype. Empirically, we found C=16 to outperform C=32 in supervised settings. However, in settings such as biomarker discovery or unsupervised tissue segmentation, using more prototypes may improve performance.

6. Concluding thoughts and outlook: This work builds on several other important ideas in machine learning, computer vision, and computational pathology. The idea of clustering local visual features within a global image is (to my knowledge) motivated by early work on textons (Nature 1981), which was modernized in computer vision (IJCV 2001) by conceptualizing the idea of learning "a visual vocabulary or codebook" via clustering. This later motivated the idea of "learning signatures" (CVPR 2003) for image classification, which performs local clustering of SIFT descriptors within each image, followed by using each image's set of centroids in set-to-set comparisons via Earth Mover's Distance. This emerged somewhat concurrently with bag of visual words (BoVW) (CVPR 2003, CVPR 2005), which performed global clustering of SIFT descriptors across all images and represented each image by its co-occurring words, with BoVW ultimately gaining dominance (TPAMI 2005) and seeing application in pathology (AIM 2009). Fast-forwarding to the late 2010s and early 2020s, clustering-based and BoVW-inspired modeling for learning unsupervised set-based representations of WSIs has seen application in HPL (arXiv 2022) and H2T (MedIA 2023), but did not receive as much attention as developments in MIL. As the concluding chapter of my PhD, I am reminded of a talk by Alan Yuille at CVPR 2023: many ideas in AI seem to go round and round in circles, being forgotten and then rediscovered. However, with each revolution, we gain access to better computing resources for computer vision (from processing ROIs to WSIs) and better feature extractors (from SIFT descriptors to Transformer tokens), and we make progress along a "performance dimension". Changing perspective, this circle is a helix. Another revolution will be completed with the availability of powerful ROI encoders in computational pathology, with significant progress in slide-level self-supervised learning made not only via HIPT-like models with hierarchical pretraining and TANGLE-like models with multimodal pretraining, but also via PANTHER-like models (and other classic ideas such as signatures and BoVW), thanks to their (1) simplicity in developing strong slide-level baselines and (2) human interpretability in discovering meaningful morphological concepts that can be used for biomarker discovery.
