Segmentation with Clicks

Generating ground truth segmentation masks is a challenging and time-consuming task even today. We have come a long way, from annotating each pixel with its class and using graph cuts, to using deep learning models that make annotation far more convenient. Yet there is still plenty of room for improvement, especially in medical imaging.

Medical image data, such as CT scans, MRIs, and OCT, is challenging to acquire since it has to be collected from actual patients. We have to understand that this data is completely different from the natural images we usually train on, such as cars or any other class from the ImageNet dataset. Hence, models pretrained on such images fail to capture bone and tissue structure in medical scans. If you have a machine learning background, you might already know what I am talking about. YES, you are right, Transfer Learning is what we need here.

Simple background: Transfer Learning is when you take a model that has been trained on a different data distribution (for example, a model trained on ImageNet) and modify (or fine-tune) its weights so that it performs well on YOUR dataset.
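If you want to see what that looks like in code, here is a minimal PyTorch/torchvision sketch (the backbone, the two-class head, and the freezing strategy are just placeholder choices, not anything specific to this article):

```python
import torch.nn as nn
import torchvision

# Load a model pretrained on ImageNet (a different data distribution).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Replace the classification head so it matches OUR dataset (e.g. 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# One common strategy: freeze the pretrained backbone and train only the new head.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
```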

Welcome to my first article. I will be talking about improving annotation and segmentation using deep learning models. Grab some chips and enjoy.

Transformer models were first introduced in natural language processing in the paper Attention Is All You Need by Google Brain and the University of Toronto. Transformers use the attention mechanism (psst... they were not the first to introduce attention) to make predictions. You can think of attention like this: take the sentence 'The cat sat on the mat'. We immediately notice that the important words are 'cat', 'sat', 'on', 'mat'; in fact, we still get the same context from just 'cat sat on mat'. Intuitively, these words deserve higher weightage in the sentence than words such as 'the', which can also be classified as 'stop words'. Soon transformers made their appearance in computer vision as Vision Transformers, which have (mostly) dominated convolutional neural networks.
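To make that 'weightage' intuition concrete, here is a tiny NumPy sketch of scaled dot-product attention (toy dimensions only, none of the multi-head machinery from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; more relevant tokens get higher weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise similarity between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1
    return weights @ V                                 # weighted mix of the values

# Toy example: 6 tokens ("The cat sat on the mat"), 4-dimensional embeddings.
tokens = np.random.randn(6, 4)
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (6, 4)
```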

The Segment Anything Model is a Vision Transformer model that can be controlled with prompts. A prompt can be text, a click, a box, anything that acts as an input controlling what you want as the output. It was introduced by researchers at Meta and has quickly gained everyone's attention (<= this is the attention mechanism : ) ).

Figure 1: Segment Anything Model

As you can see in Figure 1, the Segment Anything Model consists of three components: an Image Encoder, a Prompt Encoder, and a Mask Decoder. The Image Encoder is a Vision Transformer (ViT) that takes an image and produces patch-wise features. The Prompt Encoder encodes prompt information such as points, boxes, and text. The Mask Decoder takes the patch-wise features from the ViT and the prompt encodings to generate masks and rank them by confidence. This model has been trained on over 1 billion masks (good luck if you want to do that at home), and hence captures boundaries and objects remarkably well, as we can see in Figure 2.

Figure 2: Sample Segment Anything output
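If you want to play with SAM yourself, Meta's segment-anything package wraps the three components behind a simple predictor interface. A minimal sketch with a single click prompt (the checkpoint path, image file, and click coordinates are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a ViT-H image encoder + prompt encoder + mask decoder from a checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("ultrasound.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the image encoder once per image

# One positive click (label 1 = foreground, label 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[250, 180]]),
    point_labels=np.array([1]),
    multimask_output=True,  # the decoder returns several masks ranked by confidence
)
best_mask = masks[np.argmax(scores)]
```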

A click prompt is very easy to produce and can therefore greatly improve the efficiency of annotating medical datasets. And thus, I introduce you to ClickSAM. ClickSAM can accurately predict masks of breast tumors in ultrasound images, and more importantly, it revisits the idea of click-based training.

ClickSAM

Click-based training is not new: we train the model by accepting user input. In this approach, we use the available segmentation ground truths to simulate user clicks. ClickSAM applies this principle to fine-tune the Segment Anything Model for prompt-based segmentation of breast tumors in ultrasound images.

Click Generation

Before we go into this, I want you to know what True Positives, False Positives, and False Negatives are. In short, if you predict a pixel to be tumor (say 1) but in reality that pixel is not tumor (pixel label is 0), then it is a false positive. You falsely said positive for a tumor; similarly, if you falsely say negative for a tumor, it is a false negative.

  • Pixel Label: 0 - Prediction: 0, it is a True Negative
  • Pixel Label: 0 - Prediction: 1, it is a False Positive
  • Pixel Label: 1 - Prediction: 0, it is a False Negative
  • Pixel Label: 1 - Prediction: 1, it is a True Positive
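In pixel terms, this table is just a few boolean operations on two binary masks, for example:

```python
import numpy as np

gt = np.array([[1, 1, 0],
               [0, 1, 0],
               [0, 0, 0]], dtype=bool)    # ground truth: 1 = tumor
pred = np.array([[1, 0, 0],
                 [0, 1, 1],
                 [0, 0, 0]], dtype=bool)  # model prediction

tp = np.sum(pred & gt)    # predicted tumor, actually tumor
fp = np.sum(pred & ~gt)   # predicted tumor, actually background
fn = np.sum(~pred & gt)   # predicted background, actually tumor
tn = np.sum(~pred & ~gt)  # predicted background, actually background
print(tp, fp, fn, tn)     # 2 1 1 5
```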

Figure 3

Figure 3 above illustrates the idea of false positives and false negatives in segmentation with respect to the ground truth segmentation.

Generating Clicks

We have to generate clicks in such a way that if you get a false negative, you add that region to your predicted mask, and if you get a false positive, you remove that region from your predicted mask, bringing the prediction closer to the ground truth.
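In code, the regions where corrective clicks should land fall straight out of the prediction and ground truth masks. A small sketch (the random sampling here is only for illustration; ClickSAM places clicks with the Voronoi procedure described next):

```python
import numpy as np

def click_regions(pred, gt):
    """Regions where corrective clicks should be sampled."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    positive_region = gt & ~pred   # false negatives: clicks here ADD area to the mask
    negative_region = pred & ~gt   # false positives: clicks here REMOVE area
    return positive_region, negative_region

def sample_clicks(region, n_clicks, seed=0):
    """Sample up to n_clicks (x, y) coordinates inside a binary region."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(region)
    if len(ys) == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(len(ys), size=min(n_clicks, len(ys)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)
```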

Figure 4, Source: Dataset of Breast Ultrasound Images


Let's say we have two types of clicks: positive clicks add a region to the prediction, negative clicks remove a region from the prediction. Figure 4 illustrates positive and negative clicks; green clicks are positive, red clicks are negative. Once we have these clicks, we train the model over multiple rounds until we get good enough masks. How do we generate these clicks? We use Voronoi Tessellation for this purpose. It is closely related to clustering; in fact, the cluster assignment in K-Means produces exactly such a partition. Given a set of points, you partition the space so that each point gets its own "cell" of nearest pixels.
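To see the Voronoi idea in code, you can assign every pixel of an image grid to its nearest click, which carves the image into one cell per click (a quick illustration using scipy's KD-tree, not ClickSAM's actual implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

h, w = 128, 128
clicks = np.array([[20, 30], [90, 40], [60, 100]])  # (row, col) click points

# Build all pixel coordinates and find the nearest click for each pixel.
rows, cols = np.mgrid[0:h, 0:w]
pixels = np.stack([rows.ravel(), cols.ravel()], axis=1)
_, nearest = cKDTree(clicks).query(pixels)

cells = nearest.reshape(h, w)        # Voronoi cell label per pixel
print(np.bincount(cells.ravel()))    # how many pixels each click "owns"
```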

Figure 5: Voronoi Tessellation

Steps to generate clicks for each segment (for each FP, FN, TP, TN region):

  • Initialize random clicks
  • Compute the energy of the pixels: for each pixel, find the nearest click, compute the distance to it, and take the inverse of that distance to get the energy.
  • Apply Simulated Annealing to the click coordinates within the segment region to find the set of click coordinates that maximizes the energy.

Simulated Annealing can be explained simply: if a candidate next state has a higher reward than your current state, you move to it; otherwise, you still move to it with some probability. This way, the final state ends up (approximately) at the global optimum.
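Putting the energy and annealing steps together, here is a hedged sketch of what click placement could look like (the exact energy definition and annealing schedule in ClickSAM may differ; this only shows the mechanics):

```python
import numpy as np

def energy(clicks, region_pixels, eps=1e-6):
    """Sum over region pixels of 1 / (distance to the nearest click)."""
    d = np.linalg.norm(region_pixels[:, None, :] - clicks[None, :, :], axis=-1)
    return np.sum(1.0 / (d.min(axis=1) + eps))

def anneal_clicks(region, n_clicks=3, steps=200, temp=1.0, cooling=0.98, seed=0):
    """Simulated annealing: move clicks inside `region` to maximize the energy."""
    rng = np.random.default_rng(seed)
    pix = np.argwhere(region)                                   # (row, col) of region pixels
    clicks = pix[rng.choice(len(pix), n_clicks, replace=False)].astype(float)
    e = energy(clicks, pix)
    best, best_e = clicks.copy(), e
    for _ in range(steps):
        cand = clicks.copy()
        i = rng.integers(n_clicks)
        cand[i] = pix[rng.integers(len(pix))]                   # propose moving one click
        cand_e = energy(cand, pix)
        # Accept better states; accept worse ones with a temperature-dependent probability.
        if cand_e > e or rng.random() < np.exp((cand_e - e) / temp):
            clicks, e = cand, cand_e
            if e > best_e:
                best, best_e = clicks.copy(), e
        temp *= cooling                                         # cool down over time
    return best
```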

Now that we know how to generate clicks, we generate the initial clicks by using the ground truth as the prediction. We then train the model and later use its own predictions to generate new clicks and fine-tune further.

Figure 6: Pipeline for training the ClickSAM

Figure 6 shows the pipeline used to train the model. Quickly summarized: you provide the image and prompts as input, generate initial segmentations, compare them with the ground truth, generate positive and negative clicks using the Voronoi tiling algorithm, and finally train the model.
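For intuition only, the loop in Figure 6 might look roughly like the following PyTorch-style sketch. The helpers generate_clicks and predict_logits, the loss choice, and which SAM parameters get updated are all assumptions on my part, not the authors' exact code:

```python
import torch

# Assumed helpers (not real library functions):
#   generate_clicks(pred, gt)              -> (positive clicks, negative clicks), as sketched above
#   predict_logits(model, image, pos, neg) -> mask logits from SAM given click prompts
optimizer = torch.optim.Adam(model.mask_decoder.parameters(), lr=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()

for image, gt_mask in dataloader:
    # Round 0: simulate the user by generating clicks from the ground truth itself.
    pred = gt_mask
    for _ in range(num_click_rounds):
        pos, neg = generate_clicks(pred=pred, gt=gt_mask)
        logits = predict_logits(model, image, pos, neg)
        loss = loss_fn(logits, gt_mask.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Later rounds: clicks come from comparing the new prediction with the ground truth.
        pred = (logits.detach().sigmoid() > 0.5)
```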


Results

Figure 7: ClickSAM compared to MedSAM

Figure 7 shows the predictions compared to the ground truths. ClickSAM is compared to MedSAM here; MedSAM is a similar prompt-based model, but it uses bounding boxes as prompts.

Mean IoU comparison on the Ultrasound Breast Dataset:

  • Segment Click Train (encoding points directly into the image): 70.7%
  • MedSAM (SAM fine-tuned with bounding boxes, with prompt encoder): 86.15%
  • ClickSAM (SAM fine-tuned with generated clicks, with prompt encoder): 94.39%


Conclusion

This article aims to shed more light on the potential of user-feedback-guided self-training to drastically improve a model's performance, irrespective of the domain. It also emphasizes that for the initial training we do not actually need real user feedback; we can simulate a user-feedback environment that the model learns to adapt to, and real feedback can later be used to fine-tune. This is also a huge benefit because the model becomes better every time you use it, meaning it can be deployed while it keeps improving its accuracy to fit the use case. Hope you like my writing. You can read more about my work ClickSAM here.








