Pseudo Labeling

Pseudo Labeling (Lee 2013) assigns fake labels to unlabeled samples based on the maximum softmax probabilities predicted by the current model and then trains the model on both labeled and unlabeled samples simultaneously in a pure supervised setup.

Why could pseudo labels work? Pseudo labeling is in effect equivalent to Entropy Regularization (Grandvalet & Bengio 2004), which minimizes the conditional entropy of class probabilities for unlabeled data to favor low-density separation between classes. In other words, the predicted class probabilities are in fact a measure of class overlap, so minimizing the entropy reduces class overlap and thus encourages low-density separation.
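Concretely, writing $f_c(x)$ for the model's predicted probability of class $c$ on input $x$ (notation ours, for illustration), the entropy term minimized over the unlabeled set $U$ with $C$ classes is

$$\mathcal{H}(U) = -\frac{1}{|U|}\sum_{x \in U}\sum_{c=1}^{C} f_c(x)\,\log f_c(x),$$

which goes to zero only when each unlabeled prediction collapses onto a single class, i.e. when the decision boundary passes through low-density regions.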

Training with pseudo labeling naturally comes as an iterative process. We refer to the model that produces pseudo labels as the teacher and the model that learns from pseudo labels as the student.


Fig. 9. t-SNE visualization of outputs on the MNIST test set by models trained (a) without and (b) with pseudo labeling on 60,000 unlabeled samples, in addition to 600 labeled samples. Pseudo labeling leads to better segregation in the learned embedding space. (Image source: Lee 2013)

How to apply pseudo labeling

Pseudo labeling is a five-step process.

(1) Build a model using the training data.

(2) Predict labels for an unseen test (or unlabeled) dataset.

(3) Add the confidently predicted test observations to the training data.

(4) Build a new model using the combined data.

And (5) use the new model to predict the test data (and, in a Kaggle setting, submit the predictions). Here is a pictorial explanation using synthetic 2D data; a minimal code sketch of these five steps follows the figures.


QDA ellipses before pseudo labelling


After pseudo labelling: QDA has found better ellipses than before
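The same five steps can be written down in a few lines of scikit-learn. This is only a rough sketch: the choice of classifier (logistic regression), the 0.9 confidence threshold, and the function name are placeholders, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_pipeline(X_train, y_train, X_test, threshold=0.9):
    """Steps (1)-(5): fit, predict, keep confident predictions as pseudo
    labels, refit on the combined data, and predict the test set again."""
    model = LogisticRegression().fit(X_train, y_train)             # (1)
    probs = model.predict_proba(X_test)                            # (2)
    conf = probs.max(axis=1)
    pseudo = model.classes_[probs.argmax(axis=1)]
    keep = conf >= threshold                                       # (3) confident rows only
    X_comb = np.vstack([X_train, X_test[keep]])
    y_comb = np.concatenate([y_train, pseudo[keep]])
    model = LogisticRegression().fit(X_comb, y_comb)               # (4)
    return model.predict(X_test)                                   # (5)
```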

Why pseudo labeling works

How pseudo labeling works is best understood with QDA (Quadratic Discriminant Analysis). QDA fits a Gaussian to each class, i.e. it uses points in p-dimensional space to estimate hyper-ellipsoids (see here). With more points, QDA can better estimate the center and shape of each ellipsoid, and consequently make better predictions afterwards.

A great description is available there.
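To see the effect numerically, here is a small sketch on synthetic 2D Gaussian data; the class means, covariances, label budget and 0.9 confidence threshold are all made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two synthetic 2D Gaussian blobs, mimicking the figures above
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=500)
X1 = rng.multivariate_normal([3, 3], [[1.0, -0.3], [-0.3, 1.0]], size=500)
X, y = np.vstack([X0, X1]), np.array([0] * 500 + [1] * 500)

# Pretend only 20 labels are known; the rest is treated as unlabeled
labeled = rng.choice(len(X), size=20, replace=False)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X[labeled], y[labeled])

# Pseudo-label the confident points and refit: the per-class means and
# covariances (the ellipsoid centers and shapes) are now estimated from
# many more points, so they end up closer to the true Gaussians.
probs = qda.predict_proba(X)
keep = probs.max(axis=1) >= 0.9
qda_refit = QuadraticDiscriminantAnalysis(store_covariance=True).fit(
    np.vstack([X[labeled], X[keep]]),
    np.concatenate([y[labeled], qda.classes_[probs[keep].argmax(axis=1)]]),
)
print(qda_refit.means_)  # refined ellipsoid centers
```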

Where to start

This notebook, built on an anonymized binary classification dataset, provides a simple end-to-end pipeline showing how to apply pseudo labelling to improve your model's score.

Where to go further

Noisy samples as learning targets

Several recent consistency training methods learn to minimize prediction difference between the original unlabeled sample and its corresponding augmented version.
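As a rough illustration of the idea (the `model` and `augment` callables are placeholders, and real methods differ in how they augment samples and weight this term), a consistency loss can look like:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, augment):
    """Penalize disagreement between predictions on an unlabeled batch
    and on its augmented version; `augment` is a user-supplied transform."""
    with torch.no_grad():
        p_orig = F.softmax(model(x_unlabeled), dim=-1)          # fixed target
    log_p_aug = F.log_softmax(model(augment(x_unlabeled)), dim=-1)
    return F.kl_div(log_p_aug, p_orig, reduction="batchmean")
```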

Label propagation

Label Propagation (Iscen et al. 2019) is an idea to construct a similarity graph among samples based on their feature embeddings. The pseudo labels are then “diffused” from known samples to unlabeled ones, with propagation weights proportional to the pairwise similarity scores in the graph. Conceptually it is similar to a k-NN classifier, and both suffer from not scaling well to large datasets.
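scikit-learn ships a classical version of this idea. Note that Iscen et al. build the graph on learned network embeddings, whereas the toy example below simply uses the raw features of a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.full_like(y_true, -1)   # -1 marks unlabeled points
y[:10] = y_true[:10]           # keep only 10 labeled samples

# Diffuse labels over a k-NN similarity graph built on the features
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)
print("accuracy on unlabeled points:",
      (model.transduction_[10:] == y_true[10:]).mean())
```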


Reducing confirmation bias

Confirmation bias is a problem with incorrect pseudo labels provided by an imperfect teacher model. Overfitting to wrong labels may not give us a better student model.

To reduce confirmation bias, Arazo et al. (2019) proposed two techniques: applying mixup with soft pseudo labels, and oversampling the labeled data so that every mini-batch contains a minimum number of labeled samples.
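As a sketch of the mixup part (the alpha value and tensor shapes are illustrative, and the paper's exact formulation may differ):

```python
import torch

def mixup_soft_labels(x, y_soft, alpha=0.4):
    """Mix pairs of inputs and their soft (pseudo) label distributions.
    x: (B, ...) batch of inputs, y_soft: (B, C) per-class probabilities."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_soft + (1 - lam) * y_soft[perm]
    return x_mix, y_mix
```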

FixMatch

FixMatch (Sohn et al. 2020) generates pseudo labels on unlabeled samples with weak augmentation and only keeps predictions with high confidence. Here both weak augmentation and high confidence filtering help produce high-quality trustworthy pseudo label targets. Then FixMatch learns to predict these pseudo labels given a heavily-augmented sample.
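A minimal sketch of the unlabeled-data part of the objective (the weak/strong augmentation functions are placeholders; 0.95 is the confidence threshold used in the paper):

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    """Pseudo-label weakly augmented samples, keep only confident ones,
    and train the model to predict them on strongly augmented versions."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=-1)
        conf, pseudo = probs.max(dim=-1)        # confidence and hard pseudo label
        mask = (conf >= threshold).float()      # keep only confident predictions
    logits_strong = model(strong_aug(x_unlabeled))
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()
```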


Combining with Powerful Pre-Training

It is a common paradigm, especially in language tasks, to first pre-train a task-agnostic model on a large unsupervised data corpus via self-supervised learning and then fine-tune it on the downstream task with a small labeled dataset. Research has shown that we can obtain an extra gain by combining semi-supervised learning with pre-training.

You can read about these more advanced topics in more detail in this article.
