Pseudo Labeling
Ivan Isaev
ML tech-lead and senior engineer | Ex-Head of ML & DS | Ex-Head of Engineering | Kaggle Competitions Master
Pseudo Labeling (Lee 2013) assigns fake labels to unlabeled samples based on the maximum softmax probabilities predicted by the current model and then trains the model on both labeled and unlabeled samples simultaneously in a pure supervised setup.
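The core mechanic can be sketched in a few lines: take the argmax of the predicted softmax probabilities as the label, optionally keeping only confident predictions. This is a minimal sketch (the 0.95 threshold and function name are illustrative, not from Lee 2013):

```python
import numpy as np

def make_pseudo_labels(probs, threshold=0.95):
    """Assign the argmax class as a pseudo label to every sample whose
    maximum softmax probability reaches the threshold.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    Returns (indices of confident samples, their pseudo labels).
    """
    confidence = probs.max(axis=1)   # max softmax probability per sample
    keep = confidence >= threshold   # keep only confident predictions
    return np.where(keep)[0], probs.argmax(axis=1)[keep]

# Example: three unlabeled samples; only the confident ones get pseudo labels.
probs = np.array([[0.98, 0.02],   # confident -> pseudo label 0
                  [0.60, 0.40],   # uncertain -> dropped
                  [0.01, 0.99]])  # confident -> pseudo label 1
idx, labels = make_pseudo_labels(probs, threshold=0.95)
```

The confident samples are then mixed into the labeled set and trained on with an ordinary supervised loss.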
Why could pseudo labels work? Pseudo labeling is in effect equivalent to Entropy Regularization (Grandvalet & Bengio 2004), which minimizes the conditional entropy of class probabilities for unlabeled data to favor low-density separation between classes. In other words, the predicted class probabilities are in fact a measure of class overlap, so minimizing the entropy is equivalent to reducing class overlap and thus encouraging low-density separation.
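To make the connection concrete: the entropy of a predicted distribution, H(p) = -Σ p log p, is large when the model hedges between classes and small when it commits to one. Training on hard pseudo labels pushes predictions toward one-hot distributions, which drives this entropy down. A minimal illustration:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of predicted class probabilities: -sum(p * log p)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

# A confident prediction has low entropy; an uncertain one has high entropy.
# Fitting to hard pseudo labels moves predictions from the latter to the former.
confident = np.array([0.99, 0.01])
uncertain = np.array([0.50, 0.50])
```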
Training with pseudo labeling naturally comes as an iterative process. We refer to the model that produces pseudo labels as teacher and the model that learns with pseudo labels as student.
How to apply pseudo labeling
Pseudo labeling is a 5 step process.
(1) Build a model using training data.
(2) Predict labels for an unseen test dataset.
(3) Add confident predicted test observations to our training data.
(4) Build a new model using combined data.
(5) Use your new model to predict the test data and submit to Kaggle. Here is a pictorial explanation using synthetic 2D data.
Why pseudo labeling works
How pseudo labeling works is best understood with QDA (Quadratic Discriminant Analysis). QDA works by using points in p-dimensional space to find hyper-ellipsoids, see here. With more points, QDA can better estimate the center and shape of each ellipsoid (and consequently make better predictions afterward).
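The intuition can be checked directly: QDA estimates a mean and covariance (the center and shape of an ellipsoid) per class, and with more points those estimates get closer to the truth. A small sketch with synthetic Gaussian data (class means and covariances chosen here for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two Gaussian classes with different covariances; QDA fits one
# ellipsoid (mean + covariance) per class.
mean0, mean1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
X0 = rng.multivariate_normal(mean0, [[1.0, 0.3], [0.3, 1.0]], size=500)
X1 = rng.multivariate_normal(mean1, [[1.0, -0.2], [-0.2, 0.5]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
# With enough points, the estimated class means approach the true centers;
# adding pseudo-labeled points plays the same role as "more points" here.
est_means = qda.means_
```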
A great description can be found there.
Where to start
This notebook, built on an anonymized binary classification dataset, provides a simple end-to-end pipeline showing how to apply pseudo labeling to improve your model's score.
Where to go further
Noisy samples as learning targets
Several recent consistency training methods learn to minimize prediction difference between the original unlabeled sample and its corresponding augmented version.
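The consistency objective itself is simple: penalize disagreement between the model's predictions on two views of the same unlabeled sample. A minimal sketch using a squared-difference loss (one common choice; the exact divergence varies by method):

```python
import numpy as np

def consistency_loss(p_orig, p_aug):
    """Mean squared difference between the predicted class distributions
    for a sample and its augmented version; training drives this toward
    zero so the two views agree."""
    return np.mean((p_orig - p_aug) ** 2)

# Predictions for the same unlabeled sample under two augmentations.
p_orig = np.array([0.7, 0.2, 0.1])
p_aug = np.array([0.6, 0.3, 0.1])
loss = consistency_loss(p_orig, p_aug)
```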
Label propagation
Label Propagation (Iscen et al. 2019) is an idea to construct a similarity graph among samples based on feature embedding. Then the pseudo labels are “diffused” from known samples to unlabeled ones where the propagation weights are proportional to pairwise similarity scores in the graph. Conceptually it is similar to a k-NN classifier and both suffer from the problem of not scaling up well with a large dataset.
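scikit-learn ships a classic graph-based variant of this idea, which makes the mechanics easy to see: labels diffuse from the few labeled points to their neighbors over a similarity graph. A minimal sketch (the cluster coordinates and `gamma` value are illustrative):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two tight clusters; one labeled point per cluster, the rest marked -1
# (scikit-learn's convention for unlabeled samples).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

# Labels diffuse over an RBF similarity graph; propagation weights are
# proportional to pairwise similarities, so nearby points share labels.
lp = LabelPropagation(kernel="rbf", gamma=1.0).fit(X, y)
pseudo = lp.transduction_  # labels inferred for every sample
```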
Reducing confirmation bias
Confirmation bias is a problem with incorrect pseudo labels provided by an imperfect teacher model. Overfitting to wrong labels may not give us a better student model.
To reduce confirmation bias, Arazo et al. (2019) proposed two techniques: applying Mixup with soft pseudo labels, and oversampling labeled data by requiring a minimum number of labeled samples in each mini-batch.
FixMatch
FixMatch (Sohn et al. 2020) generates pseudo labels on unlabeled samples with weak augmentation and only keeps predictions with high confidence. Here both weak augmentation and high confidence filtering help produce high-quality trustworthy pseudo label targets. Then FixMatch learns to predict these pseudo labels given a heavily-augmented sample.
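The unlabeled part of the FixMatch loss can be sketched in a few lines. This is a simplified sketch of the idea, not the paper's implementation (the 0.95 threshold matches the paper's default; the function name is ours):

```python
import numpy as np

def fixmatch_unlabeled_loss(p_weak, logp_strong, threshold=0.95):
    """FixMatch-style unlabeled loss (illustrative sketch).

    p_weak:      (n, c) softmax predictions on weakly augmented samples
    logp_strong: (n, c) log-softmax predictions on strongly augmented samples
    Only samples whose weak-view confidence reaches the threshold contribute;
    they are trained to predict the weak-view argmax as a hard pseudo label.
    """
    conf = p_weak.max(axis=1)
    mask = conf >= threshold                # high-confidence filter
    pseudo = p_weak.argmax(axis=1)          # hard pseudo labels from weak view
    ce = -logp_strong[np.arange(len(pseudo)), pseudo]  # cross entropy per sample
    return (ce * mask).sum() / len(pseudo)  # averaged over the unlabeled batch

# Two unlabeled samples; only the first passes the confidence filter.
p_weak = np.array([[0.97, 0.03], [0.55, 0.45]])
logp_strong = np.log(np.array([[0.9, 0.1], [0.5, 0.5]]))
loss = fixmatch_unlabeled_loss(p_weak, logp_strong)
```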
Combining with Powerful Pre-Training
It is a common paradigm, especially in language tasks, to first pre-train a task-agnostic model on a large unsupervised data corpus via self-supervised learning and then fine-tune it on the downstream task with a small labeled dataset. Research has shown that we can obtain extra gain if combining semi-supervised learning with pretraining.
You can read about these advanced techniques in more detail in this article.