Pseudo Labeling
Ivan Isaev
ML tech-lead and senior engineer | Ex-Head of ML & DS | Ex-Head of Engineering | Kaggle Competitions Master
Pseudo Labeling (Lee 2013) assigns fake labels to unlabeled samples based on the maximum softmax probabilities predicted by the current model and then trains the model on both labeled and unlabeled samples simultaneously in a pure supervised setup.
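The core mechanic can be sketched in a few lines: take the argmax of the predicted softmax probabilities as the label, optionally keeping only confident predictions. This is a minimal sketch (the 0.95 threshold and function name are illustrative, not from Lee 2013):

```python
import numpy as np

def make_pseudo_labels(probs, threshold=0.95):
    """Assign the argmax class as a pseudo label to every sample whose
    maximum softmax probability reaches the threshold.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    Returns (indices of confident samples, their pseudo labels).
    """
    confidence = probs.max(axis=1)   # max softmax probability per sample
    keep = confidence >= threshold   # keep only confident predictions
    return np.where(keep)[0], probs.argmax(axis=1)[keep]

# Example: three unlabeled samples; only the confident ones get pseudo labels.
probs = np.array([[0.98, 0.02],   # confident -> pseudo label 0
                  [0.60, 0.40],   # uncertain -> dropped
                  [0.01, 0.99]])  # confident -> pseudo label 1
idx, labels = make_pseudo_labels(probs, threshold=0.95)
```

The confident samples are then mixed into the labeled set and trained on with an ordinary supervised loss.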
Why could pseudo labels work? Pseudo labeling is in effect equivalent to Entropy Regularization (Grandvalet & Bengio 2004), which minimizes the conditional entropy of class probabilities for unlabeled data to favor low-density separation between classes. In other words, the predicted class probabilities are in fact a measure of class overlap, so minimizing the entropy is equivalent to reducing class overlap and thus encouraging low-density separation.
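To make the connection concrete: the entropy of a predicted distribution, H(p) = -Σ p log p, is large when the model hedges between classes and small when it commits to one. Training on hard pseudo labels pushes predictions toward one-hot distributions, which drives this entropy down. A minimal illustration:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of predicted class probabilities: -sum(p * log p)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

# A confident prediction has low entropy; an uncertain one has high entropy.
# Fitting to hard pseudo labels moves predictions from the latter to the former.
confident = np.array([0.99, 0.01])
uncertain = np.array([0.50, 0.50])
```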
Training with pseudo labeling naturally comes as an iterative process. We refer to the model that produces pseudo labels as teacher and the model that learns with pseudo labels as student.
How to apply pseudo labeling
Pseudo labeling is a 5 step process.
(1) Build a model using training data.
(2) Predict labels for an unseen test dataset.
(3) Add confident predicted test observations to our training data.
(4) Build a new model using combined data.
(5) Use your new model to predict the test data and submit to Kaggle. Here is a pictorial explanation using synthetic 2D data.
Why pseudo labeling works
How pseudo labeling works is best understood with QDA (Quadratic Discriminant Analysis). QDA works by using points in p-dimensional space to find hyper-ellipsoids, see here. With more points, QDA can better estimate the center and shape of each ellipsoid (and consequently make better predictions afterward).
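The intuition can be checked directly: QDA estimates a mean and covariance (the center and shape of an ellipsoid) per class, and with more points those estimates get closer to the truth. A small sketch with synthetic Gaussian data (class means and covariances chosen here for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two Gaussian classes with different covariances; QDA fits one
# ellipsoid (mean + covariance) per class.
mean0, mean1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
X0 = rng.multivariate_normal(mean0, [[1.0, 0.3], [0.3, 1.0]], size=500)
X1 = rng.multivariate_normal(mean1, [[1.0, -0.2], [-0.2, 0.5]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
# With enough points, the estimated class means approach the true centers;
# adding pseudo-labeled points plays the same role as "more points" here.
est_means = qda.means_
```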
A great description can be found there.
Where to start
This notebook, built on an anonymized binary classification dataset, provides a simple end-to-end pipeline showing how to apply pseudo labeling to improve your model's score.
Where to go further
Noisy samples as learning targets
Several recent consistency training methods learn to minimize prediction difference between the original unlabeled sample and its corresponding augmented version.
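The consistency objective itself is simple: penalize disagreement between the model's predictions on two views of the same unlabeled sample. A minimal sketch using a squared-difference loss (one common choice; the exact divergence varies by method):

```python
import numpy as np

def consistency_loss(p_orig, p_aug):
    """Mean squared difference between the predicted class distributions
    for a sample and its augmented version; training drives this toward
    zero so the two views agree."""
    return np.mean((p_orig - p_aug) ** 2)

# Predictions for the same unlabeled sample under two augmentations.
p_orig = np.array([0.7, 0.2, 0.1])
p_aug = np.array([0.6, 0.3, 0.1])
loss = consistency_loss(p_orig, p_aug)
```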
Label propagation
Label Propagation (Iscen et al. 2019) is an idea to construct a similarity graph among samples based on feature embedding. Then the pseudo labels are “diffused” from known samples to unlabeled ones where the propagation weights are proportional to pairwise similarity scores in the graph. Conceptually it is similar to a k-NN classifier and both suffer from the problem of not scaling up well with a large dataset.
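scikit-learn ships a classic graph-based variant of this idea, which makes the mechanics easy to see: labels diffuse from the few labeled points to their neighbors over a similarity graph. A minimal sketch (the cluster coordinates and `gamma` value are illustrative):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two tight clusters; one labeled point per cluster, the rest marked -1
# (scikit-learn's convention for unlabeled samples).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

# Labels diffuse over an RBF similarity graph; propagation weights are
# proportional to pairwise similarities, so nearby points share labels.
lp = LabelPropagation(kernel="rbf", gamma=1.0).fit(X, y)
pseudo = lp.transduction_  # labels inferred for every sample
```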
Reducing confirmation bias
Confirmation bias is a problem with incorrect pseudo labels provided by an imperfect teacher model. Overfitting to wrong labels may not give us a better student model.
To reduce confirmation bias, Arazo et al. (2019) proposed two techniques: applying Mixup with soft pseudo labels, and oversampling labeled data by requiring a minimum number of labeled samples in each mini-batch.
FixMatch
FixMatch (Sohn et al. 2020) generates pseudo labels on unlabeled samples with weak augmentation and only keeps predictions with high confidence. Here both weak augmentation and high confidence filtering help produce high-quality trustworthy pseudo label targets. Then FixMatch learns to predict these pseudo labels given a heavily-augmented sample.
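The unlabeled part of the FixMatch loss can be sketched in a few lines. This is a simplified sketch of the idea, not the paper's implementation (the 0.95 threshold matches the paper's default; the function name is ours):

```python
import numpy as np

def fixmatch_unlabeled_loss(p_weak, logp_strong, threshold=0.95):
    """FixMatch-style unlabeled loss (illustrative sketch).

    p_weak:      (n, c) softmax predictions on weakly augmented samples
    logp_strong: (n, c) log-softmax predictions on strongly augmented samples
    Only samples whose weak-view confidence reaches the threshold contribute;
    they are trained to predict the weak-view argmax as a hard pseudo label.
    """
    conf = p_weak.max(axis=1)
    mask = conf >= threshold                # high-confidence filter
    pseudo = p_weak.argmax(axis=1)          # hard pseudo labels from weak view
    ce = -logp_strong[np.arange(len(pseudo)), pseudo]  # cross entropy per sample
    return (ce * mask).sum() / len(pseudo)  # averaged over the unlabeled batch

# Two unlabeled samples; only the first passes the confidence filter.
p_weak = np.array([[0.97, 0.03], [0.55, 0.45]])
logp_strong = np.log(np.array([[0.9, 0.1], [0.5, 0.5]]))
loss = fixmatch_unlabeled_loss(p_weak, logp_strong)
```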
Combining with Powerful Pre-Training
It is a common paradigm, especially in language tasks, to first pre-train a task-agnostic model on a large unsupervised data corpus via self-supervised learning and then fine-tune it on the downstream task with a small labeled dataset. Research has shown that we can obtain extra gain if combining semi-supervised learning with pretraining.
You can read about these advanced techniques in more detail in this article.