The Next AI Revolution: Self-Supervised Learning
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
Babies learn how the world works by observation, with remarkably little interaction.
"Self-supervised learning is the cake, supervised learning is the icing on the cake, reinforcement learning is the cherry on the cake" Yann LeCun
Machine Learning
1- Supervised Learning: learning using data with fine-grained human-annotated labels for training.
However, data collection and annotation are usually expensive in both time and money. Additionally, in some domains the annotation process requires special skills. Semi-supervised, weakly supervised, and self-supervised learning methods have therefore been proposed to reduce that cost.
2- Semi-supervised Learning: learning using a small amount of labeled data in addition to a large amount of (easy to obtain) unlabeled data.
3- Weakly-supervised Learning: learning with coarse-grained (noisy or inaccurate) labels, which can be collected at much lower cost.
4- Unsupervised Learning: learning without using any annotations.
5- Self-supervised Learning: a subset of unsupervised learning in which models are explicitly trained with automatically generated labels.
Self-supervised learning empowers us to exploit a variety of labels that come with the data for free.
Transfer Learning
In deep learning, training a neural network starting from random weights is not an easy task. It is more practical to start from a model pre-trained on a source task and then fine-tune it on the target task. Accordingly, far less data (by some estimates, up to 1000x less) can be used for fine-tuning compared to starting from scratch.
Generally speaking, reusing a few early layers from a generic pre-trained model (one trained on ImageNet, for example) can improve both the training speed and the accuracy of the model. However, if the target task is not similar to the source task, the improvement is limited.
One solution could be Self-Supervised Learning, where a model is trained using labels that are part of the data itself without external (costly) labels.
Self-supervised learning is widely used in natural language processing (NLP), but much less in computer vision.
Self-supervised learning in computer vision
In self-supervised learning, the task used for pre-training is known as the “pretext task”. The tasks used for fine-tuning are known as the “downstream tasks”.
Usually, we don’t care much about the performance of the invented (source) task used for pre-training. Rather we care about the learned intermediate representation and hope that this representation can be beneficial to a variety of practical downstream (target) tasks. This is similar to how auxiliary tasks are treated.
For example, images can be rotated at random and a model trained to predict how each input image was rotated (the pretext task). This requires no annotations. The learned intermediate representations are then expected to be beneficial for the downstream tasks.
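To make the "free labels" idea concrete, here is a minimal sketch of how rotation labels could be generated. It assumes images are plain nested lists and that rotations are quarter turns (0, 90, 180 or 270 degrees); the names rotate90 and make_rotation_batch are made up for this illustration and do not come from any library.

```python
import random


def rotate90(image, k):
    """Rotate a 2D image (list of rows) counter-clockwise by k quarter turns."""
    rotated = [list(row) for row in image]
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated


def make_rotation_batch(images, rng):
    """Return (rotated_images, labels); each label is the number of quarter turns applied."""
    rotated, labels = [], []
    for image in images:
        k = rng.randrange(4)  # the label is generated automatically, no human annotation
        rotated.append(rotate90(image, k))
        labels.append(k)
    return rotated, labels


rng = random.Random(0)
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
inputs, labels = make_rotation_batch(images, rng)
```

A classifier trained on (inputs, labels) solves the pretext task; only its intermediate features are kept for the downstream tasks.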
Choosing a pretext task
The task should be something that, if solved, would require an understanding of the data that is also needed to solve the downstream task. Moreover, the pretext task should be something that a human could do based on that understanding. For example, a pretext task that generates a future frame (the next frame or next few frames) of a video is feasible, whereas generating a frame from the far future is not. Generally, there is no need to spend too much time creating the perfect pretext model. Learning multiple tasks at once (multi-task learning) is also possible.
Examples:
Many ideas have been proposed for self-supervised representation learning on images. A common workflow is to train a model on one or multiple pretext tasks with unlabelled images and then use one intermediate feature layer of this model to feed a multinomial logistic regression classifier on ImageNet classification. The final classification accuracy quantifies how good the learned representation is.
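The evaluation step of this workflow can be sketched as follows. The pretext-trained network is assumed to be frozen, so its intermediate features are just fixed vectors; the feature values below are invented toy numbers, and the helper names (train_linear_probe, predict, accuracy) are hypothetical. It is a bare-bones multinomial logistic regression trained by gradient descent, not a tuned implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score of each class for feature vector x."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit multinomial logistic regression on fixed features with per-example gradient steps."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss with respect to the score of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(weights, biases, features, labels):
    correct = sum(predict(weights, biases, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


# Toy stand-ins for features taken from one intermediate layer of a frozen model.
train_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8], [-1.0, -0.9], [-0.8, -1.0]]
train_labels = [0, 0, 1, 1, 2, 2]
test_features = [[0.8, 0.2], [0.2, 0.9], [-0.9, -0.7]]
test_labels = [0, 1, 2]

weights, biases = train_linear_probe(train_features, train_labels, num_classes=3)
probe_accuracy = accuracy(weights, biases, test_features, test_labels)
```

Because only the linear classifier is trained, a higher probe_accuracy can be attributed to the features themselves rather than to the classifier.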
1- Rotation of an image (Gidaris et al., 2018) is a cheap way to modify an input image while the semantic content stays unchanged. Each input image is first rotated by a random multiple of 90°. Then, the model is trained to predict which rotation has been applied, which makes this a 4-class classification problem.
2- The denoising autoencoder (Vincent et al., 2008) learns to recover an image from a version that is partially corrupted or has random noise added. As mentioned in the paper: "As we increase the noise level, denoising training forces the filters to differentiate more, and capture more distinctive features. Higher noise levels tend to induce less local filters, as expected. One can distinguish different kinds of filters, from local blob detectors, to stroke detectors, and some full character detectors at the higher noise levels."
3- Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016), where the network is trained to fill in a missing piece of the image. This work presents an unsupervised visual feature learning algorithm in which a context encoder is a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. To succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
4- Predicting the relative position of random patches from one image (Doersch et al., 2015). As mentioned in the paper: "Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts."
5- Validating frame order (Misra et al., 2016). The pretext task is to determine whether a sequence of frames from a video is placed in the correct temporal order. As mentioned in the paper: "A video imposes a natural temporal structure for visual data. In many cases, one can easily verify whether frames are in the correct temporal order (shuffled or not). Such a simple sequential verification task captures important spatiotemporal signals in videos. We use this task for unsupervised pre-training of a Convolutional Neural Network (CNN)."
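What the pretext tasks in this list share is that the training pair (input, target) is built from the data alone. The sketch below shows one simplified way to build such pairs for denoising, inpainting, relative patch position and frame order. These are illustrations under strong assumptions (images as nested lists, zero used as the "blank" value, a fixed 3x3 patch grid, three-frame clips), not the procedures used in the cited papers, and every function name is hypothetical.

```python
import random


def corrupt(image, drop_prob, rng):
    """Denoising: zero out each pixel with probability drop_prob; the clean image is the target."""
    noisy = [[0 if rng.random() < drop_prob else px for px in row] for row in image]
    return noisy, image


def crop(image, top, left, size):
    """Return a size x size patch whose top-left corner is at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def mask_region(image, top, left, size):
    """Inpainting: blank a size x size square; the removed pixels are the target."""
    target = crop(image, top, left, size)
    masked = [list(row) for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = 0
    return masked, target


# Offsets (row, col) of the eight neighbours of a centre patch; the list index is the class label.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def relative_position_pair(image, patch, rng):
    """Relative position: centre patch of a 3x3 grid plus one random neighbour and its label."""
    label = rng.randrange(len(NEIGHBOUR_OFFSETS))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    centre = crop(image, patch, patch, patch)
    neighbour = crop(image, patch + d_row * patch, patch + d_col * patch, patch)
    return centre, neighbour, label


def frame_order_example(frames, rng):
    """Frame order: return a three-frame clip and 1 if it is in temporal order, else 0."""
    start = rng.randrange(len(frames) - 2)
    clip = frames[start:start + 3]
    if rng.random() < 0.5:
        return clip, 1
    return [clip[1], clip[0], clip[2]], 0  # swap the first two frames to break the order


rng = random.Random(0)
image = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6x6 "image"
noisy, clean = corrupt(image, 0.3, rng)
masked, missing = mask_region(image, top=2, left=2, size=2)
centre, neighbour, position = relative_position_pair(image, patch=2, rng=rng)
clip, in_order = frame_order_example(["f0", "f1", "f2", "f3", "f4"], rng)
```

In each case a network would be trained to map the first element of the pair to the target or label, and no human annotation is involved at any point.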
References and further readings:
Course SSL
Best Regards