Course: Self-Supervised Machine Learning

The need for self-supervised learning

- [Instructor] As we hope to gather more and more information from the data we have available, self-supervised learning is becoming more important. But what exactly is self-supervised learning, and how does it help us? Let's start with a definition. It's quite a mouthful, but I'm going to break it down: self-supervised learning is a learning technique which obtains supervisory signals from the data itself. Self-supervised learning techniques are a part of unsupervised learning in that they work with unlabeled data. They're able to look at unlabeled data and extract supervisory signals to set up a supervised task, starting from nothing but that unlabeled data. Self-supervised learning works on the hypothesis that the data itself carries a lot of information. If you're able to leverage the underlying structure in the data, you should be able to create a supervised learning task from that unlabeled data in an automated manner, in order to learn representations for that data.

Now that we have a definition for self-supervised learning, let's take a step back and understand the context and the importance of learning from unlabeled data. By recent estimates, we produce about 2.5 quintillion bytes of data every day. That's a huge amount, and almost all of it is raw, unlabeled data; data doesn't come labeled in its raw form. It turns out that most machine learning models are not set up to take advantage of unlabeled data. Most machine learning models are discriminative models: they are trained to predict labels, either a class or category to which a particular data point belongs, or a continuous value associated with a record. You're likely aware that building discriminative models requires supervised learning techniques, and supervised learning, by its nature, requires labeled data. That means somebody has to go in and label all of the available data before it can become useful for machine learning, which in turn means most of the data in the world is not suitable for the most common ML models, such as classification or regression models.

This turns out to be a huge deal, because getting labeled data is very, very hard. It involves actually curating and collecting the data that is interesting, and the labeling process itself is extremely resource intensive: it's often manual, and it has to be crowdsourced before it can achieve any level of scale. It's also massively time consuming and expensive. This means that if you limit learning for your ML models to labeled data, you are essentially giving up lots of opportunities, because most of the data in the world, as we discussed, is unlabeled.

For example, if you've worked with computer vision models, you have likely heard of the ImageNet dataset, the gold standard for computer vision problems. This dataset has 14 million labeled images, and it took some 22 years of human effort to create. To put things in context, there are only about 1 million images out there with category and bounding box annotations, that is, only 1 million or so images with the right labels to work for object detection problems. The 14 million images of the ImageNet dataset carry only image-level annotations. Meanwhile, there are on the order of 1 trillion photos on the internet, and even more photos in the real world. If labeled data is all that we can use, there is this entire world that we cannot learn from. So this ends up being a real problem, and it's why researchers started looking for alternate labeling processes for their data.
These are labels that are automatically available along with the data, such as hashtags on social media posts, or the GPS location of an image, which is sometimes embedded in the image file itself. Researchers are also actively working on machine learning techniques that can automate the labeling process; the Snorkel framework from Stanford comes to mind in this particular category. Well, the last option is: you already have the data, so why can't you learn from the data itself? And this is exactly why self-supervised learning is becoming increasingly important. Self-supervised learning can learn representations from unlabeled data, and these learned representations can be used in a number of different machine learning tasks.

So what's the fundamental idea behind self-supervised learning? Well, if you have raw data with no labels, that data alone can be used as the input; the input to a self-supervised model is just data, with no labels. The model then splits this input into two parts. One is the observed part, the part it can see, which makes up the X variables. The second is the hidden part; this is the Y variable that the model has to predict. The self-supervised model is set up in such a way that it uses the observed portion of the data to predict the hidden portion of the data. And in the process of trying to predict the hidden part from the observed part, self-supervised learning learns latent representations of the underlying data. Self-supervised learning thus predicts the unobserved or hidden part of the input from the observed or known part of the input.
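To make that idea concrete, here is a minimal sketch of the observed-versus-hidden setup in Python. Everything in it is an illustrative assumption rather than code from the course: the toy sequences, the choice to hide the last value of each sequence as the Y variable, and the use of ordinary least squares as a stand-in for a real neural network.

```python
# A minimal sketch of the self-supervised setup described above,
# using only NumPy. The data and model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# "Raw" unlabeled data: sequences with internal structure (a noisy
# linear trend). No human-provided labels come with this data.
n_samples, seq_len = 1000, 8
slopes = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
t = np.arange(seq_len)
data = slopes * t + rng.normal(scale=0.1, size=(n_samples, seq_len))

# The pretext task: split each record into an observed part (X)
# and a hidden part (Y) that the model must predict.
X = data[:, :-1]   # observed: the first 7 values of each sequence
Y = data[:, -1]    # hidden: the last value, used as the label

# Fit a simple linear model to predict the hidden part from the
# observed part (least squares stands in for a deep network here).
X_b = np.hstack([X, np.ones((n_samples, 1))])  # add a bias column
w, *_ = np.linalg.lstsq(X_b, Y, rcond=None)

# The learned weights capture (very crudely) the structure in the
# data, obtained without any human labeling effort.
pred = X_b @ w
print("mean squared error on the pretext task:", np.mean((pred - Y) ** 2))
```

In a real system, the least-squares step would be a deep network trained by gradient descent, and the hidden part might be a masked word, an image patch, or a future video frame. The shape of the task, though, is the same: predict Y from X, where both come from the same unlabeled record.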
