Yann LeCun: "Energy-Based Self-Supervised Learning"
Yann LeCun gave a talk at the Institute for Pure and Applied Mathematics in December 2019 entitled “Energy-Based Self-Supervised Learning”, a preview of the talk he later gave at ICLR 2020. The 2019 talk is on YouTube (https://www.youtube.com/watch?v=A7AnCvYDQrU&t=45s); I watched it and wrote up the notes below.
A key point is that machine learning's sample efficiency is much, much lower than that of humans or even animals. Supervised learning is quite inefficient, often requiring huge amounts of data and massive power consumption to train state-of-the-art models. Reinforcement learning is even worse! Yann gave the example that it took 20 million self-play games of Go, running on 2,000 GPUs for 14 days, to learn the game at a human level. (See this recent post from me: https://www.dhirubhai.net/posts/blainebateman_ai21-labs-asks-how-much-does-it-cost-to-activity-6662131065588633600-eHj_)
To master Atari games, it takes 83 hours of machine self-play to reach the performance most humans achieve in 15 minutes. So the question is: how do humans and animals learn so quickly?
Yann said a key part of this is learning by observation. Children learn world models with very little input: babies learn face tracking early on, and by about nine months infants have acquired “intuitive physics”. By that age, Yann said, a child can predict what should happen, for example that things fall if you drop them. He then said, “Prediction is the essence of intelligence, in my opinion.” That was the segue into self-supervised learning.
He described self-supervised learning as learning to “predict any part of the input from any other part”: for example, the future from the past, or the masked from the visible. He then talked about the Transformer model introduced by Google Brain, which uses hundreds of millions of parameters and is trained on billions of words. As an example of what such a model can do, you can remove a random 10% of a text and the model will predict the missing words. An interesting aspect is that the prediction is a probability vector over ALL the words in the vocabulary. Yann pointed out that this strategy works well on NLP problems but much less well on images and other problems.
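To make “predict the masked from the visible” concrete, here is a minimal sketch (mine, not from the talk) of masked-token prediction over a toy vocabulary. The random embeddings and context vector stand in for a trained encoder such as a Transformer; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a sentence with one token masked out.
vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
sentence = ["the", "cat", "[MASK]", "on", "the", "mat"]
mask_pos = sentence.index("[MASK]")

# Stand-in for a trained encoder: random word embeddings and a random
# "context" vector. A real model (e.g. a Transformer) would compute the
# context at mask_pos from the visible tokens around it.
embed_dim = 8
embeddings = rng.normal(size=(len(vocab), embed_dim))
context = rng.normal(size=embed_dim)

# Score every vocabulary word against the context, then softmax:
# the output is a probability vector over ALL words, which is the
# property Yann highlights for NLP-style self-supervision.
scores = embeddings @ context
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print("P(word | context):", dict(zip(vocab, np.round(probs, 3))))
print("Predicted fill for position", mask_pos, "->", vocab[int(np.argmax(probs))])
```

The point of the sketch is only the shape of the output: a distribution over a finite vocabulary is easy to represent, which is why this recipe works so well for text and less well for continuous data like images.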
Yann then introduced the energy-based paradigm for self-supervised learning. By way of introduction, some key quotes from him: “You DO NOT want to learn distributions! They are bad for you.” “Maximum likelihood sucks!” And “Applying probability theory blindly actually is bad for you.” The intuition for energy models goes like this: say you have something you want to predict. You design an energy function so that energy is low where a prediction is “good” (compatible with the data) and high where it is “bad”. With reference to the image, if the dotted path is what you are trying to predict, you want energy to be low along that path and high everywhere else. In other words, training finds a function that pushes energy “down” along the path and “up” everywhere else, and this picture generalizes nicely to the high-dimensional spaces that matter in most real problems.
Credit: Yann LeCun, YouTube, Dec. 2019
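As a toy illustration of the push-down/push-up idea (my own sketch, not LeCun's actual training procedure), the following uses a simple quadratic energy E(x, y) = ||y - Wx||^2, does gradient descent on the energy of observed (x, y) pairs, and gradient ascent on corrupted pairs while their energy is still below a margin. The hinge rule and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative energy function: E(x, y) = ||y - W x||^2.
# Low energy should mean "y is a good prediction for x".
def energy(W, x, y):
    return float(np.sum((y - W @ x) ** 2))

# Toy data: y is a fixed linear function of x plus a little noise.
A_true = rng.normal(size=(2, 3))
X = rng.normal(size=(100, 3))
Y = X @ A_true.T + 0.05 * rng.normal(size=(100, 2))

W = rng.normal(size=(2, 3))
lr, margin = 0.01, 1.0

for epoch in range(200):
    for x, y_good in zip(X, Y):
        y_bad = y_good + rng.normal(size=2)   # a contrastive "bad" point
        # Push energy DOWN on the observed pair (gradient descent on E)...
        grad_down = -2 * np.outer(y_good - W @ x, x)
        W -= lr * grad_down
        # ...and push it UP on the bad pair (gradient ascent on E),
        # but only while its energy is still below the margin.
        if energy(W, x, y_bad) < margin:
            grad_up = -2 * np.outer(y_bad - W @ x, x)
            W += lr * grad_up

print("energy on a good pair:     ", round(energy(W, X[0], Y[0]), 4))
print("energy on a corrupted pair:", round(energy(W, X[0], Y[0] + 1.0), 4))
```

In this toy run the learned energy ends up low on pairs drawn from the data and noticeably higher on corrupted pairs, which is exactly the “low along the path, high everywhere else” picture from the figure.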
The goal here is to learn things without labeling. Yann quoted Jitendra Malik: “Labels are the opium of the machine learning researcher.” A good portion of the remainder of the talk described how you actually train such a model, and I won’t review that, as a lot of the math is over my head. But consider that today an entire industry has sprung up to label data for people training models. If you are in that industry, the promise of energy-based models trained via self-supervision is clearly disruptive. It is rare to get such clear advance warning of an industry disruption!
A reader commented: "I am a strong supporter of Professor LeCun's arguments in favor of EBMs. I would like to ask your opinion on promising convergence with, or inspiration from, adiabatic optimization toward a breakthrough in EBMs. When I learnt about adiabatic optimization and quantum annealing, I immediately had the feeling that this is the right direction to take up research efforts in."