Memento Learning: How OpenAI Created AI Agents that can Learn by Going Backwards
Jesus Rodriguez
CEO of IntoTheBlock, Co-Founder of LayerLens, Faktory, and NeuralFabric, Founder of The Sequence AI Newsletter, Guest Lecturer at Columbia, Guest Lecturer at Wharton Business School, Investor, Author.
This Thanksgiving holiday, I wanted to look back at one of the most groundbreaking AI developments of last year.
Nineteen years ago, Christopher Nolan directed Memento, one of his first films and one of the most influential in the history of cinema. Memento broke many of the traditional paradigms of filmmaking by interleaving two parallel narratives, one moving chronologically backwards and one moving forward. That novel narrative structure forces the audience to constantly reevaluate their knowledge of the plot, learning small new details every few minutes of the film. It turns out that replaying a sequence of knowledge backwards, in small time intervals, is an incredibly captivating method of learning. Intuitively, the Memento form of learning seems perfect for AI agents. Last year, researchers from OpenAI leveraged that learning methodology to create AI agents that learned to play Montezuma’s Revenge using a single demonstration.
Montezuma’s Revenge is a game often used in reinforcement learning demonstrations because it requires the player to take a complex sequence of actions before producing any meaningful results. Typically, approaches to solving Montezuma’s Revenge have used complex forms of reinforcement learning, such as Deep Q-Learning, that require large collections of gameplay videos to train agents on the different game strategies. Those large datasets are incredibly hard to acquire and curate. To address that challenge, the team at OpenAI used a relatively new discipline in reinforcement learning known as one-shot learning.
One-Shot Reinforcement Learning
Conceptually, one-shot (or few-shot) learning is a form of reinforcement learning in which agents learn a policy from relatively few iterations of experience. One-shot learning aims to resemble some of the characteristics of human cognition that allow us to master a new task from a relatively small base of knowledge. However, as appealing as one-shot learning sounds, it turns out to be incredibly difficult to implement, as it is very vulnerable to the famous exploration-exploitation dilemma of reinforcement learning.
Traditional reinforcement learning theory is based on two fundamental families of methods: model-based and model-free. Model-based methods focus on learning as much as possible about a given environment and creating a model to represent it. From that perspective, model-based methods typically rely on that learned model to figure out the correct actions to take in any given state of the environment. Model-free methods, by contrast, ignore the inner workings of the environment and instead focus directly on learning a policy that produces the best outcome.
Most of the best-known model-free methods, such as policy gradients or Q-Learning, take random actions in an environment and reinforce those actions that lead to a positive outcome. While this idea sounds relatively trivial, it only works if the set of rewards is dense enough that random actions can be mapped to rewards. But what happens when an agent needs to take a long sequence of actions without experiencing any specific reward? In reinforcement learning theory, this is known as the exploration-exploitation dilemma, in which an agent needs to constantly balance short-term rewards against further exploration that could lead to bigger rewards. In the case of one-shot learning, the small training set forces the agent to master exploration in order to learn the target task.
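To make that dependence on dense rewards concrete, here is a minimal, hypothetical sketch: a toy "chain" task solved with tabular Q-learning (the environment, its length, and all hyperparameters are illustrative assumptions, not anything from the OpenAI work). When every correct step pays a reward, the agent learns easily; when only the final step pays, random exploration almost never finds the reward.

```python
import numpy as np

def train(dense_reward, n=12, n_actions=4, episodes=3000, seed=0):
    """Tabular Q-learning on a toy 'chain' task: the agent must pick the
    right action n times in a row; one wrong move ends the episode.
    dense_reward=True pays +1 for every correct step, while
    dense_reward=False pays +1 only at the very end (sparse reward)."""
    rng = np.random.default_rng(seed)
    correct = rng.integers(n_actions, size=n)   # the right action per state
    Q = np.zeros((n + 1, n_actions))            # state n is terminal
    finishes = 0
    for _ in range(episodes):
        state = 0
        while state < n:
            # epsilon-greedy action selection
            if rng.random() < 0.1:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[state].argmax())
            if a == correct[state]:
                nxt = state + 1
                reward = 1.0 if (dense_reward or nxt == n) else 0.0
                Q[state, a] += 0.1 * (reward + 0.99 * Q[nxt].max() - Q[state, a])
                state = nxt
                if state == n:
                    finishes += 1
            else:
                # wrong move: episode ends with no reward (terminal update)
                Q[state, a] += 0.1 * (0.0 - Q[state, a])
                break
    return finishes

print("dense rewards :", train(dense_reward=True), "completed episodes")
print("sparse rewards:", train(dense_reward=False), "completed episodes")
```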
Let’s put this challenge in context by looking at Montezuma’s Revenge. In the game, the player needs to execute a long sequence of actions before obtaining any meaningful outcome, such as getting a key.
For instance, the probability of obtaining the key can be decomposed using the following formula:
p(get key) = p(get down ladder 1) * p(get down rope) * p(get down ladder 2) * p(jump over skull) * p(get up ladder 3).
That formula clearly illustrates the biggest challenge of one-shot learning models. The probability of obtaining the reward decays exponentially with N, the number of actions it takes to reach it, so the expected exploration effort grows as exp(N). Any model with this kind of exponential scaling is incredibly hard to apply in practical scenarios. How did OpenAI solve this challenge?
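As a rough back-of-the-envelope illustration (the 10% per-step success rate below is an assumed number, not a measurement from the game), the chained probabilities shrink exponentially with the number of required steps:

```python
# Illustrative only: assume random exploration completes each sub-task
# (get down the ladder, jump the skull, ...) with probability 0.1.
p_step = 0.1

for n_steps in (1, 3, 5, 10):
    p_reward = p_step ** n_steps   # p(get key) = product of per-step probs
    print(f"{n_steps:2d} steps -> p(reward) ~ {p_reward:.0e}")
# The chance of ever stumbling onto the reward decays exponentially in the
# number of steps, which is why purely random exploration stalls.
```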
Creating a Memento Effect with Backward Demonstrations
OpenAI’s key insight is that while model-free reinforcement learning models struggle to orchestrate long sequences of actions, they are incredibly effective with shorter ones. OpenAI decomposed a single demonstration of Montezuma’s Revenge into a group of episodes, each one representing a specific task the agent needed to learn. Cleverly, they did this by going backwards in time.
The OpenAI approach to one-shot learning starts by training the agent on an episode that begins almost at the end of the demonstration. Once the agent can beat the demonstrator on the remaining part of the game, the training mechanism rolls the starting point of the episode back in time, like Memento’s black-and-white track. The model keeps doing this until the agent is able to play from the start of the game.
Consider, for example, the point in the game where the AI agent finds itself halfway up the ladder that leads to the key. Once it learns to climb the ladder from there, the model can have it start at the point where it needs to jump over the skull. After it learns to do that, the model can have it start on the rope leading to the floor of the room, and so on. Eventually, the agent starts from the original starting state of the game and is able to reach the key entirely by itself.
By creating sub-training episodes that roll the state of the demonstration back in time, the model decomposes one large exploration problem into a series of small, easy-to-solve exploration problems. More importantly, this approach moves the complexity of the problem from exponential to linear: if a specific sequence of N actions is required to reach a reward, that sequence can now be learned in time that is linear in N.
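A minimal sketch of that backward curriculum, assuming an Atari-style environment whose emulator state can be saved and restored, is shown below. The `demo_states` snapshots, the `train_episode` helper, the success threshold, and the batch size are hypothetical placeholders for illustration, not OpenAI’s released code.

```python
import numpy as np

def backward_curriculum(env, demo_states, policy, train_episode,
                        success_threshold=0.2, batch_size=50):
    """Reverse-curriculum training from a single demonstration (sketch).

    demo_states : emulator snapshots saved while replaying the demo,
                  ordered from the start of the game to just before the key.
    train_episode(env, policy, start_state) -> True if the rollout from
                  that snapshot matched or beat the demonstrator's score.
    Both `demo_states` and `train_episode` are hypothetical placeholders.
    """
    # Begin almost at the end of the demonstration, where the remaining
    # exploration problem is short and easy for a model-free learner.
    start_idx = len(demo_states) - 1
    while start_idx >= 0:
        results = [train_episode(env, policy, demo_states[start_idx])
                   for _ in range(batch_size)]
        # Once the agent handles the tail of the game from this snapshot,
        # roll the starting point further back in time (the "Memento" step).
        if np.mean(results) >= success_threshold:
            start_idx -= 1
    return policy
```

In OpenAI’s actual system the underlying learner is the PPO variant mentioned below and the rollback criterion is defined over batches of rollouts, but the loop above captures the Memento-style structure: master the tail of the demonstration first, then move the starting point earlier and earlier.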
OpenAI’s one-shot learning approach to Montezuma’s Revenge is incredibly clever. The specific reinforcement learning algorithm used is a variant of Proximal Policy Optimization (PPO), the same method behind the famous OpenAI Five system. In initial tests, the model scored 74,500 points using a single demonstration. The code and demos are available on GitHub. In the few months since its initial release, the learning method pioneered by the Montezuma’s Revenge agent has served as inspiration for other learning approaches that resemble the Memento model.