Memento Learning: How OpenAI Created AI Agents that can Learn by Going Backwards
Jesus Rodriguez
CEO of IntoTheBlock, Co-Founder of LayerLens, Faktory, and NeuralFabric, Founder of The Sequence AI Newsletter, Guest Lecturer at Columbia, Guest Lecturer at Wharton Business School, Investor, Author.
This Thanksgiving holiday, I wanted to look back at one of the most groundbreaking AI developments of last year.
Nineteen years ago, Christopher Nolan directed Memento, one of his first films and one of the most influential in the history of cinema. Memento broke many of the traditional paradigms of filmmaking by interleaving two parallel narratives, one moving chronologically backwards and one moving forward. That novel narrative structure forces the audience to constantly reevaluate their knowledge of the plot, learning small new details every few minutes of the film. It turns out that replaying a sequence of knowledge backwards, in small time intervals, is an incredibly captivating method of learning. Intuitively, the Memento form of learning seems perfect for AI agents. Last year, researchers from OpenAI leveraged that learning methodology to create AI agents that learned to play Montezuma’s Revenge using a single demonstration.
Montezuma’s Revenge is a game often used in reinforcement learning demonstrations because it requires the player to take a complex sequence of actions before producing any meaningful results. Typically, approaches to solving Montezuma’s Revenge have used complex forms of reinforcement learning, such as Deep Q-Learning, that require large collections of gameplay videos to train agents on the different game strategies. Those large datasets are incredibly hard to acquire and curate. To address that challenge, the team at OpenAI used a relatively new discipline in reinforcement learning known as one-shot learning.
One-Shot Reinforcement Learning
Conceptually, one-shot (or few-shot) learning is a form of reinforcement learning in which agents learn a policy from relatively few iterations of experience. One-shot learning aims to resemble some of the characteristics of human cognition that allow us to master a new task from a relatively small base of knowledge. However, as appealing as one-shot learning sounds, it turns out to be incredibly difficult to implement, as it is very vulnerable to the famous exploration-exploitation dilemma of reinforcement learning.
Traditional reinforcement learning theory is based on two fundamental families of methods: model-based and model-free. Model-based methods focus on learning as much as possible about a given environment and creating a model to represent it. From that perspective, model-based methods typically rely on that learned model to figure out the correct actions to take in any given state of the environment. Model-free methods, by contrast, ignore the inner workings of the environment and instead focus directly on learning a policy that produces the best outcome.
Most of the best-known model-free methods, such as policy gradients or Q-Learning, take random actions in an environment and reinforce those actions that lead to a positive outcome. While this idea sounds relatively trivial, it only works if the set of rewards is dense enough that random actions can be mapped to rewards. But what happens when an agent needs to take a long sequence of actions without experiencing any specific reward? In reinforcement learning theory, this is known as the exploration-exploitation dilemma, in which an agent needs to constantly balance short-term rewards against further exploration that could lead to bigger rewards. In the case of one-shot learning, the small training set forces the agent to master exploration in order to learn the target task.
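To make that dependence on dense rewards concrete, here is a minimal, hypothetical sketch: a toy "chain" task solved with tabular Q-learning (the environment, its length, and all hyperparameters are illustrative assumptions, not anything from the OpenAI work). When every correct step pays a reward, the agent learns easily; when only the final step pays, random exploration almost never finds the reward.

```python
import numpy as np

def train(dense_reward, n=12, n_actions=4, episodes=3000, seed=0):
    """Tabular Q-learning on a toy 'chain' task: the agent must pick the
    right action n times in a row; one wrong move ends the episode.
    dense_reward=True pays +1 for every correct step, while
    dense_reward=False pays +1 only at the very end (sparse reward)."""
    rng = np.random.default_rng(seed)
    correct = rng.integers(n_actions, size=n)   # the right action per state
    Q = np.zeros((n + 1, n_actions))            # state n is terminal
    finishes = 0
    for _ in range(episodes):
        state = 0
        while state < n:
            # epsilon-greedy action selection
            if rng.random() < 0.1:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[state].argmax())
            if a == correct[state]:
                nxt = state + 1
                reward = 1.0 if (dense_reward or nxt == n) else 0.0
                Q[state, a] += 0.1 * (reward + 0.99 * Q[nxt].max() - Q[state, a])
                state = nxt
                if state == n:
                    finishes += 1
            else:
                # wrong move: episode ends with no reward (terminal update)
                Q[state, a] += 0.1 * (0.0 - Q[state, a])
                break
    return finishes

print("dense rewards :", train(dense_reward=True), "completed episodes")
print("sparse rewards:", train(dense_reward=False), "completed episodes")
```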
Let’s put this challenge in context by looking at Montezuma’s Revenge. In the game, the player needs to execute a long sequence of actions before obtaining any meaningful outcome, such as getting a key.
For instance, the probability of obtaining the key can be decomposed using the following formula:
p(get key) = p(get down ladder 1) * p(get down rope) * p(get down ladder 2) * p(jump over skull) * p(get up ladder 3).
That formula clearly illustrates the biggest challenge of one-shot learning models. The probability of obtaining the reward decays exponentially with N, the number of actions it takes to reach it, so the expected exploration effort grows as exp(N). Any model with this kind of exponential scaling is incredibly hard to apply in practical scenarios. How did OpenAI solve this challenge?
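As a rough back-of-the-envelope illustration (the 10% per-step success rate below is an assumed number, not a measurement from the game), the chained probabilities shrink exponentially with the number of required steps:

```python
# Illustrative only: assume random exploration completes each sub-task
# (get down the ladder, jump the skull, ...) with probability 0.1.
p_step = 0.1

for n_steps in (1, 3, 5, 10):
    p_reward = p_step ** n_steps   # p(get key) = product of per-step probs
    print(f"{n_steps:2d} steps -> p(reward) ~ {p_reward:.0e}")
# The chance of ever stumbling onto the reward decays exponentially in the
# number of steps, which is why purely random exploration stalls.
```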
Creating a Memento Effect with Backward Demonstrations
OpenAI’s key insight is that while model-free reinforcement learning models struggle to orchestrate long sequences of actions, they are incredibly effective with shorter ones. OpenAI decomposed a single demonstration of Montezuma’s Revenge into a group of episodes, each one representing a specific task the agent needed to learn. Cleverly, they did this by going backwards in time.
The OpenAI approach to one-shot learning starts by training the agent on an episode that begins almost at the end of the demonstration. Once the agent can beat the demonstrator on the remaining part of the game, the training mechanism rolls the starting point of the episode back in time, like Memento’s black-and-white track. The model keeps doing this until the agent is able to play from the start of the game.
Consider, for example, the point in the game where the AI agent finds itself halfway up the ladder that leads to the key. Once it learns to climb the ladder from there, the model can have it start at the point where it needs to jump over the skull. After it learns to do that, the model can have it start on the rope leading to the floor of the room, and so on. Eventually, the agent starts from the original starting state of the game and is able to reach the key entirely by itself.
By creating sub-training episodes that roll the state of the demonstration back in time, the model decomposes one large exploration problem into a series of small, easy-to-solve exploration problems. More importantly, this approach moves the complexity of the problem from exponential to linear: if a specific sequence of N actions is required to reach a reward, that sequence can now be learned in time that is linear in N.
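A minimal sketch of that backward curriculum, assuming an Atari-style environment whose emulator state can be saved and restored, is shown below. The `demo_states` snapshots, the `train_episode` helper, the success threshold, and the batch size are hypothetical placeholders for illustration, not OpenAI’s released code.

```python
import numpy as np

def backward_curriculum(env, demo_states, policy, train_episode,
                        success_threshold=0.2, batch_size=50):
    """Reverse-curriculum training from a single demonstration (sketch).

    demo_states : emulator snapshots saved while replaying the demo,
                  ordered from the start of the game to just before the key.
    train_episode(env, policy, start_state) -> True if the rollout from
                  that snapshot matched or beat the demonstrator's score.
    Both `demo_states` and `train_episode` are hypothetical placeholders.
    """
    # Begin almost at the end of the demonstration, where the remaining
    # exploration problem is short and easy for a model-free learner.
    start_idx = len(demo_states) - 1
    while start_idx >= 0:
        results = [train_episode(env, policy, demo_states[start_idx])
                   for _ in range(batch_size)]
        # Once the agent handles the tail of the game from this snapshot,
        # roll the starting point further back in time (the "Memento" step).
        if np.mean(results) >= success_threshold:
            start_idx -= 1
    return policy
```

In OpenAI’s actual system the underlying learner is the PPO variant mentioned below and the rollback criterion is defined over batches of rollouts, but the loop above captures the Memento-style structure: master the tail of the demonstration first, then move the starting point earlier and earlier.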
OpenAI’s one-shot learning approach to Montezuma’s Revenge is incredibly clever. The specific reinforcement learning algorithm used is a variant of Proximal Policy Optimization (PPO), the same method behind the famous OpenAI Five system. In initial tests, the model scored 74,500 points using a single demonstration. The code and demos are available on GitHub. In the few months since its initial release, the learning method pioneered by the Montezuma’s Revenge agent has served as inspiration for other learning approaches that resemble the Memento model.