Dancing With the Skulls
Working in the field of Artificial Intelligence (AI) is full of exciting moments, and one of the most exciting for me was watching an AI agent dance with skulls in the notoriously challenging game Montezuma's Revenge. This was part of OpenAI's breakthrough combining Random Network Distillation (RND) with Reinforcement Learning (RL), published back in 2018 in a paper called “Exploration by Random Network Distillation”.
Let me first walk through RL and the challenges that led OpenAI to come up with RND. In simple words, RL is a computational approach to goal-directed learning from interaction, where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward over time. The process involves the agent observing the state of the environment, taking an action that changes that state, and receiving a reward that guides future actions toward long-term goals.
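To make that loop concrete, here is a minimal sketch of the observe-act-reward cycle using the Gymnasium API. The CartPole environment and the random action choice are just placeholders standing in for a real task and a learned policy:

```python
import gymnasium as gym

# Any environment works here; CartPole is just a small placeholder example.
env = gym.make("CartPole-v1")

state, info = env.reset(seed=0)
total_reward = 0.0

for step in range(200):
    # A real agent would choose actions from a learned policy;
    # here we sample randomly to illustrate the interaction loop.
    action = env.action_space.sample()

    # The environment returns the next state and a reward signal
    # that the agent tries to maximize over time.
    next_state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    state = next_state

    if terminated or truncated:
        state, info = env.reset()

env.close()
print(f"Cumulative reward collected: {total_reward}")
```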
RL can solve a variety of complex, goal-directed problems where an agent must learn to make a sequence of decisions through interaction with its environment. Key examples include DeepMind's AlphaGo Zero, which learned to play Go and outperform the human world champion (I really recommend watching the documentary AlphaGo), and TD-Gammon, which achieved superhuman performance in backgammon. RL has also been used to play Atari arcade games from pixel inputs and to train robotic agents for competitions like RoboCup. RL methods are useful for any problem that requires sequential decision-making to achieve a goal.
Many algorithms have been used for RL, such as Monte Carlo, Dynamic Programming, and Temporal-Difference methods (Sarsa, Q-Learning, and Dyna-Q) for tabular environments, and Deep RL methods such as DQN, REINFORCE, and DDPG for environments with large or continuous state spaces. Since Atari arcade games present high-dimensional pixel observations, DQN was widely used to train agents by feeding the raw pixels into a Convolutional Neural Network (CNN). DQN achieved superhuman performance in many games and below-human performance in others, but it scored zero points in Montezuma's Revenge!
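As a rough sketch of what that looks like in code (using PyTorch, with layer sizes borrowed from the classic DQN setup of four stacked 84x84 frames; treat the details as illustrative assumptions rather than a faithful reproduction):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """CNN that maps a stack of 4 preprocessed 84x84 Atari frames to Q-values."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 9x9 -> 7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per discrete action
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # Scale raw pixel intensities to [0, 1] before the convolutions.
        return self.head(self.features(pixels / 255.0))

# Example: Montezuma's Revenge uses the full 18-action Atari set.
q_net = DQN(num_actions=18)
frames = torch.randint(0, 256, (1, 4, 84, 84), dtype=torch.uint8).float()
q_values = q_net(frames)  # shape: (1, 18)
```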
DQN performed very poorly in Montezuma's Revenge because of the game's sparse rewards and the complexity of the required action sequences. From its starting position, the agent must execute a series of precise actions to reach the key, the first reward in the game. Since rewards are only given when the key is collected and when a door is unlocked, the agent has to rely purely on random exploration to find the key, and the probability of randomly executing the correct sequence of actions from the starting state is extremely low. This sparse-reward problem makes it difficult for traditional RL methods like DQN, which depend on frequent rewards to learn effectively. The credit-assignment problem complicates learning further: the agent struggles to determine which of the many exploratory actions it took contributed to achieving the reward. Montezuma's Revenge therefore represents a hard exploration problem, where finding effective strategies requires overcoming significant obstacles related to sparse rewards and the propagation of reward information back through long sequences of actions. As a result, DQN and similar agents often fail to progress beyond the initial stages of the game.
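For a rough sense of scale: if reaching the key takes on the order of a hundred well-chosen steps and the agent samples uniformly from the 18 Atari actions at each step, a purely random rollout collects that first reward with probability on the order of (1/18)^100, so in practice the agent essentially never observes a non-zero reward to learn from. (These numbers are purely illustrative, not measurements of the game itself.)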
To address this challenge, OpenAI introduced RND as a form of intrinsic motivation: curiosity. The main idea is to reward the agent not just for achieving explicit goals but also for exploring new, unknown states. RND uses two networks: a fixed, randomly initialized target network and a predictor network. The fixed network outputs a random projection of the state, while the predictor network attempts to predict that projection. The difference between the predicted and actual outputs (the prediction error) serves as an intrinsic reward. When the agent encounters a novel state, the prediction error is high, encouraging further exploration. One of RND's notable advantages is its robustness to deceptive states, such as the noisy-TV problem: in scenarios where random distractions could mislead the agent, RND stays focused on genuine exploration rather than being trapped by random, high-variance states. The application of RND has shown remarkable improvements in exploration-heavy tasks. In Montezuma's Revenge, the RND agent outperformed traditional RL agents by a large margin, showing the potential of curiosity-driven learning for complex, sparse-reward scenarios.
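The core mechanism is simple enough to sketch in a few lines of PyTorch. The following is a schematic of the idea, not OpenAI's implementation; the observation size, network widths, and learning rate are placeholder assumptions, and the real method also normalizes observations and rescales the intrinsic reward:

```python
import torch
import torch.nn as nn

OBS_DIM = 84 * 84   # assumed flattened observation size, for illustration
EMBED_DIM = 128     # dimensionality of the random projection

def make_net() -> nn.Module:
    return nn.Sequential(
        nn.Linear(OBS_DIM, 256), nn.ReLU(),
        nn.Linear(256, EMBED_DIM),
    )

# Target network: randomly initialized and never trained.
target = make_net()
for p in target.parameters():
    p.requires_grad_(False)

# Predictor network: trained to match the target's output on visited states.
predictor = make_net()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Prediction error on the random projection; high for novel states."""
    with torch.no_grad():
        target_feat = target(obs)
    pred_feat = predictor(obs)
    error = (pred_feat - target_feat).pow(2).mean(dim=-1)

    # Training the predictor on visited states drives its error (and hence
    # the curiosity bonus) down for states seen often, so only genuinely
    # novel observations keep generating large intrinsic rewards.
    loss = error.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return error.detach()  # added as a bonus on top of the extrinsic reward

# Example: a batch of (fake) observations.
obs_batch = torch.rand(32, OBS_DIM)
bonus = intrinsic_reward(obs_batch)  # shape: (32,)
```

Because the target is a fixed, deterministic function of the observation, the predictor's error can in principle be driven down on states the agent visits often, which is part of why this bonus is less easily hijacked by irreducible noise than errors from predicting the environment's next state.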
Now, picture an AI agent stepping into the chambers of Montezuma's Revenge, only to start an unexpected dance with the skulls scattered around. The agent, driven by its curiosity-based RND algorithm, wasn't just avoiding traps or collecting keys: because the novelty of these unsettling interactions produced large prediction errors, and therefore intrinsic reward, it kept weaving around the skulls as if dancing with joy.
You can watch the dance and the agent's full performance in the YouTube video linked below:
RND represented a meaningful addition to the RL toolkit, highlighting the importance of intrinsic rewards in enhancing exploration. By encouraging curiosity, RND enabled the AI agent to "dance with the skulls" of challenging environments, unlocking new levels of performance and understanding.
References:
University of Bath
Sutton, R.S. and Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT Press.