Project in Reinforcement Learning
Introduction
In this project on autonomous systems, the goal was to learn and become familiar with the concepts of Reinforcement Learning (RL). The open-source "Unity Machine Learning Agents Toolkit" provided the necessary environment, enabling simulations in which intelligent agents are trained with our own implemented algorithms. We chose the Worm domain, in which the worm (agent) has to reach the green target object in its environment.
Initially, the worm has no prior knowledge of the environment. In the first few simulations it does not know how to behave in order to achieve its goal.
By taking actions and observing the environment, the agent receives rewards when it crawls towards or reaches the green goal. After numerous episodes of simulation, it steadily learns which actions to take, eventually leading it to reach the target.
We implemented the classic algorithms Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C). The primary goal of these implementations was to explore the methods and thereby gain hands-on experience with basic RL concepts.
Proximal Policy Optimization (PPO)
A policy, by definition, is the agent's way of behaving at a given time: given a particular state, the policy describes which action to take. After executing an action, the agent collects this experience and updates its policy.
The key contribution of PPO is ensuring that a policy update does not move the new policy too far away from the previous one. This reduces variance in training at the cost of some bias, leading to smoother training and preventing the agent from going down an unrecoverable path of senseless actions.
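PPO achieves this with a clipped surrogate objective: the probability ratio between the new and the old policy is clipped so that large deviations are not rewarded. The snippet below is a minimal PyTorch sketch of that loss; the clip value of 0.2 and all tensor names are illustrative and not taken from our repository.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Sketch of the PPO clipped surrogate objective.

    new_log_probs / old_log_probs: log pi(a|s) under the updated and the
    previous policy for the sampled actions; advantages: their estimated
    advantages. clip_eps = 0.2 is the value suggested in the PPO paper.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum keeps the update pessimistic:
    # the objective never benefits from moving far away from the old policy.
    return -torch.min(unclipped, clipped).mean()

Maximising this objective (here returned as a loss to minimise) is what keeps successive policies close to each other during training.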
The training over many simulations (episodes) is plotted below. The reward increases over successive batches of episodes, meaning that over time the worm agent learns how to behave "better" in order to reach its goal.
Advantage Actor-Critic (A2C) - a two-model algorithm
The idea of having two models interact with (or compete against) each other has become increasingly popular in machine learning in recent years.
The Actor takes the state as input and outputs the best action. It essentially controls how the agent behaves, as described by its policy.
The Critic, on the other hand, evaluates the action taken by the Actor. This evaluation is then used to update the Actor's policy.
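The sketch below shows how these two models can fit together in code. The network sizes, the Gaussian action distribution, and the loss weighting are assumptions chosen for illustration, not the exact setup of our Worm agent.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic pair: one network proposes actions,
    the other estimates how good the current state is."""

    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.actor = nn.Sequential(   # state -> action mean
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.critic = nn.Sequential(  # state -> value estimate
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        return dist, self.critic(obs).squeeze(-1)

def a2c_loss(dist, value, action, reward_to_go):
    """The critic's value estimate tells the actor whether its action
    was better or worse than expected (the advantage)."""
    advantage = reward_to_go - value
    actor_loss = -(dist.log_prob(action).sum(-1) * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()   # fit the value to the return
    return actor_loss + 0.5 * critic_loss

The actor is updated in the direction the critic judges to be better than expected, while the critic simply learns to predict the observed returns.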
After numerous batches of episodes, the reward increases here as well, meaning that the agent with our implemented A2C algorithm also learns steadily by updating its policy.
Interesting Discoveries
Hyperparameters such as the learning rate, batch size, and the topology and size of the neural network are values in the implementation that control the learning process and cannot be learned by the agent itself. Consequently, suitable hyperparameter values had to be found in order to achieve the highest rewards.
As the following graph shows, different learning rates led to significantly different outcomes. A higher learning rate makes the agent learn faster but introduces more variance into the learning process; a smaller learning rate is smoother but requires more simulations and more time for the agent to learn.
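As a rough illustration, such a sweep can be as simple as re-running the same training setup with different optimizer settings. The stand-in network, the three learning rates, and the omitted training loop below are placeholders rather than the values used in our experiments.

import torch

def make_policy():
    # Stand-in policy network; the real networks live in our implementation.
    return torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.Tanh(),
                               torch.nn.Linear(32, 9))

# The learning rate is handed to the optimizer up front and stays fixed;
# unlike the policy weights, it is never adjusted by the agent itself.
for lr in (1e-3, 3e-4, 1e-4):
    policy = make_policy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    # ... run the training loop with this optimizer and record the rewards ...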
Further Notes
We would like to thank the organizers Thomy Pham and Fabian Ritz for enabling us to dive deep into such a fascinating topic!
Visit our github repository: https://github.com/charlola/autonomous-systems
Team:
Dominik Fuchs, Oliver Palotás, Patrick Suchostawski, Georg Staber, Alexander Welling & Charlotte Vaessen