Project in Reinforcement Learning
Introduction
In this project on autonomous systems, the goal was to learn and become familiar with the concepts of Reinforcement Learning (RL). The open-source "Unity Machine Learning Agents Toolkit" provided the necessary environment, enabling simulations in which intelligent agents are trained with our own implemented algorithms. We chose the Worm domain, in which the worm (agent) has to reach the green target object in its environment.
Initially, the worm has no prior knowledge of the environment. In the first few simulations it does not know how to behave in order to achieve its goal.
By taking actions and observing the environment, the agent receives rewards when it crawls towards or reaches the green goal. After numerous episodes of simulation, it steadily learns which actions to take, eventually leading it to reach the target.
We implemented the classic algorithms Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C). The primary goal of these implementations was to explore the methods and thereby gain hands-on experience with basic RL concepts.
Proximal Policy Optimization (PPO)
A policy, by definition, is the agent's way of behaving at a given time: given a particular state, the policy describes which action to take. After executing an action, the agent collects this experience and updates its policy.
The key contribution of PPO is ensuring that a policy update does not move the new policy too far away from the previous one. This reduces variance in training at the cost of some bias, leading to smoother training and preventing the agent from going down an unrecoverable path of senseless actions.
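PPO achieves this with a clipped surrogate objective: the probability ratio between the new and the old policy is clipped so that large deviations are not rewarded. The snippet below is a minimal PyTorch sketch of that loss; the clip value of 0.2 and all tensor names are illustrative and not taken from our repository.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Sketch of the PPO clipped surrogate objective.

    new_log_probs / old_log_probs: log pi(a|s) under the updated and the
    previous policy for the sampled actions; advantages: their estimated
    advantages. clip_eps = 0.2 is the value suggested in the PPO paper.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum keeps the update pessimistic:
    # the objective never benefits from moving far away from the old policy.
    return -torch.min(unclipped, clipped).mean()

Maximising this objective (here returned as a loss to minimise) is what keeps successive policies close to each other during training.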
The training over many simulations (episodes) is plotted below. The reward increases over successive batches of episodes, meaning that over time the worm agent learns how to behave "better" in order to reach its goal.
Advantage Actor-Critic (A2C) - a two-model algorithm
The idea of having two models interact with (or compete against) each other has become increasingly popular in machine learning in recent years.
The Actor takes the state as input and outputs the best action. It essentially controls how the agent behaves, as described by its policy.
The Critic, on the other hand, evaluates the action taken by the Actor. This evaluation is then used to update the Actor's policy.
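The sketch below shows how these two models can fit together in code. The network sizes, the Gaussian action distribution, and the loss weighting are assumptions chosen for illustration, not the exact setup of our Worm agent.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic pair: one network proposes actions,
    the other estimates how good the current state is."""

    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.actor = nn.Sequential(   # state -> action mean
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.critic = nn.Sequential(  # state -> value estimate
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        return dist, self.critic(obs).squeeze(-1)

def a2c_loss(dist, value, action, reward_to_go):
    """The critic's value estimate tells the actor whether its action
    was better or worse than expected (the advantage)."""
    advantage = reward_to_go - value
    actor_loss = -(dist.log_prob(action).sum(-1) * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()   # fit the value to the return
    return actor_loss + 0.5 * critic_loss

The actor is updated in the direction the critic judges to be better than expected, while the critic simply learns to predict the observed returns.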
After numerous batches of episodes, the reward increases here as well, meaning that the agent with our implemented A2C algorithm also learns steadily by updating its policy.
Interesting Discoveries
Hyperparameters such as the learning rate, batch size, and the topology and size of the neural network are values in the implementation that control the learning process and cannot be learned by the agent itself. Consequently, suitable hyperparameter values had to be found in order to achieve the highest rewards.
As the following graph shows, different learning rates led to significantly different outcomes. A higher learning rate makes the agent learn faster but introduces more variance into the learning process; a smaller learning rate is smoother but requires more simulations and more time for the agent to learn.
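As a rough illustration, such a sweep can be as simple as re-running the same training setup with different optimizer settings. The stand-in network, the three learning rates, and the omitted training loop below are placeholders rather than the values used in our experiments.

import torch

def make_policy():
    # Stand-in policy network; the real networks live in our implementation.
    return torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.Tanh(),
                               torch.nn.Linear(32, 9))

# The learning rate is handed to the optimizer up front and stays fixed;
# unlike the policy weights, it is never adjusted by the agent itself.
for lr in (1e-3, 3e-4, 1e-4):
    policy = make_policy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    # ... run the training loop with this optimizer and record the rewards ...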
Further Notes
We would like to thank the organizers Thomy Pham and Fabian Ritz for enabling us to dive deep into such a fascinating topic!
Visit our github repository: https://github.com/charlola/autonomous-systems
Team:
Dominik Fuchs, Oliver Palotás, Patrick Suchostawski, Georg Staber, Alexander Welling & Charlotte Vaessen