Reinforcement learning

Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions and learn through trial and error.

The basic concept is explained by the Markov decision process (MDP).

There are 5 main components.

Environment: The environment is the surroundings with which the agent interacts, for example the room where a robot moves. The agent cannot manipulate the environment; it can only control its own actions. In other words, a car can't control where the lines are on the road, but it can act around them.

Agent: An agent is the entity which we are training to make correct decisions. For example, a ball that is being trained to move around a maze and find the exit.

State: The state defines the current situation of the agent. This can be the exact position of the ball in the maze. It all depends on how you frame the problem.

Action: The choice that the agent makes at the current time step. For example, the ball can move Up, Right, Down, or Left. The set of actions the agent can perform is defined in advance.

Policy: A policy is the thought process behind picking an action. It's a probability distribution over the set of actions: highly rewarding actions get a high probability, and vice versa. A low-probability action can still be picked; it is just less likely.
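As a quick sketch (the numbers here are made up for illustration), sampling an action from such a distribution with numpy could look like this:

import numpy as np

actions = ["Up", "Right", "Down", "Left"]
action_probs = np.array([0.1, 0.6, 0.2, 0.1])   # hypothetical policy for one state

# a low-probability action can still be sampled, just less often
print(np.random.choice(actions, p=action_probs))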



The agent takes an action, which changes its state in the environment, and for that action it receives a reward. So each action has a reward, and the goal of the agent is to maximize the total reward through the policy it chooses.

The agent is in state S(0), and it can take action A(0) or A(1) and land in state S(0) or S(1) depending on the action.
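As a minimal sketch, the two-state example above can be written as a transition table (the rewards here are invented for illustration):

# maps (state, action) -> (next_state, reward); the rewards are made up
transitions = {
    ("S0", "A0"): ("S0", 0),
    ("S0", "A1"): ("S1", 1),
}

state = "S0"
next_state, reward = transitions[(state, "A1")]   # take action A1 in S0
print(next_state, reward)                         # S1 1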


There are 2 types of decision processes.

Finite: Where actions are predefined and states are limited, like in a maze.

Infinite: Here we may have an infinite number of states, like a car on the road, where speed and location can take any value.

Moreover, an MDP can be either episodic or continuous.

An episodic process terminates at some point in time (like a game of chess).

A continuous process never ends; it simply keeps going.


Trajectory: the trace generated as the agent moves from one state to another.


Episode: simply a trajectory that starts in the initial state and ends in the final state.
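A sketch of collecting one episode's trajectory, assuming an env object with the classic Gym-style reset()/step() API (the article's internal Maze env, used later, may differ):

trajectory = []                                   # list of (state, action, reward) tuples
state = env.reset()
done = False
while not done:
    action = np.random.choice(4)                  # random action for now
    next_state, reward, done, _ = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state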


Reward: the goals of the task are represented by the rewards that the environment gives the agent in response to its actions. To solve the task in the best way possible, we want to maximize the sum of those rewards. The reward is the immediate result that our actions produce.


Return: the sum of rewards that the agent obtains from a certain point in time (t) until the task is completed. Since we want to maximize the long-term sum of rewards, we can also say that we want to maximize the expected return.
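In standard textbook notation (not taken verbatim from the article), the undiscounted return from time t is

G_t = R_{t+1} + R_{t+2} + \dots + R_T

where T is the final time step of the episode.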



Discount factor: a per-step reduction of the reward that motivates the agent to pick the most efficient actions. In the maze problem, every time the agent fails to find the exit and end the game, the reward is 0; only when the agent finds the exit is the reward 1. With the discount factor, that final reward shrinks with the number of steps it takes to reach it, so to maximize the return the agent will try to find the shortest path.
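With a discount factor \gamma between 0 and 1, this becomes the standard discounted return

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

so later rewards are worth less, which is exactly what pushes the agent toward the shortest path.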

Policy: The policy is a function that takes as input a state and returns the action to be taken in that state.


Now on to the code.


import numpy as np
import matplotlib.pyplot as plt


from envs import Maze                                    # internal helper module (envs)
from utils import plot_policy, plot_values, test_agent   # internal helper module (utils)

env = Maze()
        

Display the available states and actions

print(f"Observation space shape: {env.observation_space.nvec}")
print(f"Number of actions: {env.action_space.n}")

>>> Observation space shape: [5 5]
>>> Number of actions: 4

Now create an array that will later be populated with the correct probability of taking each action. For now, for each of the 5x5 states we have 4 actions, each with probability 0.25.

policy_probs = np.full((5, 5, 4), 0.25)
plot_policy(policy_probs, frame)        


Let's define a function "policy" which takes a state as input and returns the probabilities of the actions.

def policy(state):
    return policy_probs[state]

Let's see how the maze is tackled by random actions, where the probability of each action is 0.25.
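This can be run with the test_agent helper imported above (the same call the article uses at the end); at this point the policy is still uniform, so the agent just wanders:

test_agent(env, policy)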



Now we will create a table to hold the value (return) for each cell. Initially it is all zeros.

state_values = np.zeros(shape=(5,5))

plot_values(state_values, frame)        

As we can see, all values (returns) are 0 to start with.

In this function, we will iterate over the maze until the value changes are smaller than a threshold (theta). The reward is discounted at each step by a factor gamma, and the per-step reward we give the agent is -1 while the goal is not reached and 0 when it is reached. More like a punishment than a reward.
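In equation form, each sweep applies the Bellman optimality backup to every cell:

V(s) \leftarrow \max_{a} \left[ r(s, a) + \gamma \, V(s') \right]

where s' is the state reached by taking action a from s, and delta tracks the largest change in V during the sweep.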


def value_iteration(policy_probs, state_values, theta=1e-6, gamma=0.99):
    # start with a large delta so we enter the loop
    delta = float('inf')

    while delta > theta:
        delta = 0
        # plot the current value grid (internal plotting helper)
        plot_values(state_values, frame)
        # sweep the whole 5x5 grid
        for row in range(5):
            for col in range(5):
                old_value = state_values[(row, col)]
                # this is what we are looking for
                action_probs = None
                max_qsa = float('-inf')

                # we have 4 possible actions
                for action in range(4):
                    # simulate taking the action and observe the reward and next state
                    next_state, reward, _, _ = env.simulate_step((row, col), action)
                    # the return of the simulated action; gamma is the discount factor
                    qsa = reward + gamma * state_values[next_state]
                    # keep the best of the 4 actions
                    if qsa > max_qsa:
                        max_qsa = qsa
                        action_probs = np.zeros(4)
                        # assign 100% probability to the action with the max return
                        action_probs[action] = 1.

                # update the value estimate
                state_values[(row, col)] = max_qsa
                # update the policy probabilities
                policy_probs[(row, col)] = action_probs

                # track the largest change, so we know when to exit the loop
                delta = max(delta, abs(max_qsa - old_value))

Let's execute the function to find the value of each cell in the maze.

value_iteration(policy_probs, state_values)        

[Value grids plotted after each sweep: the initial values, the first iteration, the second, the third, and finally the converged values after a few more iterations.]

Remember that in doing this we also populated our "policy", which contains the probability of each action in every state.

plot_policy(policy_probs, frame)        


Time to check the maze agent in action

test_agent(env, policy)        


The above example is the most basic approach: we explore the whole maze and compute the return for each cell. There are many extensions and improvements to this algorithm that optimize the q-values, the rewards, or the randomization in the policy, all of which help make learning faster.
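For example, Q-learning (a standard extension, sketched here rather than taken from the article) learns action values from sampled transitions instead of sweeping the whole grid with a model:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where \alpha is the learning rate; unlike the value iteration above, it does not need the simulate_step access to the environment's model.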

