Reinforcement learning
Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions and learn through trial and error.
The basic concept is explained by the Markov decision process (MDP).
There are 5 main components.
Environment: The environment is the surroundings with which the agent interacts. For example, the room where the robot moves. The agent cannot manipulate the environment; it can only control its own actions. In other words, a car can’t control where the lines are on the road, but it can act within them.
Agent: An agent is the entity we are training to make correct decisions. For example, a ball that is being trained to move around a maze and find the exit.
State: The state defines the current situation of the agent. This can be the exact position of the ball in the maze. It all depends on how you frame the problem.
Action: The choice that the agent makes at the current time step. For example, the ball can move Up, Right, Down or Left. The set of actions the agent can perform is defined in advance.
Policy: A policy is the thought process behind picking an action. It’s a probability distribution assigned to the set of actions: highly rewarding actions get a high probability and vice versa. If an action has a low probability, it doesn’t mean it won’t be picked at all; it’s just less likely to be picked. A short sketch of this sampling follows below.
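As a minimal sketch (the probabilities here are made up purely for illustration), picking an action under such a policy is just sampling from its probability distribution:
import numpy as np

# Hypothetical probabilities for the 4 maze actions: Up, Right, Down, Left
action_probs = np.array([0.1, 0.6, 0.2, 0.1])

# Sample one action index; Right (index 1) is the most likely, but the others can still be picked
action = np.random.choice(4, p=action_probs)
print(action)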
The agent takes an action, which changes its state in the environment, and for that action the agent gets a reward. So each action has a reward. The goal of the agent is to maximize the rewards with the policy it chooses.
The agent is in state S(0), can take action A(0) or A(1), and lands in state S(0) or S(1) depending on the action.
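As a rough sketch (the states, actions and reward values here are invented purely for illustration), that transition structure can be written as a small lookup table:
# Hypothetical two-state MDP: in S0 the agent can take A0 or A1
# and either stay in S0 or move to S1, collecting a reward.
transitions = {
    ("S0", "A0"): ("S0", 0.0),   # (next state, reward) -- illustrative values only
    ("S0", "A1"): ("S1", 1.0),
    ("S1", "A0"): ("S0", 0.0),
    ("S1", "A1"): ("S1", 0.0),
}

state = "S0"
action = "A1"
next_state, reward = transitions[(state, action)]
print(next_state, reward)   # S1 1.0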
There are 2 types of decision processes.
Finite: Actions are predefined and states are limited, like a maze.
Infinite: Here we may have an infinite number of states, like a car on the road, where speed and location can take any value.
Moreover, an MDP can be either episodic or continuous.
An episodic process will terminate at some point in time (like a game of chess).
A continuous process never ends; it simply keeps going.
Trajectory: is the trace generated when the agent moves from one state to another.
Episode: is simply a trajectory that starts in the initial state and ends in the final state.
Reward: the goals of the task are represented by the rewards that the environment gives the agent in response to its actions. So in order to solve the task in the best way possible, we want to maximize the sum of those rewards. Reward is the immediate result that our actions produce.
Return: is the sum of rewards that the agent obtains from a certain point in time (t) until the task is completed. Since we want to maximize the long-term sum of rewards, we can also say that we want to maximize the expected return.
Discount factor: is the reduction applied to rewards with each step, to motivate the agent to select the most efficient actions. In the maze problem, every time the agent is not able to find the exit and end the game, the reward is 0; only when the agent finds the exit is the reward 1. With the discount factor, however, the final reward is reduced according to the number of steps it takes to reach it, so to maximize the return the agent will try to find the shortest path (a small worked example follows after this list).
Policy: The policy is a function that takes as input a state and returns the action to be taken in that state.
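To make the discounting concrete, here is a small sketch (the reward sequences are invented for illustration) that computes the return G = r0 + gamma*r1 + gamma^2*r2 + ... for two paths of different lengths; the shorter path to the exit ends up with the higher return:
gamma = 0.9

def discounted_return(rewards, gamma):
    # G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reaching the exit (reward 1) in 3 steps vs. 6 steps (reward 0 until the exit)
short_path = [0, 0, 1]
long_path = [0, 0, 0, 0, 0, 1]
print(discounted_return(short_path, gamma))  # 0.81
print(discounted_return(long_path, gamma))   # ~0.59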
To the code
import numpy as np
import matplotlib.pyplot as plt
from envs import Maze # internal code (envs)
from utils import plot_policy, plot_values, test_agent # internal code utils
env = Maze()
Display the available states and actions
print(f"Observation space shape: {env.observation_space.nvec}")
print(f"Number of actions: {env.action_space.n}")
>>> Observation space shape: [5 5]
>>> Number of actions: 4
Now create an array which will be populated later with the correct probability of taking each action. Right now, for each state (5x5) we have 4 actions, each with probability 0.25.
policy_probs = np.full((5, 5, 4), 0.25)
plot_policy(policy_probs, frame)
Let's define a function "policy" which takes a state as input and returns the probabilities of taking each action.
def policy(state):
    return policy_probs[state]
Let's see how the maze is solved by random actions, where the probability of each action is 0.25.
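One way to watch this random policy is to reuse the test_agent helper from utils that is also called at the end of this article (assuming it simply rolls out the given policy in the environment):
test_agent(env, policy)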
Now we will create a table where we will keep the return values for each cell. Initially it is all zeros.
state_values = np.zeros(shape=(5,5))
plot_values(state_values, frame)
As we can see, all values (returns) are 0 to start with.
In this function, we will iterate over the maze until the value changes are smaller than a threshold (theta). Future values are discounted at each step by a factor gamma, and the reward we give the agent at each step is -1 if the goal is not reached and 0 when the goal is reached. More like a punishment than a reward.
def value_iteration(policy_probs, state_values, theta=1e-6, gamma=0.99):
    # Start with a large delta so we enter the loop.
    delta = float('inf')
    while delta > theta:
        delta = 0
        plot_values(state_values, frame)
        # Sweep over every cell of the 5x5 maze.
        for row in range(5):
            for col in range(5):
                old_value = state_values[(row, col)]
                # This is what we are looking for: the best action for this cell.
                action_probs = None
                max_qsa = float('-inf')
                # We have 4 possible actions.
                for action in range(4):
                    # Simulate taking the action from this cell and see what the next state and reward would be.
                    next_state, reward, _, _ = env.simulate_step((row, col), action)
                    # This is our return for the simulated action; gamma is the discount factor.
                    qsa = reward + gamma * state_values[next_state]
                    # Keep track of the maximum return over the 4 actions.
                    if qsa > max_qsa:
                        max_qsa = qsa
                        action_probs = np.zeros(4)
                        # Assign 100% probability to the action with the maximum return.
                        action_probs[action] = 1.
                # Update the return for this cell.
                state_values[(row, col)] = max_qsa
                # Update the probabilities for this cell.
                policy_probs[(row, col)] = action_probs
                # Track the largest change so we know when to exit the loop.
                delta = max(delta, abs(max_qsa - old_value))
Let's execute the function to find the return values for each cell in the maze.
value_iteration(policy_probs, state_values)
(Plot sequence: initial values of the returns, after the first loop, second iteration, third iteration, and the final values after a couple more iterations.)
Remember that in doing this we also populated our "policy", which contains the probability distribution over actions for each cell.
plot_policy(policy_probs, frame)
Time to check the maze agent in action
test_agent(env, policy)
The above example is the most basic approach, where we explore the whole maze and identify the return for each cell. There are many extensions and improvements to these algorithms that optimize the Q-values, the rewards, and the randomization in the policy, which helps make the learning faster.
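As one hedged illustration of those extensions (not part of the maze example above), a tabular Q-learning update with an epsilon-greedy policy looks roughly like the sketch below; the environment interface is assumed to follow the usual reset/step pattern, and all names and parameters here are hypothetical.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table: estimated return for each (state, action) pair
    q_table = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done, _ = env.step(action)
            # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
            td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state = next_state
    return q_table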