Reinforcement learning with Neural Network use case

Introduction:

Reinforcement learning is learning from interaction with the environment. Here the learner is called the Agent. Everything outside the Agent is called the Environment. The Agent performs actions continuously and the Environment responds to all those actions and presents new situations to the Agent. Furthermore, the Environment gives feedback for every action in the form of a numeric value called a reward. The Agent's goal is to maximize this reward. A complete specification of an environment defines a task, which is one instance of the reinforcement learning problem.

Moreover, the Agent and Environment interact at discrete time steps t = 0, 1, 2, 3, ... At each time step t, the Agent receives some representation of the Environment's state, S_t ∈ S, where S is the set of possible states. On that basis, it selects an action, A_t ∈ A(S_t), where A(S_t) is the set of actions available in state S_t. One time step later, in part as a consequence of its action, the Agent receives a numerical reward, R_{t+1} ∈ R, and finds itself in a new state, S_{t+1}.
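
To make this interaction concrete, here is a minimal sketch of that loop in Java. The Environment, Agent, and StepResult types are hypothetical and exist only for illustration; they are not part of any particular library.

// Minimal sketch of the agent-environment loop described above.
interface Environment {
    int reset();                    // returns the initial state S_0
    StepResult step(int action);    // applies A_t, returns R_{t+1} and S_{t+1}
    boolean isTerminal();
}

class StepResult {
    final double reward;   // R_{t+1}
    final int nextState;   // S_{t+1}
    StepResult(double reward, int nextState) { this.reward = reward; this.nextState = nextState; }
}

interface Agent {
    int selectAction(int state);    // choose A_t from A(S_t)
    void observe(int state, int action, double reward, int nextState);
}

class InteractionLoop {
    static double runEpisode(Environment env, Agent agent) {
        int state = env.reset();
        double totalReward = 0.0;
        while (!env.isTerminal()) {
            int action = agent.selectAction(state);    // A_t
            StepResult result = env.step(action);      // Environment responds
            agent.observe(state, action, result.reward, result.nextState);
            totalReward += result.reward;              // the Agent tries to maximize this
            state = result.nextState;                  // move on to S_{t+1}
        }
        return totalReward;
    }
}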

Let's take an example of GridWorld; people who are into reinforcement learning love to think about it. The GridWorld shown in Figure 1.9 is a 3 x 4 grid. For the purpose of this discussion, think of this world as a kind of game: you start from a state called the start state and you are able to execute actions, in this case up, down, left, and right. Here, the green square represents your goal, the red square represents failure, and the black square is one you cannot enter; it acts as a wall. If you reach the green square (the goal), the world is over and you begin from the start state again. The same holds for the red square: if you reach the red square (failure), the world is over and you have to start over again. This means you cannot go through the red square to get to the green square. The purpose here is to roam around this world in such a way that you eventually reach the goal state and under all circumstances avoid the red square. You can go up, down, left, and right, but if you are on a boundary state such as (1,3) and you try to go up or left, you just stay where you are. If you try to go right, you end up in the next square.
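
For concreteness, the layout just described can be written down as a small map. The sketch below is only my own rendering of that layout following the description above; the symbols and coordinates are not taken from any library.

// A plain-text layout of the 3 x 4 GridWorld described above:
// S = start state, G = goal (green), F = failure (red), # = wall (black), . = empty square
class GridWorldMap {
    static final char[][] MAP = {
        {'.', '.', '.', 'G'},   // top row: goal in the top-right corner
        {'.', '#', '.', 'F'},   // middle row: the wall and the failure square
        {'S', '.', '.', '.'}    // bottom row: start in the bottom-left corner
    };
}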

What is the shortest sequence of actions that gets us from the start state to the goal state? There are two options:

  1. Up, up, right, right, right
  2. Right, right, up, up, right

Both answers are correct, taking five steps to reach the goal state.

The previous question was very easy because each time you take an action, it does exactly what you expect it to do. Now let's introduce a little bit of uncertainty into this GridWorld problem. When you execute an action, it executes correctly with a probability of 0.8. This means 80 percent of the time when you take an action, it works as expected and goes up, down, right, or left. But 20 percent of the time, it incorrectly causes you to move at a right angle to the intended direction: if you move up, there is a probability of 0.1 (10 percent) of going left and 0.1 (10 percent) of going right. Now, given this uncertainty, how reliable is the sequence up, up, right, right, right at getting you to the goal state?
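
As a rough illustration of this noisy action model (the class and method names here are made up purely for the sketch), sampling the action that actually happens could look like this:

import java.util.Random;

// The intended move happens with probability 0.8; with probability 0.1 each,
// the agent slips 90 degrees to the left or to the right of the intended direction.
class NoisyGridWorld {
    // Directions: 0 = up, 1 = right, 2 = down, 3 = left
    static int sampleActualDirection(int intended, Random rng) {
        double r = rng.nextDouble();
        if (r < 0.8) {
            return intended;              // action works as expected
        } else if (r < 0.9) {
            return (intended + 3) % 4;    // slip 90 degrees to the left
        } else {
            return (intended + 1) % 4;    // slip 90 degrees to the right
        }
    }
}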

To calculate it, we need to do some math. The correct answer is 0.32776. Let me explain how this value is computed. From the start state we need to go up, up, right, right, right. Each of those actions works as it is supposed to with a probability of 0.8, so (0.8)^5 = 0.32768. Now we have the probability that the entire sequence works as intended. As you may have noticed, 0.32768 is not equal to the correct answer of 0.32776; there is a very small difference of 0.00008. This difference comes from the uncertainty: we also need to account for the probability of reaching the goal even when the sequence does not execute as intended.

Let's go through this again. Is there any way you could have ended up at the goal from that sequence of commands without following the intended path? Actions can have unintended consequences, and they often do. Suppose you are in the start state and you go up in the first step; there is a probability of 0.1 that you will actually go to the right. From there, if you go up, there is again a probability of 0.1 that you will actually go to the right.

From there, the next thing we do is take the right action as per our intended sequence, but that can actually go up with a probability of 0.1. Then the next right action can again cause an up to happen with probability 0.1. And finally, the last right might execute correctly with a probability of 0.8 to bring us to the goal state: 0.1 × 0.1 × 0.1 × 0.1 × 0.8 = 0.00008. Now add both of them and you get the correct answer: 0.32768 + 0.00008 = 0.32776.

What we did in the first case was come up with the sequence up, up, right, right, right, planned out in a world where nothing can go wrong; it is an ideal world. But once we introduce this notion of uncertainty or randomness, we have to do something other than work out the right answer in advance and then just go. Either we execute the sequence and, whenever we drift away from it, re-plan and come up with a new sequence from wherever we happened to end up, or we come up with some way to incorporate these uncertainties or probabilities up front, so that we never have to re-think when something goes wrong.
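
A tiny snippet to check the arithmetic above (just the calculation, nothing more):

// Probability of the intended path plus the one unintended path that also reaches the goal.
public class SequenceReliability {
    public static void main(String[] args) {
        double intendedPath = Math.pow(0.8, 5);       // 0.32768
        double slipPath = Math.pow(0.1, 4) * 0.8;     // 0.00008
        System.out.println(intendedPath + slipPath);  // prints approximately 0.32776
    }
}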

There is a framework that is very commonly used for capturing these uncertainties directly: it is called a Markov Decision Process (MDP).
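
Roughly speaking, an MDP specifies a set of states, the actions available in each state, the transition probabilities, and the rewards. The interface below is only a hypothetical sketch of those pieces (the names are mine, not from any library); BURLAP's SADomain, used in the sample code later, plays a similar role in practice.

import java.util.List;

// Hypothetical sketch of what an MDP specifies; for illustration only.
interface MarkovDecisionProcess<S, A> {
    List<S> states();                                // the set of states S
    List<A> actions(S state);                        // A(s): actions available in state s
    double transitionProbability(S s, A a, S next);  // P(s' | s, a)
    double reward(S s, A a, S next);                 // R(s, a, s')
}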

Exploration versus exploitation:

Exploration implies behaviors characterized by search, discovery, risk taking, research, and improvement, while exploitation implies behaviors characterized by refinement, implementation, efficiency, production, and selection.

Exploration and exploitation become a major problem when you are learning about the environment while performing many different actions (possibilities). The dilemma is how much more exploration is required, because when you try to explore the environment, you are likely to keep collecting negative rewards. Ideal learning requires that you sometimes make bad choices: sometimes the agent has to perform random actions to explore the environment, and sometimes it gets a positive reward while at other times it gets a smaller or negative one. The exploration-exploitation dilemma is really a trade-off (a minimal code sketch of one way to handle it follows the examples below).

The following are some examples in real life for exploration versus exploitation:

  1. Restaurant selection:
  • Exploitation: Go to your favorite restaurant
  • Exploration: Try a new restaurant
  2. Online banner advertisements:
  • Exploitation: Show the most successful advert
  • Exploration: Show a different advert
  3. Oil drilling:
  • Exploitation: Drill at the best-known location
  • Exploration: Drill at a new location
  4. Game playing:
  • Exploitation: Play the move you believe is best
  • Exploration: Play an experimental move
  5. Clinical trial:
  • Exploitation: Choose the best treatment so far
  • Exploration: Try a new treatment
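
A very common, simple way to handle this trade-off in code is ε-greedy action selection: with a small probability ε the agent explores a random action, and otherwise it exploits the action with the highest estimated value. This is a minimal sketch, assuming you already maintain an array of value estimates for the available actions:

import java.util.Random;

// Epsilon-greedy action selection: explore with probability epsilon,
// otherwise exploit the action with the highest estimated value.
class EpsilonGreedy {
    static int selectAction(double[] estimatedValues, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            // Exploration: try a random action
            return rng.nextInt(estimatedValues.length);
        }
        // Exploitation: pick the action currently believed to be best
        int best = 0;
        for (int a = 1; a < estimatedValues.length; a++) {
            if (estimatedValues[a] > estimatedValues[best]) {
                best = a;
            }
        }
        return best;
    }
}

A typical choice is a small ε such as 0.1, often decayed over time so that the agent explores less as its value estimates improve.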

Neural network and reinforcement learning

How do neural networks and reinforcement learning fit together? What is the relationship between these two topics? Let me explain. The structure of a neural network is like any other kind of network: there are interconnected nodes, called neurons, and edges that join them together. A neural network is organized in layers: an input layer, one or more hidden layers, and an output layer.

In reinforcement learning, convolutional networks are used to recognize an agent's state when the input is visual. Let's take an example: the screen that Mario is on. That is, the network is performing the classical task of image recognition. Don't confuse this with the supervised use of a convolutional network, though: the network derives a different interpretation from images in reinforcement learning than it does in supervised learning. In supervised learning, the network tries to match the image to an output variable or category, that is, it applies a label to the image by mapping names to pixels.

In supervised learning, the network gives the probability of the image with respect to each label: you give it any picture and it predicts, in percentages, the likelihood of it being a cat or a dog. Shown an image of a dog, it might decide that the picture is 75 percent likely to be a dog and 25 percent likely to be a cat.
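
As a rough illustration of where such percentages come from, a network's output layer typically turns raw scores into probabilities with a softmax. The snippet below uses two invented scores, chosen only to reproduce roughly the 75/25 split mentioned above:

// Turning raw output-layer scores into label probabilities with a softmax.
public class SoftmaxExample {
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double sum = 0.0;
        double[] exps = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            exps[i] = Math.exp(scores[i] - max);  // subtract max for numerical stability
            sum += exps[i];
        }
        for (int i = 0; i < exps.length; i++) exps[i] /= sum;
        return exps;
    }

    public static void main(String[] args) {
        double[] scores = {1.1, 0.0};              // invented raw scores for {dog, cat}
        double[] probs = softmax(scores);
        System.out.printf("dog: %.2f, cat: %.2f%n", probs[0], probs[1]);  // about 0.75 and 0.25
    }
}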


Sample code:

package project;

import burlap.domain.singleagent.gridworld.GridWorldDomain;
import burlap.domain.singleagent.gridworld.GridWorldVisualizer;
import burlap.domain.singleagent.gridworld.state.GridAgent;
import burlap.domain.singleagent.gridworld.state.GridLocation;
import burlap.domain.singleagent.gridworld.state.GridWorldState;
import burlap.mdp.core.state.State;
import burlap.mdp.singleagent.SADomain;
import burlap.shell.visual.VisualExplorer;
import burlap.visualizer.Visualizer;

public class HelloWorld {

    public static void main(String[] args) {

        // 11x11 grid world
        GridWorldDomain gridworld = new GridWorldDomain(11, 11);

        // layout with four rooms
        gridworld.setMapToFourRooms();

        // stochastic transitions with a 0.9 success rate
        gridworld.setProbSucceedTransitionDynamics(0.9);

        // now we will create the grid world domain
        SADomain sad = gridworld.generateDomain();

        // initial state setup: agent in the bottom-left corner, goal location in the top-right
        State st = new GridWorldState(new GridAgent(0, 0), new GridLocation(10, 10, "loc0"));

        // now we will set up the visualizer and visual explorer
        Visualizer vis = GridWorldVisualizer.getVisualizer(gridworld.getMap());
        VisualExplorer ve = new VisualExplorer(sad, vis, st);

        // set the control keys "a w d s" to move the agent
        ve.addKeyAction("a", GridWorldDomain.ACTION_WEST, "");
        ve.addKeyAction("w", GridWorldDomain.ACTION_NORTH, "");
        ve.addKeyAction("d", GridWorldDomain.ACTION_EAST, "");
        ve.addKeyAction("s", GridWorldDomain.ACTION_SOUTH, "");

        ve.initGUI();
    }
}
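
Running this class opens a BURLAP visual explorer window showing the four-rooms grid world; you can then steer the agent around with the a, w, d, and s keys and watch the stochastic transitions in action.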
