Samples of Reward Functions for AWS DeepRacer
Introduction
AWS DeepRacer is a 1/18th scale self-driving racing car that can be trained with reinforcement learning. It is a great way to get started with machine learning. Please see Part 1 and Part 2 of this series to learn more about AWS DeepRacer.
In this article, we will present four examples of different reward functions. For each function, we will provide the training configuration so that you can recreate it on your own AWS DeepRacer console.
Example 1
The main idea behind this model came from the sample reward functions that AWS provides for DeepRacer. Here, we adopt two important features from example 1 and example 3 and merge them into a new reward function. By combining and modifying the two functions, we explore how these changes affect the model's performance.
Training configuration
Agent configuration
The agent used for this model has a continuous action space with a steering angle of 30 degrees. The minimum and maximum speeds are set to 0.5 m/s and 4 m/s, respectively.
Hyperparameter
The reinforcement learning algorithm used in this example model is PPO. We have tuned the hyperparameters away from their default values, which helped us explore how each change affects the result. The batch size was changed to 128 to get smoother and more stable updates to the neural network.
We have kept the default value for the entropy and reduced the discount factor. We have used Huber as the loss type since it is more robust to outliers. We have kept the other parameters at their defaults except for the learning rate: since we increased the batch size, the learning time also increased, so we increased the learning rate to shorten it. The training configuration used in this example is shown below.
Reward Function
The reward function checks three conditions when deciding how to reward the model. These conditions are specified as three steps:
Step 1: Distance from the centre
The reward function checks the agent's distance from the centre against five markers. Depending on how far the agent is from the centre, the reward varies: the closer to the centre, the higher the reward. We used five markers to motivate the agent to collect reward points while still giving it more options to finish the track faster.
Step 2: All Wheels on Track
We have used this checkpoint to make sure the agent always stays on the track. As shown in the figure below, we combined these checkpoints using if-else statements.
Step 3: Reasonable Speed Threshold
This condition checks whether the agent is travelling at a reasonable speed. We have used a speed threshold of 1 m/s, which means that slower speeds receive a smaller bonus than higher speeds.
After combining these three checkpoints, we came up with the following reward function:
def reward_function(params):

    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    all_wheels_on_track = params['all_wheels_on_track']
    speed = params['speed']
    SPEED_THRESHOLD = 1

    # Calculate 5 markers farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.20 * track_width
    marker_3 = 0.30 * track_width
    marker_4 = 0.40 * track_width
    marker_5 = 0.5 * track_width

    # Give higher reward if the car is closer to the center line
    if distance_from_center <= marker_1 and all_wheels_on_track:
        reward = 3.0
    elif distance_from_center <= marker_2 and all_wheels_on_track:
        reward = 2.5
    elif distance_from_center <= marker_3 and all_wheels_on_track:
        reward = 1.5
    elif distance_from_center <= marker_4 and all_wheels_on_track:
        reward = 1
    elif distance_from_center <= marker_5 and all_wheels_on_track:
        reward = 0.5
    else:
        reward = 1e-3  # likely crashed / close to off track

    if speed < SPEED_THRESHOLD:
        # Smaller bonus if the car goes too slow
        reward = reward + 0.5
    else:
        # Higher bonus if the car stays on track and goes fast
        reward = reward + 1.0

    return float(reward)
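As a quick sanity check, the function can be called locally with a hand-built params dictionary. The dictionary below is hypothetical and only includes the keys this function reads; during training, the AWS DeepRacer service supplies the full params dictionary automatically.
# Hypothetical local test (not part of the model): the car is within 10% of the
# track width from the centre with all wheels on track and above the speed
# threshold, so it earns 3.0 + 1.0.
sample_params = {
    'distance_from_center': 0.05,
    'track_width': 1.0,
    'all_wheels_on_track': True,
    'speed': 2.0,
}
print(reward_function(sample_params))  # expected output: 4.0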
Reward graph
For the training of this model, we used the re:Invent 2018 track. The initial training time was one hour with a learning rate of 0.0003. After checking the initial graph, we confirmed that this model could reach a 100% track completion rate. We then cloned the model and trained it for another hour with a learning rate of 0.0008. The training graph is shown in the figure below. The average completion percentage during training is around 73 per cent, which went up to 100 per cent during the evaluation period.
Evaluation results
The evaluation achieved 100% track completion across the three laps per evaluation. Still, the completion time was much higher than that of the other models we will explore later in this article. The lowest track completion time we got from this model was 16.666 seconds.
From our understanding, the reward function has some tough penalties that make the agent focus more on completion than on time. We used an "All wheels on track" checkpoint, which is very harsh for the agent: the track has curves where the agent could go faster if we allowed it to stay on the track with only two wheels. We also used a "Distance from the centre" checkpoint, which makes the agent stay near the centreline. These two checks gave the agent more reward than the "SPEED_THRESHOLD" check, so the agent chose to stay on track while compromising on the "SPEED_THRESHOLD" points.
Example 2
The idea is to motivate the agent to keep all of its wheels on the track and go around as efficiently as possible. When all wheels are on the track while the agent moves forward, it gets rewarded, and the reward grows with its speed, so staying on the track at higher speeds becomes more attractive.
Training configuration
Agent configuration
The model's agent implements a continuous action space on the re:Invent 2018 track with camera sensors. The left and right steering angles are both set to 30 degrees. The speed of the agent has been set to a minimum of 1 m/s and a maximum of 2.2 m/s.
Hyperparameters
A set of parameters, also known as hyperparameters, needs to be set before initializing training. The hyperparameters are empirical and need to be adjusted for each training run. Here, some hyperparameters have been changed: the number of epochs, 10 by default, was changed to 6; the loss type was mean squared error by default but was changed to Huber; the learning rate was changed from 0.0004 to 0.0003; and the number of experience episodes between each policy-updating iteration was set to 16 instead of 20.
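Summarized, and assuming the remaining hyperparameters stay at their AWS defaults, the changed settings look roughly as follows. This is only an illustrative summary of the console configuration described above; the dictionary name and keys are ours and are not uploaded with the reward function.
# Illustrative summary of the tuned hyperparameters for Example 2
# (these are set in the AWS DeepRacer console, not in the reward function)
example_2_hyperparameters = {
    'number_of_epochs': 6,                  # default: 10
    'loss_type': 'huber',                   # default: mean squared error
    'learning_rate': 0.0003,                # default: 0.0004
    'episodes_between_policy_updates': 16,  # default: 20
}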
Reward Function?
This reward function is adapted from an external source under the title "SelfMotivator". The function's primary purpose is to encourage the vehicle to travel along the track and arrive at its destination without causing an accident or infringement. This agent was trained for 1 hour. Below is the reward function:
def reward_function(params):

    if params["all_wheels_on_track"] and params["steps"] > 0:
        reward = ((params["progress"] / params["steps"]) * 100) + (params["speed"]**2)
    else:
        reward = 0.01

    return float(reward)
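To get a feel for the formula, consider a hypothetical step where all wheels are on the track, the car has completed 50% of the track in 100 steps, and it is travelling at 2 m/s. The progress-per-step term contributes 50 and the squared speed contributes 4.
# Hypothetical sanity check of the "SelfMotivator" formula
sample_params = {
    'all_wheels_on_track': True,
    'steps': 100,
    'progress': 50.0,  # per cent of the track completed so far
    'speed': 2.0,      # m/s
}
print(reward_function(sample_params))  # (50 / 100) * 100 + 2**2 = 54.0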
Reward graph
From the reward graph, you can see the progress of the model. In this case, we used the re:Invent 2018 track to train the model. As shown in the graph below, the average percentage completion (evaluating) is 100%, the average reward is nearly 70%, and the average percentage completion (training) is 75%.
Evaluation Results
Once the training is completed and the graph is generated, we can go ahead with the evaluation. We selected three laps for the evaluation, and all three laps were 100% completed, as shown below. The fastest track completion time, in this case, is 11.935 seconds.
Example 3
This reward function is adapted from this link with some modifications. The main concept behind this reward function is to write a custom function for the racing track, which is the re:Invent 2018 track in this example. Every action of the agent will take into account its current location on the track. In the following sections, we will dig into the details of this function and elaborate on how it works.
Training configuration
Agent configuration
We have selected a continuous action space with a speed range between 0.5 and 3.5 m/s and a steering angle between -30 and 30 degrees. The following shows the current specifications of the agent.
Reward function
Step 1: Define the desired path and speed
To get the waypoints of the track, we used this GitHub repository provided by the AWS DeepRacer community. The Training_analysis notebook provides visualizations of the waypoints for any selected track. For the re:Invent track, the waypoints are shown in the figure below. The track starts from waypoint 0 and ends at waypoint 69. Now that we have the waypoints, we can draw the desired path to get the best results. That is, we outline a path with minimum fluctuation in the agent's heading angle to keep it as steady as possible.
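As an illustration, the waypoints can also be plotted locally with a few lines of Python. This sketch assumes you have downloaded the track's .npy file from the community repository (the file name reinvent_base.npy and the column layout are our assumptions: each row is taken to store the centre, inner, and outer (x, y) coordinates), so adjust it to match the actual data.
# Minimal sketch: plot the centre-line waypoints of a track with their indices
import numpy as np
import matplotlib.pyplot as plt

track = np.load("reinvent_base.npy")  # hypothetical local path to the track file
center = track[:, 0:2]                # assume the first two columns are the centre line (x, y)

plt.plot(center[:, 0], center[:, 1], ".-", markersize=3)
for i, (x, y) in enumerate(center):
    if i % 5 == 0:                    # label every 5th waypoint to keep the plot readable
        plt.annotate(str(i), (x, y), fontsize=7)
plt.axis("equal")
plt.title("re:Invent 2018 centre-line waypoints")
plt.show()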
After getting the waypoints, we want to plan how the agent should behave at specific points. The following drawing, adapted from here, shows the suggested path used in this reward function. The track is marked with three colours: red where the car should turn left, green where it should go straight ahead, and blue for the right turns. In addition to the agent's position on the track, we have added one extra feature to control the car's speed. The areas highlighted in red are the places where the agent is supposed to slow down, while the areas highlighted in green expect the agent to speed up.
Finally, the following video shows the overall speed and path that we are aiming for the agent to learn to follow.
Step 2: Define the reward function based on the planning
This function specifies three ways to check whether we should reward the agent or penalize it. First, it checks whether all the wheels are on the track. The agent gets more reward by staying inside the track; any time any of the wheels goes off the track, the agent is penalized by 10 points.
Second, it checks the location of the agent based on the waypoints. Since the desired path is specified, we set the attributes of the reward function accordingly: we divide the waypoints into three arrays, left, right, and centre, based on the desired path. After putting all the points into these three arrays, we define the reward as follows:
At any moment, get the closest waypoint in front of the agent and check whether the agent is on the correct side of the track according to the desired path. For instance, if the closest waypoint is 29, which is a point in the right_lane array, and the parameter "is_left_of_center" is false, the agent is rewarded 10 points.
Finally, it checks the speed of the agent based on the waypoints. If the agent is somewhere in the green areas, speeding up gives it a high reward. If the agent keeps a moderate speed in the yellow areas, it is also rewarded. Likewise, if the agent keeps a low speed in the red areas, which are critical due to the turns, it is rewarded. Otherwise, we penalize the agent. Similar to the previous attribute, we divide the waypoints based on the desired path as follows:
Combining all parts together, the final reward function is as follows:
def reward_function(params):

    center_variance = params["distance_from_center"] / params["track_width"]

    # Racing line: waypoints grouped by the desired side of the track
    left_lane = [23,24,50,51,52,53,61,62,63,64,65,66,67,68]

    center_lane = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,25,26,27,28,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,54,55,56,57,58,59,60,69,70]

    right_lane = [29,30,31,32,33,34]

    # Speed: waypoints grouped by the desired speed zone
    fast = [0,1,2,3,4,5,6,7,8,9,25,26,27,28,29,30,31,32,51,52,53,54,61,62,63,64,65,66,67,68,69,70]
    moderate = [33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,55,56,57,58,59,60]
    slow = [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]

    reward = 30

    # Reward staying on the track, penalize going off it
    if params["all_wheels_on_track"]:
        reward += 10
    else:
        reward -= 10

    # Reward being on the desired side of the track for the upcoming waypoint
    if params["closest_waypoints"][1] in left_lane and params["is_left_of_center"]:
        reward += 10
    elif params["closest_waypoints"][1] in right_lane and not params["is_left_of_center"]:
        reward += 10
    elif params["closest_waypoints"][1] in center_lane and center_variance < 0.4:
        reward += 10
    else:
        reward -= 10

    # Reward keeping the desired speed for the upcoming waypoint
    if params["closest_waypoints"][1] in fast:
        if params["speed"] > 1.5:
            reward += 10
        else:
            reward -= 10
    elif params["closest_waypoints"][1] in moderate:
        if params["speed"] > 1 and params["speed"] <= 1.5:
            reward += 10
        else:
            reward -= 10
    elif params["closest_waypoints"][1] in slow:
        if params["speed"] <= 1:
            reward += 10
        else:
            reward -= 10

    return float(reward)
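For intuition, the function can be exercised locally with a hypothetical params dictionary. In the case below, the upcoming waypoint 29 belongs to both the right_lane and fast groups, the car is on the right side of the track with all wheels on it, and the speed is above 1.5 m/s, so each of the three checks adds 10 points to the base reward of 30.
# Hypothetical sanity check: upcoming waypoint 29, right side of the track, fast speed
sample_params = {
    'distance_from_center': 0.1,
    'track_width': 1.0,
    'all_wheels_on_track': True,
    'closest_waypoints': [28, 29],
    'is_left_of_center': False,
    'speed': 2.5,
}
print(reward_function(sample_params))  # 30 + 10 + 10 + 10 = 60.0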
Reward graph
From the reward graph, you can see the progress of the model over 3 hours of training.
Evaluation results
We selected three laps for the evaluation, and all three laps were 100% completed, as shown below. The fastest track completion time, in this case, is 13.260 seconds.
Example 4
Essentially, the idea behind this reward function is to train the agent to remain near the centre by using the track waypoints and observing the agent's heading. Using a reward and penalty system, the function directs the agent back towards the centre if the difference between the track direction and the heading direction of the car exceeds a threshold.
Training configuration
The model's agent implements a continuous action space on the re:Invent 2018 track with camera sensors. The left and right steering angles are both set to 30 degrees. The speed of the agent has been set to a minimum of 1.1 m/s and a maximum of 2 m/s.
A set of parameters is put in place so that the agent can learn and explore on its own as it interacts with a complex environment in a continuous control problem. These parameters, also known as hyperparameters, have default values in AWS. We adjusted a few of them in this model to train the agent in a certain way.
The number of epochs has been changed from 10 to 4, the discount factor from 0.99 to 0.88, the loss type from mean squared error to Huber, and the number of episodes between each policy-updating iteration has been reduced from 20 to 18.
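As with the previous examples, these values are set in the AWS DeepRacer console rather than in the reward function. The summary below is only illustrative, with dictionary name and keys of our own choosing; all other hyperparameters keep their AWS defaults.
# Illustrative summary of the tuned hyperparameters for Example 4
example_4_hyperparameters = {
    'number_of_epochs': 4,                  # default: 10
    'discount_factor': 0.88,                # default: 0.99
    'loss_type': 'huber',                   # default: mean squared error
    'episodes_between_policy_updates': 18,  # default: 20
}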
Reward Function
import math

def reward_function(params):
    # Read input variables
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']

    # Initialize the reward with a typical value
    reward = 1.0

    # Calculate the direction of the centre line based on the closest waypoints
    next_point = waypoints[closest_waypoints[1]]
    prev_point = waypoints[closest_waypoints[0]]

    # Calculate the track direction in radians, arctan2(dy, dx)
    track_direction = math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0])
    # Convert to degrees
    track_direction = math.degrees(track_direction)

    # Calculate the difference between the track direction and the heading direction of the car
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Penalize the reward if the difference is too large
    DIRECTION_THRESHOLD = 10.0
    if direction_diff > DIRECTION_THRESHOLD:
        reward *= 0.5

    return float(reward)
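As a small illustration with hypothetical values: if the previous waypoint is at (0, 0) and the next at (1, 1), the track direction is 45 degrees. A car heading of 60 degrees gives a difference of 15 degrees, which exceeds the 10-degree threshold, so the reward is halved.
# Hypothetical sanity check of the heading-based reward
sample_params = {
    'waypoints': [(0.0, 0.0), (1.0, 1.0)],
    'closest_waypoints': [0, 1],
    'heading': 60.0,  # degrees
}
print(reward_function(sample_params))  # direction_diff = 15 > 10, so 1.0 * 0.5 = 0.5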
Reward graph
On the reward graph, we can see how the model is progressing. The model has been trained on the re:Invent 2018 track for 1 hour and 40 minutes. The graph below shows the percentage of track completed and the average percentage completion for evaluation and training.
Evaluation results
With the given reward function, hyperparameters, and training configuration, the agent completes the track in 11.679 seconds.
Recap
In this part of the series, we have presented four examples of reward functions. Each reward function used different parameters to achieve its goal. All of the reward functions helped the agent complete the three laps during the evaluation.
Conclusion
Across these three parts, we have reviewed AWS DeepRacer and how it can be used to learn machine learning in practice using cloud computing services. This article was presented as part of the Western Sydney University Open Day 2021. You can watch the presentation in the following video.
Acknowledgement
This article was prepared by students at AWS Academy@Western Sydney University.