Here is how an R-Learning agent beats humans
Riccardo Castellani
Managing Partner at ElioVitale&Co | Holding Company, Global Equity Investment
Human intuition is usually good at dealing with concepts like averages or mean values, whereas it often performs poorly when handling concepts like volatility and uncertainty of parameters. Managing the inventory of a company, redirecting fluxes within an energy grid, re-balancing an investment portfolio, or even landing a rocket re-entering from orbit are all activities we would probably approach through average parameters if not provided with better solutions. Unfortunately, mean values work well only in specific settings, such as those framed by the Central Limit Theorem. The difference between good and bad performance, or between gains and losses, often lies in how well volatility is handled rather than averages.
In that regard, this article aims to show something arguably remarkable. The reinforcement-learning algorithm used in another recent post, focused on managing the inventory of a car dealership, is here boosted through a hidden neural network. A slightly different application is carried out, and its performance is compared with human-like behavior. The results clearly show a consistent gap between the two. Because of the nature of the reinforcement-learning math deployed here, the training of the algorithm is executed through simple interaction with its environment, without the need to reference past examples or data. Because of that, the algorithm can find optimal behaviors possibly beyond the limitations of its trainer. That is what happens in this article and what will be shown below.
Brief recap: our problem is to handle the inventory of a US car dealer that has to place monthly orders to replenish its inventory and fulfill possible sales for the next month. It then experiences a reward in terms of profit, computed as follows: revenue from the sale of available cars, minus inventory & shipping costs, minus the cost of lost sales. If the inventory is too high, we incur high storage costs; if it is too low, we lose possible sales. Monthly sales are determined by a random draw from a normal distribution based on the statistics of the specific brand, the only information we assume a dealer would have when deciding on monthly orders. Here are the statistics of the three brands of cars we will consider in this post:
Figure above: parameters and statistics of the vehicles we will consider in this article
The reinforcement-learning algorithm is boosted by trying to replicate the technique we know as having been developed and successfully deployed by DeepMind / Google on the game of Go. That is a reinforcement-learning algorithm experiencing the reward of the actions it takes in specific states, provided with a neural network that boosts its ability to assign correct values to the state-action pairs; this is almost a requirement considering the possibly infinite state-action space it must handle. In the previous post we did not deploy that hidden neural network, and we gave the algorithm “visibility” of the environment through a matrix of 10 rows and 3 columns, one matrix for each brand of vehicle. We were therefore limiting its vision to 10 possible levels of inventory per brand (1 to 10 vehicles stored) and to 3 actions (order either 0, 1, or 2 vehicles per brand in any given month).
Replacing those matrices with neural networks (NN), we no longer have that limitation. In particular, while the level of inventory can now be any number (the number fed to the NN), the possible order size for new vehicles, while still limited, has a wider range, 1 to 10. While we could make the space of order sizes open-ended as well, we limited it to speed up the convergence of the solution, in part because of the limited computational capacity of our tools. We want to stress that, even though we are now leveraging a neural network, we still do not need past examples or data during training. The training is still executed by the reinforcement algorithm leveraging the reward it experiences while interacting with the environment. As per DeepMind's implementation, the deployed neural networks are two rather than one. That allows for better convergence during training by having the main NN gradually adjust its weights by referencing the second NN, which updates its parameters more rapidly.
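To make the mechanism above concrete, here is a minimal sketch of such a two-network value learner for our single-input inventory problem. It is written in Python with PyTorch purely for illustration: the layer sizes, learning rate, discount factor, and update rule are assumptions for the sketch, not a description of the exact model we trained.

```python
# Minimal sketch of the two-network Q-learning setup described above (illustrative only).
# The single input is the inventory level; the outputs are the estimated values of the
# 10 possible order sizes. All hyper-parameters below are placeholder assumptions.
import random
import torch
import torch.nn as nn

N_ACTIONS = 10          # order 1..10 vehicles
GAMMA = 0.9             # discount factor (assumed)

def make_qnet():
    # input: inventory level (one number) -> value of each possible order size
    return nn.Sequential(
        nn.Linear(1, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, N_ACTIONS),
    )

q_main = make_qnet()      # slowly-adjusted network used for decisions
q_second = make_qnet()    # second network, updated more rapidly
optimizer = torch.optim.Adam(q_second.parameters(), lr=1e-3)

def choose_order(inventory, epsilon=0.1):
    """Epsilon-greedy order size (1..10) given the current inventory level."""
    if random.random() < epsilon:
        return random.randint(1, N_ACTIONS)
    with torch.no_grad():
        q = q_main(torch.tensor([[float(inventory)]]))
    return int(q.argmax()) + 1     # actions are 1-based order sizes

def training_step(batch):
    """One gradient step on a batch of (inventory, order, profit, next_inventory)."""
    s, a, r, s2 = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    a = a.long() - 1
    q_pred = q_second(s.unsqueeze(1)).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_main(s2.unsqueeze(1)).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, r + GAMMA * q_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_networks(tau=0.05):
    """Move the main network gradually toward the faster-updating second network."""
    for p_main, p_fast in zip(q_main.parameters(), q_second.parameters()):
        p_main.data.mul_(1 - tau).add_(tau * p_fast.data)
```

In our setting, the batch would simply be filled with the (inventory, order, monthly profit, next inventory) tuples experienced while interacting with the simulated dealership, consistent with the point above that no past examples or data are needed.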
Important note: at the end of the version of this same article published on our website here, the interested reader can find an additional brief mathematical digression on our neural network. Having multiple neurons per layer while using only one parameter as input to the NN (the number of vehicles in inventory) is of little use, since the linear combination across a single layer collapses as if it were a single neuron. Conversely, having subsequent layers remains important because it captures the non-linearity: at the exit of each layer, whatever linear combination is obtained is then passed through a non-linear function, here the ReLU function. The reason we preferred to develop a more complete model, where each layer still carries multiple neurons, is that anything we do is meant to have a real application as a reference; we therefore wanted to develop something that can be applied to more complex real cases. More on this in the mathematical digression at the end.
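As a compact illustration of why the non-linearity is the essential ingredient (the notation below is ours, not taken from the digression): with a single scalar input x, stacking purely linear layers collapses into one affine map, while inserting the ReLU between layers prevents that collapse.

```latex
\[
\sum_{i=1}^{n} v_i \,(w_i x + b_i)
  \;=\; \Big(\textstyle\sum_i v_i w_i\Big)\, x \;+\; \sum_i v_i b_i
  \;=\; \tilde{w}\, x + \tilde{b}
  \qquad \text{(two stacked linear layers collapse to one affine map)}
\]
\[
\sum_{i=1}^{n} v_i \,\operatorname{ReLU}(w_i x + b_i)
  \qquad \text{is instead piecewise linear in } x\text{, with up to } n \text{ kinks.}
\]
```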
Results
Let us first preview some results of our application:
We should now briefly explain what we used as a reference for "human behavior". Rather than asking a supply-chain professional what action they would take in specific states, we built two simple algorithms.
We will see immediately below that, while simpler, the latter algorithm performs better than the former. That may be unexpected but, again, we are not good at thinking in terms of volatility.
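For clarity, here is a minimal sketch of the two heuristics in Python; the target values used in the article are not reproduced here, and the parameters below are placeholders to be set per brand.

```python
# Illustrative sketch of the two human-like ordering rules described above.
# The numeric targets are placeholders, not the article's actual figures.

def human_like_1(inventory, target_inventory):
    """Order just enough vehicles to bring the stock back to a target (average) inventory."""
    return max(0, target_inventory - inventory)

def human_like_2(inventory, average_monthly_sales):
    """Ignore the current stock and always order the brand's average monthly sales."""
    return average_monthly_sales

# Example: with 1 car in stock, a target inventory of 2 and average sales of 4:
# human_like_1(1, 2) -> 1        human_like_2(1, 4) -> 4
```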
For the sake of training, we ran 2,000 simulations of a process managing the inventory for 640 consecutive months (a length chosen because it was convenient to train the algorithm with 32 batches repeated 20 times: 32 x 20 = 640). During those 2,000 runs the algorithm adjusts its parameters while learning. The cumulative profit reported for each algorithm is then the average of about 200 simulations performed using the trained parameters identified through the previously executed 2,000 training iterations. That profit is determined by the monthly sales revenue reduced by the total costs, including storage & shipping costs (detailed right below) and the cost of lost sales whenever monthly sales exceed the carried inventory. Here are the details of the costs:
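As a minimal sketch of how a single simulated month works under these rules, here is an illustrative Python step; the margin and cost figures below are placeholders, not the actual values listed in the cost table.

```python
import numpy as np

# Illustrative monthly step for one brand; all prices/costs below are placeholders.
PRICE_MARGIN = 2_000      # profit per car sold (assumed)
STORAGE_COST = 300        # monthly storage & shipping cost per car in stock (assumed)
LOST_SALE_COST = 1_000    # penalty per sale missed for lack of inventory (assumed)

def simulate_month(inventory, order, mean_sales, std_sales, rng):
    """Return (profit, next_inventory) for one month of the dealership process."""
    demand = max(0, int(round(rng.normal(mean_sales, std_sales))))
    sold = min(inventory, demand)
    lost = demand - sold
    profit = sold * PRICE_MARGIN - inventory * STORAGE_COST - lost * LOST_SALE_COST
    next_inventory = inventory - sold + order
    return profit, next_inventory

rng = np.random.default_rng(0)
profit, inv = simulate_month(inventory=2, order=4, mean_sales=4, std_sales=2, rng=rng)
```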
We can now show the cumulative profits per brand obtained by, respectively, the reinforcement-learning algorithm, the first human-like behavior (always replenishing the average inventory), and the second human-like behavior (always ordering the same number of vehicles every month, equal to the average monthly sales):
Figure above: cumulative profits across 640 months of the three algorithms
The R-learning algorithm always beats the two human-like behaviors. Moreover, its margin over the others is directly related to the volatility of the sales of the specific brand (as a percentage of the average expected sales). It is interesting to note that always ordering the same number of vehicles, equal to the average monthly sales (human-like_2), does not perform too badly, and it performs better than the approach based on ordering just the number of cars needed to replenish the average inventory (human-like_1). Human-like_1 is indeed often caught by surprise with not enough inventory to cover spikes in sales. We can look at the two pictures immediately below, representing two simulations of the inventory over time for the two human-like approaches (based on brand b, with the times when the carried inventory is not enough shown in orange):
Figure above: human-like_1 inventory in time (aiming at replenishing average inventory)
Figure above: human-like_2 inventory in time (aiming at always placing orders equaling the average sales for the specific brand)
The human-like_1 algorithm is more often left with no vehicles in inventory (therefore often losing sales), maintaining an average inventory below about 0.5 cars. The human-like_2 algorithm performs better because it maintains an average inventory above about 1 vehicle. Because of the volatility, even though human-like_2 focuses on average sales rather than average inventory, it is the one that manages to guarantee a better level of inventory.
It is now interesting to understand how the R-learning algorithm manages to beat human-like_2. Having only one input to the neural network, the R-learning policy can be easily reverse-engineered, and we can find the rules the R-algorithm follows for the specific brand (b). Here they are:
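(As a side note on how such rules can be read off: with a single input, one can simply evaluate the trained network at each inventory level and take the order size with the highest estimated value. The snippet below assumes a network shaped like the earlier sketch; it is rebuilt untrained here only to keep the snippet self-contained.)

```python
import torch
import torch.nn as nn

# Stand-in for the trained main network from the earlier sketch (untrained here).
q_main = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                       nn.Linear(16, 16), nn.ReLU(),
                       nn.Linear(16, 10))

# Tabulate the learned policy: best order size for each inventory level 0..10.
with torch.no_grad():
    for inventory in range(0, 11):
        q_values = q_main(torch.tensor([[float(inventory)]]))
        best_order = int(q_values.argmax()) + 1   # actions are order sizes 1..10
        print(f"inventory {inventory:2d} -> order {best_order} vehicles")
```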
Here is how the level of inventory for the same brand (b) would appear in time when handled by the machine according to those rules:
Figure above: R-learning inventory in time (obtained by following the rules outlined above)
The R-algorithm maintains an average inventory of about 2 vehicles and is seldom left without cars; moreover, when the inventory does go into the orange zone, it does so to a lesser extent, meaning lost sales are lower. Because lost sales penalize the profit, the algorithm makes sure the inventory can cover possible sales and their volatility. At the same time, it is careful to position the inventory just at the needed level, without increasing it too much and wasting money on storage & shipping costs for unsold vehicles.
Given the normal distribution we draw sales from, with the standard deviation of brand (b) equal to 2 vehicles, volatility can push sales above the mean by up to 2 units with about 68%/2 probability, by up to 2x2 units with about 95%/2, and by up to 2x3 units with about 99.7%/2 probability (the positive half of the normal distribution, with the usual probabilities at 1, 2, and 3 standard deviations).
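These half-probabilities can be checked in a couple of lines with Python's standard library:

```python
from statistics import NormalDist

# Probability that monthly demand exceeds the mean by up to k standard deviations
# (the positive half of the normal distribution mentioned above).
z = NormalDist()  # standard normal
for k in (1, 2, 3):
    p = z.cdf(k) - 0.5
    print(f"within +{k} sd: {p:.1%}")   # ~34.1%, ~47.7%, ~49.9% (i.e. 68%/2, 95%/2, 99.7%/2)
```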
Let us now look at the brand (h), having average sales equal to 3 and a standard deviation equal to 3. Here is the inventory that human-like_2 would maintain in time:
The R-learning algorithm, instead, would determine the following inventory:
While it may seem that in this case R-learning would outperform human-like_2 even more, the opposite is true, coherently with the lower volatility as a percentage of the mean value: R-learning outperforms human-like_2 by a smaller measure, $62 M vs $50 M. While R-learning manages to lose only a few sales, it is forced to carry a higher inventory and higher storage & shipping costs. However, it is almost fascinating to note that the algorithm works hard to get those extra millions: believe it or not, to find the sweet spot where it still manages to make the difference compared to human-like_2, it follows these rules:
It may seem strange that those rules would maintain the constant inventory shown above, but this can be easily replicated in Excel: starting from an inventory equal to the average sales (3), subtracting numbers randomly drawn from a normal distribution with parameters (3, 3), and adding vehicles every month according to the rules above, we obtain a profile similar to the one shown above.
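The same replication takes a few lines in Python as well; the ordering rule below is only a placeholder standing in for the actual rules shown in the figure above.

```python
import numpy as np

def order_rule(inventory):
    """Placeholder for the learned ordering rule shown in the figure above
    (the actual thresholds are not reproduced here)."""
    return 4 if inventory < 3 else 2

rng = np.random.default_rng(1)
inventory = 3                   # start from the average sales of brand (h)
path = [inventory]
for month in range(640):
    demand = max(0, int(round(rng.normal(3, 3))))   # monthly sales ~ N(3, 3)
    inventory = max(0, inventory - demand)          # sell what the stock allows
    inventory += order_rule(inventory)              # apply the monthly ordering rule
    path.append(inventory)
# 'path' now traces an inventory profile analogous to the one plotted above.
```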
Finally, for brand (f), which has the lowest volatility, the R-algorithm finds exactly the same rule always adopted by human-like_2, which is to constantly order the brand's average sales (4); that is coherent with the low volatility, which makes it convenient to stick to the mean value. Inventories look pretty much the same for all three algorithms, as does the final profit, equal to $55 M in all three cases and shown in the initial table. Note that, in this case, human-like_1 also performs pretty well, since always referencing the average inventory leads to the same outcome.
Conclusion
The performance obtained by the R-learning algorithm should be evident by now, so we can highlight a couple of key points.
As anticipated above, the interested reader can find on our website, here, an additional concluding mathematical digression on the math of the hidden neural networks.
To conclude, please feel free to connect or get in touch with comments, proposals, and anything else that would allow us to connect and even develop collaborations: [email protected]
Riccardo
---
Credits for the images used in the customization of the main image:
https://unsplash.com/@elenapopova
https://unsplash.com/@oskark