Here is how an R-Learning agent beats humans
Riccardo Castellani
Managing Partner at ElioVitale&Co | Holding Company, Global Equity Investment
Human intuition is usually good at dealing with concepts like averages or mean values, whereas it often performs poorly when handling concepts like volatility and uncertainty of parameters. Managing the inventory of a company, redirecting fluxes within an energy grid, re-balancing an investment portfolio, or even landing a rocket re-entering from orbit are all activities we would probably approach through average parameters if not provided with better solutions. Unfortunately, mean values work well only in specific settings, such as those framed by the Central Limit Theorem. The difference between good and bad performance, or between gains and losses, often lies in how well volatility is handled rather than averages.
In that regard, this article aims to show something arguably remarkable. The reinforcement-learning algorithm used in another recent post, focused on managing the inventory of a car dealership, is here boosted through a hidden neural network. A slightly different application is carried out, and its performance is compared with human-like behavior. The results clearly show a consistent gap between the two. Because of the nature of the reinforcement-learning math deployed here, the training of the algorithm is executed through simple interaction with its environment, without the need to reference past examples or data. Because of that, the algorithm can find optimal behaviors possibly beyond the limitations of its trainer. That is what happens in this article and what will be shown below.
Brief recap: our problem is to handle the inventory of a US car dealer that has to place monthly orders to replenish its inventory and fulfill possible sales for the next month. It then experiences a reward in terms of profit, computed as follows: revenue from the sale of available cars, minus inventory & shipping costs, minus the cost of lost sales. If the inventory is too high, we incur high storage costs; if it is too low, we lose possible sales. Monthly sales are determined by a random draw from a normal distribution based on the statistics of the specific brand, the only information we assume a dealer would have when deciding on monthly orders. Here are the statistics of the three brands of cars we will consider in this post:
Figure above: parameters and statistics of the vehicles we will consider in this article
The reinforcement-learning algorithm is boosted by trying to replicate the technique we know as having been developed and successfully deployed by DeepMind / Google on the game of Go. That is a reinforcement-learning algorithm experiencing the reward of the actions it takes in specific states, provided with a neural network that boosts its ability to assign correct values to the state-action pairs; this is almost a requirement considering the possibly infinite state-action space it must handle. In the previous post we did not deploy that hidden neural network, and we gave the algorithm “visibility” of the environment through a matrix of 10 rows and 3 columns, one matrix for each brand of vehicle. We were therefore limiting its vision to 10 possible levels of inventory per brand (1 to 10 vehicles stored) and to 3 actions (order either 0, 1, or 2 vehicles per brand in any given month).
Replacing those matrices with neural networks (NN), we no longer have that limitation. In particular, while the level of inventory can now be any number (the number fed to the NN), the possible order size for new vehicles, while still limited, has a wider range, 1 to 10. While we could make the space of order sizes open-ended as well, we limited it to speed up the convergence of the solution, in part because of the limited computational capacity of our tools. We want to stress that, even though we are now leveraging a neural network, we still do not need past examples or data during training. The training is still executed by the reinforcement algorithm leveraging the reward it experiences while interacting with the environment. As per DeepMind's implementation, the deployed neural networks are two rather than one. That allows for better convergence during training by having the main NN gradually adjust its weights by referencing the second NN, which updates its parameters more rapidly.
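To make the mechanism above concrete, here is a minimal sketch of such a two-network value learner for our single-input inventory problem. It is written in Python with PyTorch purely for illustration: the layer sizes, learning rate, discount factor, and update rule are assumptions for the sketch, not a description of the exact model we trained.

```python
# Minimal sketch of the two-network Q-learning setup described above (illustrative only).
# The single input is the inventory level; the outputs are the estimated values of the
# 10 possible order sizes. All hyper-parameters below are placeholder assumptions.
import random
import torch
import torch.nn as nn

N_ACTIONS = 10          # order 1..10 vehicles
GAMMA = 0.9             # discount factor (assumed)

def make_qnet():
    # input: inventory level (one number) -> value of each possible order size
    return nn.Sequential(
        nn.Linear(1, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, N_ACTIONS),
    )

q_main = make_qnet()      # slowly-adjusted network used for decisions
q_second = make_qnet()    # second network, updated more rapidly
optimizer = torch.optim.Adam(q_second.parameters(), lr=1e-3)

def choose_order(inventory, epsilon=0.1):
    """Epsilon-greedy order size (1..10) given the current inventory level."""
    if random.random() < epsilon:
        return random.randint(1, N_ACTIONS)
    with torch.no_grad():
        q = q_main(torch.tensor([[float(inventory)]]))
    return int(q.argmax()) + 1     # actions are 1-based order sizes

def training_step(batch):
    """One gradient step on a batch of (inventory, order, profit, next_inventory)."""
    s, a, r, s2 = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    a = a.long() - 1
    q_pred = q_second(s.unsqueeze(1)).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_main(s2.unsqueeze(1)).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, r + GAMMA * q_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_networks(tau=0.05):
    """Move the main network gradually toward the faster-updating second network."""
    for p_main, p_fast in zip(q_main.parameters(), q_second.parameters()):
        p_main.data.mul_(1 - tau).add_(tau * p_fast.data)
```

In our setting, the batch would simply be filled with the (inventory, order, monthly profit, next inventory) tuples experienced while interacting with the simulated dealership, consistent with the point above that no past examples or data are needed.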
Important note: at the end of the version of this same article published on our website here, the interested reader can find an additional brief mathematical digression on our neural network. Having multiple neurons per layer while using only one parameter as input to the NN (the number of vehicles in inventory) is of little use, since the linear combination across a single layer collapses as if it were a single neuron. Conversely, having subsequent layers remains important because it captures the non-linearity: at the exit of each layer, whatever linear combination is obtained is then passed through a non-linear function, here the ReLU function. The reason we preferred to develop a more complete model, where each layer still carries multiple neurons, is that anything we do is meant to have a real application as a reference; we therefore wanted to develop something that can be applied to more complex real cases. More on this in the mathematical digression at the end.
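As a compact illustration of why the non-linearity is the essential ingredient (the notation below is ours, not taken from the digression): with a single scalar input x, stacking purely linear layers collapses into one affine map, while inserting the ReLU between layers prevents that collapse.

```latex
\[
\sum_{i=1}^{n} v_i \,(w_i x + b_i)
  \;=\; \Big(\textstyle\sum_i v_i w_i\Big)\, x \;+\; \sum_i v_i b_i
  \;=\; \tilde{w}\, x + \tilde{b}
  \qquad \text{(two stacked linear layers collapse to one affine map)}
\]
\[
\sum_{i=1}^{n} v_i \,\operatorname{ReLU}(w_i x + b_i)
  \qquad \text{is instead piecewise linear in } x\text{, with up to } n \text{ kinks.}
\]
```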
Results
Let us first preview some results of our application:
We should now briefly explain what we used as a reference for "human behavior". Rather than asking a supply-chain professional what action they would take in specific states, we built two simple algorithms.
We will see immediately below that, while simpler, the latter algorithm performs better than the former. That may be unexpected but, again, we are not good at thinking in terms of volatility.
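For clarity, here is a minimal sketch of the two heuristics in Python; the target values used in the article are not reproduced here, and the parameters below are placeholders to be set per brand.

```python
# Illustrative sketch of the two human-like ordering rules described above.
# The numeric targets are placeholders, not the article's actual figures.

def human_like_1(inventory, target_inventory):
    """Order just enough vehicles to bring the stock back to a target (average) inventory."""
    return max(0, target_inventory - inventory)

def human_like_2(inventory, average_monthly_sales):
    """Ignore the current stock and always order the brand's average monthly sales."""
    return average_monthly_sales

# Example: with 1 car in stock, a target inventory of 2 and average sales of 4:
# human_like_1(1, 2) -> 1        human_like_2(1, 4) -> 4
```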
For the sake of training, we ran 2,000 simulations of a process managing the inventory for 640 consecutive months (a length chosen because it was convenient to train the algorithm with 32 batches repeated 20 times: 32 x 20 = 640). During those 2,000 runs the algorithm adjusts its parameters while learning. The cumulative profit reported for each algorithm is then the average of about 200 simulations performed using the trained parameters identified through the previously executed 2,000 training iterations. That profit is determined by the monthly sales revenue reduced by the total costs, including storage & shipping costs (detailed right below) and the cost of lost sales whenever monthly sales exceed the carried inventory. Here are the details of the costs:
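As a minimal sketch of how a single simulated month works under these rules, here is an illustrative Python step; the margin and cost figures below are placeholders, not the actual values listed in the cost table.

```python
import numpy as np

# Illustrative monthly step for one brand; all prices/costs below are placeholders.
PRICE_MARGIN = 2_000      # profit per car sold (assumed)
STORAGE_COST = 300        # monthly storage & shipping cost per car in stock (assumed)
LOST_SALE_COST = 1_000    # penalty per sale missed for lack of inventory (assumed)

def simulate_month(inventory, order, mean_sales, std_sales, rng):
    """Return (profit, next_inventory) for one month of the dealership process."""
    demand = max(0, int(round(rng.normal(mean_sales, std_sales))))
    sold = min(inventory, demand)
    lost = demand - sold
    profit = sold * PRICE_MARGIN - inventory * STORAGE_COST - lost * LOST_SALE_COST
    next_inventory = inventory - sold + order
    return profit, next_inventory

rng = np.random.default_rng(0)
profit, inv = simulate_month(inventory=2, order=4, mean_sales=4, std_sales=2, rng=rng)
```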
We can now show the cumulative profits per brand obtained by, respectively, the reinforcement-learning algorithm, the first human-like behavior (always replenishing the average inventory), and the second human-like behavior (always ordering the same number of vehicles every month, equal to the average monthly sales):
Figure above: cumulative profits across 640 months of the three algorithms
The R-learning algorithm always beats the two human-like behaviors. Moreover, its margin over the others is directly related to the volatility of the sales of the specific brand (as a percentage of the average expected sales). It is interesting to note that always ordering the same number of vehicles, equal to the average monthly sales (human-like_2), does not perform too badly, and it performs better than the approach based on ordering just the number of cars needed to replenish the average inventory (human-like_1). Human-like_1 is indeed often caught by surprise with not enough inventory to cover spikes in sales. We can look at the two pictures immediately below, representing two simulations of the inventory over time for the two human-like approaches (based on brand b, with the times when the carried inventory is not enough shown in orange):
Figure above: human-like_1 inventory in time (aiming at replenishing average inventory)
Figure above: human-like_2 inventory in time (aiming at always placing orders equaling the average sales for the specific brand)
The human-like_1 algorithm is more often left with no vehicles in inventory (therefore often losing sales), maintaining an average inventory below about 0.5 cars. The human-like_2 algorithm performs better because it maintains an average inventory above about 1 vehicle. Because of the volatility, even though human-like_2 focuses on average sales rather than average inventory, it is the one that manages to guarantee a better level of inventory.
It is now interesting to understand how the R-learning algorithm manages to beat human-like_2. Having only one input to the neural network, the R-learning policy can be easily reverse-engineered, and we can find the rules the R-algorithm follows for the specific brand (b). Here they are:
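(As a side note on how such rules can be read off: with a single input, one can simply evaluate the trained network at each inventory level and take the order size with the highest estimated value. The snippet below assumes a network shaped like the earlier sketch; it is rebuilt untrained here only to keep the snippet self-contained.)

```python
import torch
import torch.nn as nn

# Stand-in for the trained main network from the earlier sketch (untrained here).
q_main = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                       nn.Linear(16, 16), nn.ReLU(),
                       nn.Linear(16, 10))

# Tabulate the learned policy: best order size for each inventory level 0..10.
with torch.no_grad():
    for inventory in range(0, 11):
        q_values = q_main(torch.tensor([[float(inventory)]]))
        best_order = int(q_values.argmax()) + 1   # actions are order sizes 1..10
        print(f"inventory {inventory:2d} -> order {best_order} vehicles")
```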
Here is how the level of inventory for the same brand (b) would appear in time when handled by the machine according to those rules:
Figure above: R-learning inventory in time (obtained by following the rules outlined above)
The R-algorithm maintains an average inventory of about 2 vehicles and is seldom left without cars; moreover, when the inventory does go into the orange zone, it does so to a lesser extent, meaning lost sales are lower. Because lost sales penalize the profit, the algorithm makes sure the inventory can cover possible sales and their volatility. At the same time, it is careful to position the inventory just at the needed level, without increasing it too much and wasting money on storage & shipping costs for unsold vehicles.
Given the normal distribution we draw sales from, with the standard deviation of brand (b) equal to 2 vehicles, volatility can push sales above the mean by up to 2 units with about 68%/2 probability, by up to 2x2 units with about 95%/2, and by up to 2x3 units with about 99.7%/2 probability (the positive half of the normal distribution, with the usual probabilities at 1, 2, and 3 standard deviations).
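These half-probabilities can be checked in a couple of lines with Python's standard library:

```python
from statistics import NormalDist

# Probability that monthly demand exceeds the mean by up to k standard deviations
# (the positive half of the normal distribution mentioned above).
z = NormalDist()  # standard normal
for k in (1, 2, 3):
    p = z.cdf(k) - 0.5
    print(f"within +{k} sd: {p:.1%}")   # ~34.1%, ~47.7%, ~49.9% (i.e. 68%/2, 95%/2, 99.7%/2)
```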
Let us now look at the brand (h), having average sales equal to 3 and a standard deviation equal to 3. Here is the inventory that human-like_2 would maintain in time:
The R-learning algorithm, instead, would determine the following inventory:
While it may seem that in this case R-learning would outperform human-like_2 even more, the opposite is true, coherently with the lower volatility as a percentage of the mean value: R-learning outperforms human-like_2 by a smaller measure, $62 M vs $50 M. While R-learning manages to lose only a few sales, it is forced to carry a higher inventory and higher storage & shipping costs. However, it is almost fascinating to note that the algorithm works hard to get those extra millions: believe it or not, to find the sweet spot where it still manages to make the difference compared to human-like_2, it follows these rules:
It may seem strange that those rules would maintain the constant inventory shown above, but this can be easily replicated in Excel: starting from an inventory equal to the average sales (3), subtracting numbers randomly drawn from a normal distribution with parameters (3, 3), and adding vehicles every month according to the rules above, we obtain a profile similar to the one shown above.
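The same replication takes a few lines in Python as well; the ordering rule below is only a placeholder standing in for the actual rules shown in the figure above.

```python
import numpy as np

def order_rule(inventory):
    """Placeholder for the learned ordering rule shown in the figure above
    (the actual thresholds are not reproduced here)."""
    return 4 if inventory < 3 else 2

rng = np.random.default_rng(1)
inventory = 3                   # start from the average sales of brand (h)
path = [inventory]
for month in range(640):
    demand = max(0, int(round(rng.normal(3, 3))))   # monthly sales ~ N(3, 3)
    inventory = max(0, inventory - demand)          # sell what the stock allows
    inventory += order_rule(inventory)              # apply the monthly ordering rule
    path.append(inventory)
# 'path' now traces an inventory profile analogous to the one plotted above.
```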
Finally, for brand (f), which has the lowest volatility, the R-algorithm finds exactly the same rule always adopted by human-like_2, which is to constantly order the brand's average sales (4); that is coherent with the low volatility, which makes it convenient to stick to the mean value. Inventories look pretty much the same for all three algorithms, as does the final profit, equal to $55 M in all three cases and shown in the initial table. Note that, in this case, human-like_1 also performs pretty well, since always referencing the average inventory leads to the same outcome.
Conclusion
The performance obtained by the R-learning algorithm should be evident by now, so we can highlight a couple of key points.
As anticipated above, the interested reader can find on our website, here, an additional concluding mathematical digression on the math of the hidden neural networks.
To conclude, please feel free to connect or get in touch with comments, proposals, and anything else that would allow us to connect and even develop collaborations: [email protected]
Riccardo
---
Credits for the images used in the customization of the main image:
https://unsplash.com/@elenapopova
https://unsplash.com/@oskark