What’s the hidden cost of a forecast accuracy metric? - Insights from the M5-competition
Johann ROBETTE
Beyond grand theory, I help companies ensure their Supply Chain actually drives value!
There are dozens of fancy forecasting metrics out there, and making the right choice seems pretty tricky. Each has its specificities, its pros and its cons, its defenders and its detractors… Does your choice really matter? Are there metrics to avoid? Conversely, are there metrics that deliver more value than others, and if so, how much?
Addendum
This article was updated on July 4, 2021 to reflect various reader comments that identified limitations and potential biases in the first version. Although some assumptions have been tuned, the conclusions remain unchanged. Special thanks to Trixie Tacung (L’Oréal) and Ivan Svetunkov (CMAF, Lancaster University).
In the previous episode…
This series of articles advocates the need for a new generation of forecasting metrics that focus on the business impact of forecasts rather than on their accuracy/precision alone.
In our previous article [1], we evaluated various forecasts through the quality of the business decisions they trigger.
To do so, we leveraged the “M5-competition” [2] dataset (which is based on Walmart’s data) and 74 forecasting methods from benchmarks and competitors. This enabled the simulation of more than 6.8 million replenishment decisions.
Such a large workbench enabled a fair comparison of each method’s costs (measured from the decisions it triggered) and various performance metrics (MAPE, wMAPE, sMAPE, MSLE, MAE, MSE, RMSE, WRMSSE, BIAS, DIao).
Interestingly, this test demonstrated that classical forecasting metrics are pretty poor at identifying the optimal method from a business perspective. Indeed, a method identified as “among the best” by a given metric might well be the worst from a business perspective.
Yet, not all metrics are created equal!
For example, the newly introduced DIao metric clearly outperforms any other metric when it comes to optimizing business decisions. Indeed, this metric does not focus on forecast error but rather considers the decisions triggered and their associated costs.
What’s in this article?
There’s a whole zoo of fancy metrics out there, and there is definitely no need to introduce new ones if the existing ones are “good enough” and if the added value is not clear.
In this fourth article, we describe and implement a test framework that answers the following questions:
Let’s share some experimental answers then!
Setting the test workbench
Dataset
In this experiment, we use the same “Walmart / M5-competition” dataset presented in our previous article. If you are interested in learning more, please refer to “The last will be first, and the first last… Insights from the M5-competition” [1].
Regarding the replenishment policy, we use the assumptions described in the above article, except for the initial inventory, which is set to zero. Since our goal is to assess the true performance of the metrics, we do not want existing inventories to bias our measurements. Each replenishment decision is therefore analyzed under the same “no initial inventory” assumption.
Correlation measures
We are here interested in measuring the correlation between various forecasting metrics and the cost of the decisions they trigger.
Correlation coefficients are used to measure the strength of the relationship between variables. More precisely, a correlation coefficient measures the linear relationship between two variables.
Yet, given the formulas of forecasting metrics (which involve weights, logarithms, squares, square roots, absolute values, and so forth), we cannot expect the relationship between a metric and the resulting costs to be linear.
Instead, we can expect monotonicity, in the form of “when the metric improves, the cost decreases”. Therefore, we apply Spearman’s rank-order correlation in this analysis.
Spearman’s rank-order correlation focuses not on the values themselves but on their ranks. When ranks are perfectly correlated (rho = 1), the variables evolve in the same direction. When ranks are perfectly inversely correlated (rho = -1), they evolve in opposite directions. In both cases, from our perspective, the metric’s evolution perfectly follows the evolution of the costs.
But as soon as the correlation coefficient differs from +1 or -1, the variables are no longer perfectly correlated. From a business point of view, this means that, although the forecasting metric improves, the quality of decisions may decrease… which increases costs instead of generating value!
What a shame to spend so much effort and resources on improving a forecast in a way that won’t serve your business!
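For readers who want to reproduce this kind of analysis, here is a minimal sketch of how such a rank correlation can be computed. The variable names and data are purely illustrative (one metric value and one simulated decision cost per forecasting method); scipy’s `spearmanr` takes care of the ranking.

```python
# Minimal sketch: Spearman rank correlation between a forecasting metric
# and the cost of the decisions it triggers (one value per forecasting method).
# The variable names and data below are illustrative, not the article's actual data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# Hypothetical example: 74 forecasting methods, each with a metric value
# (e.g., MAPE) and the total cost of the replenishment decisions it triggered.
metric_values = rng.uniform(0.10, 0.60, size=74)                      # e.g., MAPE per method
decision_costs = 1_000 * metric_values + rng.normal(0, 80, size=74)   # noisy, roughly monotone link

rho, p_value = spearmanr(metric_values, decision_costs)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho close to +1: improving (lowering) the metric reliably lowers costs.
# rho close to  0: the metric tells us little about the business impact.
```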
Performance metrics
Here, we focus on classical forecasting metrics: MAPE, wMAPE, sMAPE, MSLE, MAE, MSE, RMSE, WRMSSE and BIAS (the last one being a bias metric, the others accuracy metrics).
Of course, we add to this list the newly introduced “Decision Impact” metric. Among the three “Decision Impact” metrics (namely DIna, DIno and DIao), DIao is selected as it focuses on the cost of error.
For more details about DIao and the other two metrics, please refer to “Decision Impact”: 10 reasons to implement the new generation of business-oriented metrics [3].
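For reference, below is a minimal sketch of a few of the classical metrics listed above, in their textbook form. These are not the exact competition implementations (the M5’s WRMSSE in particular involves scaling and weighting schemes not reproduced here), and sMAPE exists in several variants; this is only one common convention.

```python
# Textbook versions of a few classical accuracy/bias metrics.
# Not the exact M5/WRMSSE implementations; shown only to fix the ideas.
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100        # undefined when actual == 0

def wmape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sum(np.abs(actual - forecast)) / np.sum(np.abs(actual)) * 100

def smape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast))) * 100

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

def bias(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(forecast - actual)                                  # sign convention varies across teams
```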
Aggregation levels
Most practitioners evaluate and communicate their performance metrics at specific computation levels. For example, some demand planners will select a single forecasting method and apply it to the entire (global) scope. Others will select separate methods for different sub-scopes, such as product categories. At the most granular end, some demand planners will select a forecasting method per Item/Store.
Wouldn’t it be interesting to evaluate the impact of a chosen aggregation level? Let’s then add various aggregation levels to this test bench.
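To make the notion of an aggregation level concrete, here is a hypothetical pandas sketch: the same error measure is computed once per node at three different levels (global, per category, per item/store). The column names follow the M5 file layout (item_id, store_id, cat_id), but the table and the helper function are assumptions for illustration only.

```python
# Illustrative sketch: computing one metric (wMAPE here) at several
# aggregation levels of a sales table. Column names follow the M5 layout
# (item_id, store_id, cat_id); the data and helper names are hypothetical.
import pandas as pd

def wmape(group: pd.DataFrame) -> float:
    return (group["actual"] - group["forecast"]).abs().sum() / group["actual"].abs().sum()

def metric_by_level(df: pd.DataFrame) -> dict:
    return {
        "global": wmape(df),                                             # 1 node
        "category": df.groupby("cat_id").apply(wmape),                   # 3 nodes in the M5 data
        "item_store": df.groupby(["item_id", "store_id"]).apply(wmape),  # ~30,490 nodes in the M5 data
    }
```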
Below are the 12 levels selected (the number in parentheses is the number of nodes at that level):
Analyzing correlations
With the computations done, the boxplot below displays the distribution of Spearman rank correlations (rho), averaged per aggregation level.
How to interpret this?
The table below gives some guidance on how to interpret Spearman’s rank correlation coefficient.
As measured and according to the above guidelines:
Interestingly, as shown in the graph below, the metrics’ correlation to costs tends to differ at high levels of aggregation. Unfortunately, replenishment decisions must be made at the Item/Store level, where every metric (with the exception of DIao) shows the same weak correlation with costs.
Based on this test, DIao definitely seems to be a very appropriate metric. On the other hand, among classical metrics, no metric stands out at the decision level.
Undoubtedly, correlation is essential from a scientific perspective. But what is the additional value generated by switching from one metric to DIao? Is this extra value large enough to legitimize such a change?
Let’s define another test to check this!
Analyzing effective costs
For each of the above metrics, let’s select the 5 best forecasts (out of the 74 available). Selecting several forecasts instead of one makes our conclusions more reliable, as it avoids classic pitfalls such as accidentally choosing a poorly performing forecast.
The costs of the selected forecasts are then calculated and averaged by metric to obtain the average cost that each metric triggers. These costs are displayed in the boxplot below.
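As a side note, a hypothetical sketch of that selection-and-averaging step could look like the snippet below, assuming a `results` table with one row per forecasting method, one column per metric (where lower values mean better performance) and a `cost` column holding the simulated decision cost. These names are illustrative, not the article’s actual code.

```python
# Hypothetical sketch: for each metric, pick the 5 best forecasts (lowest
# metric value) and average the decision costs they trigger.
import pandas as pd

def average_cost_of_top5(results: pd.DataFrame, metrics: list[str]) -> pd.Series:
    """results: one row per forecasting method; 'cost' = simulated decision cost."""
    avg_costs = {}
    for metric in metrics:
        top5 = results.nsmallest(5, metric)      # assumes lower metric value = better forecast
        avg_costs[metric] = top5["cost"].mean()
    return pd.Series(avg_costs, name="avg_cost_of_top5")

# Example usage (illustrative names):
# average_cost_of_top5(results, ["MAPE", "wMAPE", "RMSE", "DIao"])
```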
This graph shows that:
How much value-add does that represent?
Let’s put these results in perspective and compare the costs triggered by each metric to those triggered by DIao, for each level of aggregation.
Regardless of the level of aggregation, the DIao metric consistently identifies the best-fit forecasts for a given decision process.
At high aggregation levels, since metrics correlate better with costs, the additional savings are limited ($700 to $3,000). But as soon as metrics are computed at more granular levels, such as the Item/Store level, the savings increase dramatically, ranging from $1.9K to $9.3K.
Well… is it great or is it junk?
To better understand these numbers, let’s put them in context.
Below, we will focus on MAPE (because it is the most widely used metric) and on the Item/Store level (where replenishment decisions are made), thus on the $9.3K in savings.
Our forecast period encompasses $2.88M in sales. Walmart’s annual sales for 2020 were $559.15 billion [5]. This means that our analysis covers 0.00052% of Walmart’s annual revenue. In addition, Walmart’s annual gross profit for 2020 was $138.84 billion [5].
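(As a quick sanity check, that share is simply the ratio of the two figures above: $2.88M / $559.15B = 2.88 / 559,150 ≈ 0.0000052, i.e. roughly 0.00052% of annual revenue.)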
With these figures in mind, what is $9.3k in savings?
Conclusion
As we have shown, not all forecasting metrics are created equal!
Choosing the right metric for your business will greatly improve your performance! And conversely, using the wrong metric (like MAPE) can cost your company a lot of money.
Certainly, not all companies are the size of Walmart! However, a reduction in total costs of up to 35.2% has a real impact on any business. Even more so when you simply replace one metric with another.
While this first use case for “Decision Impact metrics” is great, there are many more use cases to share. We’ll look at each of them in future articles in this series.
Acknowledgements
Special thanks to Manuel Davy (Founder & CEO, Vekia), Stefan De Kok (Co-founder & CEO, Wahupa) and Hervé Lemai (CTO, Vekia).
This article aims to shed light on current practices, limitations and possible improvements of forecast performance measures. It is certainly not perfect and has its own limitations.
If you found this to be insightful, please share and comment… But also, feel free to challenge and criticize. Contact me if you want to discuss this further!
In all cases, stay tuned for the next articles! In the meantime, visit our website www.vekia.fr to learn more about our expertise and experience in delivering high value to Supply Chains.
Linkedin: www.dhirubhai.net/in/johann-robette/
Web: www.vekia.fr
References
[1] Vekia, Johann ROBETTE, The last will be first, and the first last — Insights from the M5-competition, 2021
[2] Kaggle, M5 competition website, 2020
[3] Vekia, Johann ROBETTE, “Decision Impact”: 10 reasons to implement the new generation of business-oriented metrics, 2021
[4] Gartner, Financial Plug Limits Forecast Accuracy, 2019
[5] Wall Street Journal, Walmart Inc., 2021
Hello, I have started reading your articles. Very innovative. However, I don't see in the article how you compute the "cost": is it the value of the inventory? The cost of stockouts?
Thank you Johann! Great work. Rabin, Gwladys Loukakou and Ronan Fruit: isn't it inspiring?
Professor at University of Skövde
Nice read! The complication is always to design a simple enough metric that does not require extensive and at times infeasible calculations/simulations. Any simplifications we make in the cost structure will tend to distance our metrics from the real business impact, but some do so gracefully. I think this is really where the difficulty is and more work is needed; I feel this needs to be a concerted effort from both academia and practice!
Having said that, I agree with your scepticism on accuracy metrics. We at times take the easy route and implicitly forecast for the sake of forecasting, rather than considering the impact of forecasts on decisions and measuring their performance on that. The complication is in how we do this impact measuring! With Juan Ramon Trapero Arenas and Devon K. Barrow we showed in a recent paper (https://doi.org/10.1016/j.ijpe.2019.107597) that for inventory management purposes you could do this by using simulation optimisation. Therein we also find that conventional accuracy metrics are weak, though for us bias seemed to be more relevant. We measured the impact on inventory performance metrics. If we were to measure otherwise, e.g., considering additional costs, WIP, profit, etc., we might observe a different strength of bias metrics to decisions.
Critiquing our own work: although the approach is flexible enough to account for these measures of impact, constructing an appropriate simulation can be challenging, and again simplifications will come into play. More work is needed on this. Your articles are very welcome to this end. But surely some metrics should be allowed to fade away in peace (sMAPE, I am looking at you), as they are weak both in statistical terms for measuring accuracy and also for the arguments made here by Johann ROBETTE.
With Ivan Svetunkov and Juan Ramon Trapero Arenas we have been working for a long time on a more detailed paper on the connection of metrics (and especially bias) to decisions. It is under review, but happy to share a working version if interested.
Great article!