How You Can Make Better Use of Accuracy Measures for Improved Forecasting Performance

For some time, I have been exploring ways of making better sense of the M forecasting competition rankings of various Methods. Starting with the yearly M3 data (645 series out of a total of 3003), I know that most of the better-known models each produce predictable trend forecast profiles. A simple exponential smoothing algorithm (SES) always produces a level (no trend) forecast pattern over the forecast horizon starting from a fixed time origin. Likewise, the Holt Method produces a straight-line forecast pattern, while the Holt-Winters Method produces a straight-line forecast pattern with a superimposed constant (additive or multiplicative) seasonal pattern. Of course, real data never look quite like those patterns, so we end up with forecast errors.

Forecast errors can be summarized and evaluated using a variety of forecast accuracy measures. Because no one accuracy measure can, in general, be proven superior in all cases, it is a best practice to use at least two such measures in your forecasting and demand planning toolkit for forecast performance evaluations. The M3 competition included the MAPE, sMAPE, MASE and MAE point forecast accuracy measures, where accuracy is measured point by point over the horizon.
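As a minimal sketch of how these point measures are computed over a holdout horizon (the function names and sample numbers below are my own, and the sMAPE variant shown is the usual symmetric-percentage form):

```python
import numpy as np

def smape(actuals, forecasts):
    """Symmetric MAPE (in %), averaged point by point over the horizon."""
    a, f = np.asarray(actuals, float), np.asarray(forecasts, float)
    return float(np.mean(200.0 * np.abs(a - f) / (np.abs(a) + np.abs(f))))

def mase(actuals, forecasts, training):
    """Mean Absolute Scaled Error: out-of-sample MAE scaled by the
    in-sample one-step naive (random-walk) MAE."""
    a = np.asarray(actuals, float)
    f = np.asarray(forecasts, float)
    y = np.asarray(training, float)
    naive_mae = np.mean(np.abs(np.diff(y)))   # in-sample Naive N1 errors
    return float(np.mean(np.abs(a - f)) / naive_mae)

# Hypothetical yearly series: training history plus an 8-period holdout
history  = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0]
actuals  = [136.0, 119.0, 104.0, 118.0, 115.0, 126.0, 141.0, 135.0]
forecast = [150.0] * 8                        # SES-style level forecast profile

print(smape(actuals, forecast), mase(actuals, forecast, history))
```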

But, forecast patterns or profiles are the name of the game in the M-competitions, much like practical lead-time forecasting of budgets, sales and operations plans in supply-chain organizations. For forecast profile accuracy measurement, I will introduce an objective measure not readily found in conventional operational forecasting toolkits. For lead-time forecasts, the horizon is a ‘frozen’ period during which no overrides or changes can be made, unless you create a brand-new forecast from a new forecast origin. So, forecasts are evaluated over the entire horizon, in practice.

In my data exploration below, I will introduce a Kullback-Leibler divergence measure D(a|f) to measure profile accuracy in contrast to the sMAPE and MASE point forecast accuracy measures used in the M3 competition.

To the practitioner, the rationale for my approach will become clearer as I go through the data analysis in steps. Because this is a data-driven exploration, it tends to reveal what should be done next; a conventional data-generating modeling approach, in most cases, tends instead to reveal only what could be done.

The M3 Forecast Competition Yearly Data

Of the 3003 time series in total, 645 are labeled Yearly (i.e., non-seasonal). Participants were given a training dataset, while an 8-period holdout sample was withheld for subsequent model evaluations. As part of the competition, participants submitted forecasts for those eight periods for performance evaluation. The results were published in the International Journal of Forecasting 16 (2000) 451–476, and the dataset has been freely available online for a couple of decades now. There were 24 Methods compared in the competition, but I will use four of the highest-ranking Methods to bring out the essence of my findings.

To the Gaussian Mindset, an Arithmetic Mean is Something You COULD Always Calculate, But SHOULD You Always?

Step 1. The sMAPE and MASE distributions are shown in box-and-whisker plots because they highlight whether a typical value of central tendency should be summarized with the arithmetic mean. Each distribution is highly skewed with numerous extreme values. Besides a number of other input data quality issues with the M3 data, this demonstrates an inadequate basis for comparing sMAPE and MASE accuracies with the arithmetic mean.

There is too much variation, skewness and extremeness in the underlying numbers to justify comparing central tendencies to two decimal places in the published tables. The other Methods, not surprisingly, have very similar distributions to these box plots. This distorts the meaning of the arithmetic mean as a typical summary for the purpose of ranking Methods. In this case, there is no credible basis for differentiating or ranking the Methods.

[Box-and-whisker plots of the sMAPE and MASE distributions for selected Methods]
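To see why the arithmetic mean is a poor "typical" summary here, the following sketch uses simulated, right-skewed accuracy values (an illustrative gamma mixture, not the M3 numbers) and compares the mean with the median and interquartile range that a box plot displays:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative sMAPE values: mostly moderate, with a handful of extreme outliers
smape_values = np.concatenate([rng.gamma(2.0, 8.0, 600), rng.gamma(2.0, 60.0, 45)])

q1, q3 = np.percentile(smape_values, [25, 75])
print(f"mean   = {smape_values.mean():7.2f}")      # pulled upward by the extremes
print(f"median = {np.median(smape_values):7.2f}")  # a more robust 'typical' value
print(f"IQR    = [{q1:.2f}, {q3:.2f}]")            # the box in a box-and-whisker plot
```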

Forecast Accuracy Alone May Not Be Adequate for Assessing the Performance of Lead-time Forecasts

Another approach taken by the M3 competition organizers is to calculate accuracy spreads (the difference between the accuracy measure and a corresponding naïve N2 benchmark accuracy). With non-seasonal data, Naive1 (N1) is the same as Naive2 (N2), so both produce level projections over the forecast horizon. If an accuracy spread is positive, then the Method is said to add value or contribute to forecasting effectiveness. The scatter plots show, however, that the accuracy spreads depend on the size of the benchmark accuracy and should not simply be averaged to get meaningful insights. There appears to be a positive relationship in the scatter, suggesting that accuracy spreads should be divided by the benchmark accuracy level.

[Scatter plots of accuracy spreads against the N1 benchmark accuracy]

Step 2. The effectiveness of a Method is assessed with the ratio of the Accuracy Spread (AS) to a benchmark accuracy measure. When we divide the accuracy spread by the benchmark accuracy, we get data that are less correlated with the size of the benchmark accuracy measure. This ratio turns out to be a proper skill score for sMAPE, MASE, and other measures. Using Naïve N1 as the benchmark, we can display the N1 sMAPE Skill Score and the N1 MASE Skill Score for a Method, like the ones shown below. From this, we can consider calculating a typical value for the effectiveness of the Method. For the COMB H-S-D Method, 65% of the N1 sMAPE accuracy spreads are positive (effective).

[N1 sMAPE and N1 MASE skill scores for the COMB H-S-D Method]
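Here is a minimal sketch of the Step 2 calculation in Python; the orientation of the accuracy spread (benchmark minus Method, so that a positive spread means the Method beats Naïve N1) and the per-series numbers are my own assumptions for illustration:

```python
import numpy as np

def n1_skill_score(method_accuracy, n1_accuracy):
    """Accuracy spread divided by the benchmark accuracy:
    (N1 accuracy - Method accuracy) / N1 accuracy.
    Positive values mean the Method adds value over Naive N1."""
    m = np.asarray(method_accuracy, float)
    b = np.asarray(n1_accuracy, float)
    return (b - m) / b

# Hypothetical per-series sMAPE values for a Method and for the N1 benchmark
method_smape = np.array([12.0, 30.0,  8.0, 55.0, 21.0])
n1_smape     = np.array([15.0, 25.0,  9.0, 80.0, 20.0])

scores = n1_skill_score(method_smape, n1_smape)
pct_effective = 100.0 * np.mean(scores > 0)   # share of series with a positive spread
print(scores, pct_effective)
```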

Step 3. Using information-theoretic concepts for a Profile Analysis, demand planners and forecast practitioners can assess accuracy measures and forecast performance evaluations of lead-time demand forecasts in a more objective manner than by using point forecast accuracy measures. This information-theoretic methodology was previously introduced in several of my articles: (Dec 31, 2020), (Jan 17, 2021), (Feb 18, 2021), (Apr 10, 2021), (May 5, 2021), and (May 30, 2021) on my LinkedIn Profile.

I now describe the performance for a forecasting Method that creates non-seasonal forecast profiles over a predetermined time horizon from a fixed origin. The first step is to encode or map a lead-time forecast into a Forecast Alphabet Profile (FAP), which has the same pattern as the forecast except that the data have been rescaled. In this step, the FAP values are created by dividing each value of the forecast profile by the lead-time total of the forecasts. Similarly, the Actual Alphabet Profile (AAP) is obtained by dividing each actual by the sum of the actuals over the predetermined horizon. This makes all profiles comparable for forecast performance evaluations and suitable for applying the Kullback-Leibler divergence measure D(a|f), described next.
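A small sketch of this encoding step, assuming an 8-period lead time and made-up numbers:

```python
import numpy as np

def alphabet_profile(values):
    """Rescale a lead-time profile so that its values sum to one over the horizon."""
    v = np.asarray(values, float)
    return v / v.sum()

# Hypothetical 8-period lead-time forecast (a Holt-type straight-line profile) and actuals
forecast = np.array([120.0, 125.0, 130.0, 135.0, 140.0, 145.0, 150.0, 155.0])
actuals  = np.array([118.0, 131.0, 124.0, 150.0, 138.0, 142.0, 160.0, 149.0])

fap = alphabet_profile(forecast)   # Forecast Alphabet Profile (FAP)
aap = alphabet_profile(actuals)    # Actual Alphabet Profile (AAP)
```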

A Profile Error (PE) is defined as a difference between the profile values of the AAP and the FAP, but not in the conventional sense of measuring accuracy by ‘Actual minus Forecast’ differences. Here, the PE is defined as:

[Equation image: definition of the Profile Error]

Profile Accuracy can be measured with a ‘distance’ measure between an FAP and an AAP, defined by the Kullback-Leibler divergence measure D(a|f):

D(a|f) = Σ a(i) ln [ a(i) / f(i) ], where the sum runs over the periods i of the lead-time horizon.

D(a|f) can be interpreted as a measure of ignorance or uncertainty about Profile Accuracy. When D(a|f) = 0, the alphabet profiles overlap, which is considered to be 100% accuracy. The divergence D(a|f) is always greater than or equal to zero, and equals zero if and only if a(i) = f(i) for all i; in other words, when the forecast profile is identical to the pattern of the actuals.
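A sketch of the divergence calculation, using the standard Kullback-Leibler form D(a|f) = Σ a(i) ln(a(i)/f(i)) over the horizon; the sample profiles are hypothetical:

```python
import numpy as np

def kl_divergence(aap, fap):
    """D(a|f) = sum_i a(i) * ln(a(i) / f(i)); zero only when the two profiles coincide."""
    a, f = np.asarray(aap, float), np.asarray(fap, float)
    mask = a > 0                      # terms with a(i) = 0 contribute nothing
    return float(np.sum(a[mask] * np.log(a[mask] / f[mask])))

# Hypothetical alphabet profiles (each sums to one over a six-period horizon)
aap = np.array([0.14, 0.18, 0.16, 0.20, 0.16, 0.16])
fap = np.array([0.15, 0.16, 0.17, 0.17, 0.17, 0.18])

d_af = kl_divergence(aap, fap)      # 0.0 would mean 100% profile accuracy
print(round(d_af, 4))
```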

The proper skill score associated with D(a|f) is called the Levenbach L-Skill Score, because the actuals and the forecasts are fractions summing to one, not probabilities.

We can display the relationship that an effective skill score has with the associated forecast accuracy measure. One would expect to see better (smaller) accuracy associated with a better (higher) skill score. That does not appear to be the case, in general, so accuracy measures do not adequately represent effective forecasting performance.

Furthermore, the charts show that the effectiveness of a Method also depends on the skill score used. So, we advise demand planners, managers and supply chain forecast practitioners to evaluate Methods with multiple skill scores rather than with a single measure. The bottom chart reinforces this practice, because the series will not all rank the same under different skill score performance measurements.

[Charts relating each Method's skill scores to the associated accuracy measures]

Establishing the Effectiveness Rating and Skill Score Rankings of a Forecasting Method

Step 4. There is nothing Gaussian about these histograms, which precludes us from using the arithmetic mean credibly. Moreover, the range of the data is restricted to between 0 and 1, so a logit transformation is suggested to place the numbers on a linear support (i.e., the real line) and establish a typical summary more reliably for ranking Methods. By untransforming (inverse logit) that typical summary, we create a typical skill score rating. Then, as an overall ranking, we create a weighted average of the skill score ratings with the effectiveness percentages as weights.
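A sketch of the Step 4 procedure; the use of the arithmetic mean on the logit scale as the "typical summary", and the sample skill scores, are my assumptions (only positive, effective scores strictly between 0 and 1 are included so the logit is defined):

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical positive (effective) skill scores for one Method, strictly inside (0, 1)
skill = np.array([0.42, 0.10, 0.71, 0.35, 0.55, 0.28, 0.63])

# Summarize on the real line (where an average behaves better), then map back to (0, 1)
typical_rating = inv_logit(np.mean(logit(skill)))
print(round(float(typical_rating), 2))
```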

[Histograms of the skill score distributions]

Summary Table with % Effective, Typical Skill Score for sMAPE and D(a|f) with the N1 Benchmark.


To calculate a Weighted Effectiveness Rating (WER) for COMB, for example,

WER = [(0.63 * 0.34) + (0.59 * 0.42)] / (0.63 + 0.59) = 0.38

If we need to include the MASE, MAPE or any other proper skill score, we simply add the corresponding terms to the WER formula.
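A sketch of the WER calculation using the COMB numbers above; extending it to MASE, MAPE, or another proper skill score just means appending more (effectiveness, rating) pairs:

```python
def weighted_effectiveness_rating(effectiveness, ratings):
    """WER: weighted average of typical skill-score ratings,
    using the per-measure effectiveness proportions as weights."""
    return sum(e * r for e, r in zip(effectiveness, ratings)) / sum(effectiveness)

# COMB example from the summary table: (effectiveness, typical rating) for sMAPE and D(a|f)
wer = weighted_effectiveness_rating([0.63, 0.59], [0.34, 0.42])
print(round(wer, 2))   # 0.38
```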

Final Note

1. Because this analysis represents a single forecasting/planning cycle, the practitioner should not declare a Triple Crown winner until several more lead-time forecasting cycles are run and a pattern of results emerges.

2. The OWA, used in the M4 Competition, turns out to be an unweighted WER.

Takeaways

  • Forecast accuracy alone may not be adequate for assessing the performance of lead-time forecasts.


  • I made Exploratory Data Analysis (EDA) a major theme in my books, especially Change & Chance Embraced: Achieving Agility with Smarter Forecasting in the Supply Chain.
  • What you may find, once you actually start looking at real data in quantity, is that just a few unusual values or outliers can have a big impact on how you should be making comparisons and summarizing results. I learned this early in my career from John W. Tukey (1923–2000), who looms large over the fields of data visualization and data science generally.

No alt text provided for this image

  • Tukey famously coined the term “bit” and invented the box-and-whisker plot. He developed Exploratory Data Analysis (EDA) as a best-practice tool for data scientists in his 1977 book Exploratory Data Analysis, and also directed the Princeton Robustness study (1970–1971).
  • The spreadsheet environment is perhaps still more flexible for data exploration than today's commercial forecasting and planning systems. One generally requires only simple operations and visualizations, for which a spreadsheet shines. Also, open-source languages like R and Python are modern, convenient tools for analyzing data and performing forecasting performance evaluations.
