A Deep Dive into Finding Best Performing Methods for Lead-time Demand Planning and Forecasting


Most business (macro) and demand (micro) planners and forecast practitioners know there are many different kinds of forecasting techniques available, ranging from elementary methods and more complex ARIMA time series models to econometric models and newer computer-intensive statistical learning/machine learning (SL/ML) approaches, recently popularized in the M5 forecasting competition. In the supply chain, e-commerce forecasts are needed when:

· planning product mix based on future patterns of retail demand at the item, product group, and store level over a planning horizon

· setting safety stock levels for SKUs at multiple locations for inventory planning

· conducting S&OP and annual budget planning meetings for demand planning

When participating in the M3 Competition some thirty years ago, I was prompted to submit a modelling approach that could achieve high forecasting accuracy and best performance for lead-time demand forecasts in a 'big data' setting. The technology available at the time (a 10 MHz 8088 on an IBM PC-AT with an optional math co-processor chip and a 20 MB hard disk drive included as standard) did not allow for much data analysis. So, now I would like to explore the M3 competition data with exploratory data analysis (EDA) tools to get better insights into the effectiveness of the various methods and models.

I have become somewhat skeptical of the published M3-competition results, as my preliminary explorations revealed significant input data quality issues, such as negative forecasts, clear outliers, and unnoted data anomalies that can affect forecasting performance results with unintended consequences. The M3 competition's 'big dataset' comprises 3003 business time series. From these, I selected the 1428 monthly series and proceeded with a careful data cleansing exercise.

Why Use Exploratory Data Analysis (EDA)?


The most widely used accuracy measure in demand planning is the Mean Absolute Percentage Error (MAPE). While it is the King of the APEs, the MAPE has to be used with caution because a zero actual (e.g., intermittent demand) produces an indeterminate result (i.e., an infinitely large APE). For performance measurement purposes, however, the key objective is to determine a typical value of central tendency. In this case, as for other accuracy measures, a smarter option is to use M-estimation to create a Typical APE (TAPE) measure instead of using an arithmetic mean to average results (the Gaussian mindset). To start with, the median APE (MdAPE) is also a typical measure of central tendency and performs better than the arithmetic mean.

The Gaussian mindset: An arithmetic mean is something you COULD always calculate, but SHOULD you always?
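To see why the choice of central tendency matters, here is a minimal sketch in base R; the APE values are made up for illustration. A single extreme APE pulls the arithmetic mean far away from the bulk of the errors, while the median barely moves. A trimmed mean is shown only as one simple robust alternative, not as the TAPE M-estimator itself.

# Hypothetical APEs (in %) for ten forecasts; one intermittent-demand period
# produces a huge APE that would dominate a simple average.
ape <- c(8, 12, 9, 11, 7, 10, 13, 9, 12, 450)

mean(ape)    # arithmetic mean APE (the "Gaussian mindset"): about 54%
median(ape)  # median APE (MdAPE): about 10.5%, a far more typical value

# A crude trimmed mean is one simple robust alternative to full M-estimation.
mean(ape, trim = 0.1)  # drops the most extreme 10% in each tail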


What you may find, once you actually start looking at data, is that just a few unusual values or outliers can have a big impact on how you should be making comparisons and summarizing results. I learned this early in my career from John W. Tukey (1923-2000), who looms large over the field of data visualization and data science generally. He famously coined the term "bit" and invented the box plot. Tukey developed EDA best practices in his 1977 book Exploratory Data Analysis and also directed the Princeton Robustness Study (1970-1971).

Other accuracy measures based on the arithmetic mean, like sMAPE, MASE, sMASE, and MAE, can be misinterpreted for the same reasons. When forecasting with a moving origin, overrides and management adjustments are commonly made over the horizon period, and it may then be a best practice to use MSE, MAE, MAPE, sMAPE, MASE, and variations thereof for performance analysis; but this may not be appropriate for lead-time demand forecasts.

A Profile Analysis for Forecast Model Evaluations

When a lead-time forecast is created, it is defined as a multi-step-ahead forecast from a fixed origin over a predetermined time horizon. The M-competition forecasts are lead-time forecasts in that sense. In practice, the lead time is regarded as a frozen period in which no changes or overrides are made. Hence, a lead-time forecast can be viewed as a one-step-ahead forecast of a historical pattern or profile. It is a questionable practice to identify a "best method" without repeating lead-time forecasting cycles with holdout samples, as a best practice would recommend. For practical reasons, a best practice should require multiple, follow-up lead-time forecasts to be created with identical time horizons. Though possible, that has not been the practice in the M-competitions.

Using information-theoretic concepts for profile analysis, demand planners and forecast practitioners can assess accuracy measures and forecast performance evaluations of lead-time demand forecasts in a more objective manner.

This information-theoretic methodology was previously introduced in several of my articles on my LinkedIn profile: (1, Dec 2020), (2, Jan 2021), (3, Feb 2021), and (4, March 2021). The performance of an e-commerce forecasting process that creates trend/seasonal forecast profiles is outlined next. The first step is to encode or map a lead-time forecast into a Forecast Alphabet Profile (FAP), which has the same pattern as the forecast except that the data have been rescaled (see the spreadsheet example below). In this step, the FAP values are created by dividing each component of the forecast profile by the lead-time total, i.e., the sum of the forecasts over the horizon. Likewise, the Actual Alphabet Profile (AAP) is obtained by dividing each actual by the sum of the actuals over the predetermined horizon. This makes all profiles comparable for forecast performance evaluations.
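As a concrete illustration, here is a minimal base R sketch of the rescaling step, using made-up actuals and forecasts over a short six-period horizon (the M3 evaluation horizon is 18 months):

# Hypothetical actuals and lead-time forecasts over a 6-period horizon
actuals  <- c(120, 135, 150, 160, 145, 130)
forecast <- c(115, 140, 155, 150, 150, 125)

# Alphabet profiles: divide each period by the lead-time total,
# so both profiles sum to 1 and become directly comparable.
aap <- actuals  / sum(actuals)   # Actual Alphabet Profile
fap <- forecast / sum(forecast)  # Forecast Alphabet Profile

round(aap, 3); round(fap, 3)
sum(aap); sum(fap)               # both equal 1 by construction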

A Profile Error is defined as a 'difference' between the profile of the actuals and a forecast profile, but not in the conventional sense of measuring accuracy by 'Actual minus Forecast' differences. Here, the difference between the Actual and Forecast alphabet profiles is measured period by period on the rescaled profile values.


Accuracy can then be measured with a 'distance' measure between a Forecast Alphabet Profile (FAP) and an Actual Alphabet Profile (AAP). A forecast Profile Miss (PMISS) is given by a sum of these period-by-period profile differences over the lead-time horizon.


A forecast Profile Accuracy is defined by the Kullback-Leibler divergence measure D(a|f). This sum can be interpreted as a measure of ignorance or uncertainty about Profile Accuracy. When D(a|f) = 0, the alphabet profiles overlap, which is considered to be 100% accuracy.
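Written out, the standard form of the Kullback-Leibler divergence between the actual and forecast alphabet profiles over an h-period lead time is (natural logarithms assumed here):

D(a\,\|\,f) \;=\; \sum_{i=1}^{h} a_i \,\ln\!\left(\frac{a_i}{f_i}\right)

where a_i and f_i are the AAP and FAP values for period i of the lead time.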

The profile accuracy D(a|f) is always greater than or equal to zero, and it equals zero if and only if a_i = f_i for all i; in other words, when the forecast profile is identical to the profile of the actuals.
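A minimal base R sketch of the profile accuracy calculation, reusing the made-up numbers from the rescaling example above (any zero actuals or forecasts would need separate handling before taking logarithms):

# Hypothetical actuals and lead-time forecasts over a 6-period horizon
actuals  <- c(120, 135, 150, 160, 145, 130)
forecast <- c(115, 140, 155, 150, 150, 125)

aap <- actuals  / sum(actuals)    # Actual Alphabet Profile
fap <- forecast / sum(forecast)   # Forecast Alphabet Profile

# Kullback-Leibler divergence D(a|f): zero only when the profiles coincide
d_af <- sum(aap * log(aap / fap))
d_af

# A forecast proportional to the actuals has an identical profile, so D = 0
sum(aap * log(aap / (2 * actuals / sum(2 * actuals))))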

For an initial exploration of trend/seasonal profiles, I have selected the 1428 monthly series from the M3-competition data along with seven of the 24 Methods that are most commonly used and easily reproducible, so that additional forecasting cycles can be analyzed with the data. These M3 Methods were designed to produce level, trend, or trend/seasonal profiles over a forecast time horizon.

In addition, I am including three modified methods to make the analysis more objective and more readily reproducible (a minimal R sketch of these three profile generators follows the list):


1. Naïve-1 has a level forecast profile: the latest actual is repeated as the forecast over the 18-month holdout horizon. Unlike Naïve-1, Naïve-2 has been re-seasonalized with factors that are not readily reproducible from the documentation.

2. SES is a simple exponential smoothing model that also has a level profile. M3 SINGLE is a seasonalized SES that is not readily reproducible from the documentation.

3. Holt (2) is the algorithm with a straight-line forecast profile as defined by the original Holt Method. M3 HOLT is a seasonalized Holt Method.
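As a sketch of how these three profile generators can be reproduced, here is a minimal base R example on a made-up monthly series. It uses stats::HoltWinters with the seasonal component (and, for SES, the trend component) switched off, so the smoothing parameters are estimated from the data rather than fixed to the original M3 settings.

set.seed(42)
# Hypothetical monthly history (48 observations) with trend plus noise
y <- ts(100 + 0.8 * (1:48) + rnorm(48, sd = 5), frequency = 12)
h <- 18                                    # 18-month lead-time horizon

# 1. Naive-1: latest actual repeated over the horizon (level profile)
naive1 <- rep(tail(as.numeric(y), 1), h)

# 2. SES: simple exponential smoothing (level profile)
ses_fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)
ses_fc  <- as.numeric(predict(ses_fit, n.ahead = h))

# 3. Holt: linear-trend smoothing (straight-line profile)
holt_fit <- HoltWinters(y, gamma = FALSE)
holt_fc  <- as.numeric(predict(holt_fit, n.ahead = h))

round(cbind(naive1, ses_fc, holt_fc), 1)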

Point Forecast Accuracy May Not Be Adequate for Assessing the Performance of Forecast Profiles

While striving for the best method may not be feasible or seen as a best practice, it is not uncommon for lead-time forecasting to be misinterpreted as a multiple, one-step-ahead point forecasting approach created with moving or rolling origins. This is not the same as a one-step-ahead lead-time profile forecast from a fixed origin over a predetermined time horizon. With point forecasting and a moving origin, overrides and management adjustments are commonly made over the horizon period, and it is then a best practice to use MSE, MAE, MAPE, sMAPE, MASE, and variations thereof for performance analysis; but this may not be appropriate for lead-time demand forecasts.

Identifying Effective Methods for Lead-time Forecasting

We need to recognize that there can be no best forecasting method, only useful ones, as uncertainty always enters the fray as a certain factor driving demand. A relevant quote attributed to the world-renowned statistician George E. P. Box (1919-2013) is worth remembering: "All Models Are Wrong, Some Are Useful".


For purposes of this data-analytic exploration, I have paraphrased it to say "All Data Are Wrong, Some Are Useful", which applies to the lead-time forecasts in the M-competitions as well.

Using a proper skill score, we can rank every series forecasted by a Method with the D(a|f) accuracy measure for forecast profiles. I define a Levenbach L-Skill score as 1 - [D(a|Method)/D(a|Benchmark)]. The L-Skill score ranges from minus infinity to +1. The Naïve-1 and Naïve-2 benchmarks have L-Skill scores of 0, by definition.

Positive L-Skill scores are associated with an effective profile forecasting Method. The Methods with the highest percentage of positively contributing profile forecasts should be of interest.
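Written as a small helper in base R, the definition reads as follows (a sketch; the D values below are hypothetical, not taken from the M3 results):

# L-Skill score: 1 - D(a|Method) / D(a|Benchmark)
# Positive values mean the Method beats the benchmark on profile accuracy.
l_skill <- function(d_method, d_benchmark) 1 - d_method / d_benchmark

l_skill(d_method = 0.004, d_benchmark = 0.020)   #  0.80: effective
l_skill(d_method = 0.020, d_benchmark = 0.020)   #  0.00: no better than benchmark
l_skill(d_method = 0.050, d_benchmark = 0.020)   # -1.50: worse than benchmark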

The first two columns in the table below show the results for seven M3 Methods using Naïve-1 (N1) and Naïve-2 (N2) as benchmark methods. Thus, Naïve-1, as a Method, produces better forecasts than the Naïve-2 benchmark Method for only 16% of the series. Likewise, the Naïve-2 Method is more effective than the Naïve-1 benchmark Method for 49% of the series. For the other Methods, the percentages are comparable, but note that the series calculated with sMAPE 'accuracy difference' scores do not have identical distributions between benchmark methods (see the scatter diagram below).

[Table: percentage of the 1428 monthly series for which each of the seven M3 Methods has a positive skill score against the Naïve-1 and Naïve-2 benchmarks]

How to Associate Effective Methods with Skill Scores

The M3 competition organizers calculated a spread between accuracy measures as a means of ranking Methods. In particular, the spread sMAPE(Naïve-2 Benchmark) minus sMAPE(selected Method) was calculated and averaged over all 1428 series to rank Methods with a simple average. The box-and-whisker plots in the left diagram show the distribution of the 1428 'sMAPE accuracy differences' for the THETA Method with the two benchmark methods.

[Figure: box-and-whisker plots of the sMAPE accuracy spreads for the THETA Method under the Naïve-1 and Naïve-2 benchmarks (left) and a scatter plot comparing the spreads (right)]

The series with positive spreads contribute to the effectiveness of a Method. The simple averages of the sMAPE accuracy spreads are 3 and 4, respectively (left box-and-whisker plots), and the median of the sMAPE accuracy spread is 2 for each benchmark. However, with the very skewed distribution of the underlying numbers, I have reservations that differences in arithmetic means calculated from these data are very meaningful for ranking Methods. As seen in the right scatter diagram above, the series identified with positive accuracy spreads are not the same for each benchmark method, either. Thus, many series can be effective contributors (positive spreads) when using one benchmark method but not with another (contrast quadrants I and III in the scatter plot).

[Figure: scatter plots of the sMAPE accuracy spreads versus benchmark sMAPE accuracy (left) and of the spreads under the two benchmarks against each other (right)]

Note, in the left scatter diagram above, that the 'spreads in accuracy' are a function of the benchmark method used. Like a regression scatter plot, the diagram shows the 'sMAPE(Naïve-2) minus sMAPE(THETA Method)' spreads versus the sMAPE(Naïve-2) point accuracy for the M3 monthly series with the 18-period holdout sample. So, simple averaging does not tell the right story. Rather than using spreads in accuracy measures, we should be using ratios instead. The right scatter diagram shows the relationship between the sMAPE accuracy spreads with the two benchmarks for the entire 1428-series monthly dataset. They are not closely related. This leads to using 'proper' skill scores for performance analysis.

We can get further insights into effective Methods with proper skill scores. The sMAPE Skill score is given by [1 - sMAPE(selected Method)/sMAPE(benchmark)], whose numerator is the same as the accuracy spread [sMAPE(benchmark) - sMAPE(selected Method)]. Using the Naïve-1 and Naïve-2 benchmarks for comparison, we find a 72% effectiveness rating for the THETA Method in this single, static planning-cycle evaluation. The benchmark matters if you need to make comparisons. In a comparison using the L-Skill score, the percent effective is about the same (68-69%), although the scatter diagram suggests that the two scores are not necessarily closely related. For comparison, I have added the MAPE Skill score and the RAE Skill score to the table.

[Table: percentage of series with positive L-Skill, sMAPE, MAPE, and RAE Skill scores under the Naïve-1 and Naïve-2 benchmarks]
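For reference, here is a minimal base R sketch of the sMAPE Skill score calculation on made-up numbers; it uses the symmetric MAPE variant with the sum of actual and forecast in the denominator, and the skill score is then the ratio form rather than the spread.

# Symmetric MAPE (in %): 200 * |A - F| / (|A| + |F|), averaged over the horizon
smape <- function(actual, forecast) {
  mean(200 * abs(actual - forecast) / (abs(actual) + abs(forecast)))
}

# Hypothetical 6-period actuals and two competing lead-time forecasts
actual    <- c(120, 135, 150, 160, 145, 130)
method_fc <- c(118, 138, 149, 158, 148, 128)   # a candidate Method
bench_fc  <- rep(130, 6)                       # a Naive-style level benchmark

smape_method <- smape(actual, method_fc)
smape_bench  <- smape(actual, bench_fc)

# sMAPE Skill score: positive when the Method is more accurate than the benchmark
1 - smape_method / smape_bench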

In an earlier data exploration, only three of the 24 Methods were used on the 1428 monthly time series with the 18-month lead-time holdout samples. Method A was shown to be effective for about two-thirds of the time series, as were Methods B and C (not used in the table above). Getting experience with dynamic lead-time forecasting cycles is essential for a demand planner to determine whether changes in effectiveness ('doing the right things') rankings are important in a particular context.

What Does Data Exploration Reveal?

With experience, we can get smarter at the task and more agile in the forecasting process when we monitor data quality throughout, i.e., both before and after using a lead-time forecast. In an article posted on LinkedIn and Delphus.com, I demonstrate the importance of identifying and correcting even a single outlier in a highly seasonal, readily forecastable series (M3 series N2796). This was not the only instance in which I found consequential problems with isolated outliers.
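As a simple sketch of the kind of screening involved (the series below is made up, not N2796), Tukey's boxplot fences in base R flag a single aberrant value quickly; for strongly seasonal data the fences are usually applied to seasonally adjusted values or residuals rather than the raw series.

set.seed(7)
# Hypothetical monthly series with a repeating seasonal pattern and one bad value
x <- rep(c(80, 90, 120, 200, 150, 100), 8) + rnorm(48, sd = 5)
x[25] <- 650   # a single recording error

# Tukey fences: values beyond 1.5 * IQR outside the quartiles are flagged
q   <- quantile(x, c(0.25, 0.75))
iqr <- diff(q)
which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)   # returns 25

boxplot.stats(x)$out   # the same outlier, via Tukey's boxplot statistics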

Paraphrasing Ed Deming, I find that "bad data beats a good forecast every time".

N1906 is an M3 series with an L-Skill score of 0.99, the highest encountered, for 19 of the 24 methods. As you can see in the table below, series N1906 is also ranked high, if not highest, by Methods using Skill scores based on more familiar accuracy measures.

[Table: skill-score rankings of series N1906 across Methods, based on several accuracy measures]

It may also be useful to examine the data visually, so that you can understand the difference between point forecast accuracy, based on vertical distances between actuals and forecasts, and profile accuracy measured with D(a|f). For example, N2519 also has a high profile skill score for a number of Methods but may end up with quite different results using accuracy measures based on absolute differences between actuals and forecasts.

Practical Takeaways

1. To validate effective forecasting methods for demand planning, one should follow transparent empirical findings that complete dynamic, multiple profile forecast cycles over a predetermined lead time (time horizon) with real-world data. Then, calculate and calibrate the skill scores.
2. It is smarter forecasting, in my experience, to understand what forecast profiles a Method generates than to specialize in the details of how the model was estimated. You will achieve greater agility and productivity in your demand forecasting process by not getting too tied up with tweaking parameter estimates and "optimal fitting" issues.
3. The spreadsheet environment is more than adequate, and perhaps more flexible for data explorations than established commercial systems. Data exploration generally requires simple operations and visualizations, at which the spreadsheet environment shines. Also, an open-source language like R is a free, modern, and convenient software tool for analyzing and comparing forecasting results.

If you contact me at [email protected], I will be happy to share the spreadsheet used for the calculations in this article. The M3 data are freely available in a convenient format (csv) for use in Excel and R. A paper (pdf) referencing the M3 Competition results can be downloaded as well.

Temper your trained Gaussian Mindset and remain skeptical of simple averaging as the best means of comparing things. Central tendencies may depend a lot on how the rest of the numbers are distributed in your measurement and correlated with corresponding values in a different measure.


Hans Levenbach, PhD, is Owner/CEO of Delphus, Inc. and Executive Director of the CPDF Professional Development Training and Certification Programs.


Dr. Hans is the author of a forecasting book (Change&Chance Embraced) recently updated with the new LZI method for intermittent demand forecasting in the Supply Chain.

With endorsement from the International Institute of Forecasters (IIF), he created CPDF, the first IIF certification curriculum for the professional development of demand forecasters, and has conducted numerous hands-on Professional Development Workshops for Demand Planners and Operations Managers at multinational supply chain companies worldwide.


The 2021 CPDF Workshop manual is available for self-study, online workshops, or in-house professional development courses.


Hans is a Fellow, Past President and former Treasurer, and member of the Board of Directors of the International Institute of Forecasters.

He is Owner/Manager of these LinkedIn groups: (1) Demand Forecaster Training and Certification, Blended Learning, Predictive Visualization, and (2) New Product Forecasting and Innovation Planning, Cognitive Modeling, Predictive Visualization.

I invite you to join these groups and share your thoughts and practical experiences with demand data quality and demand forecasting performance in the supply chain. Feel free to send me the details of your findings, including the underlying data without identifying proprietary descriptions. If possible, I will attempt an independent analysis and see if we can collaborate on something that will be beneficial to everyone.
