Managing by Exception, the Right Way
Oftentimes I get the question "how many SKU's should a planner be able to manage?". Depending on the company, the average may be anywhere from 100 to 10,000 SKU's per planner. My usual response is "that is the wrong question!". The right question is "how many exceptions should a planner be able to manage?". The wide range of SKU's-per-planner is testimony that the original question is wrong.
Surprisingly, when considering only those companies that claim to manage by exception, that range does not narrow much, although they tend to gravitate to the top half of the range. Does that mean that the number of exceptions per planner is also the wrong metric?
Here, I will make the case that, no, it is the right metric. But most companies that believe they are managing by exception are actually managing by alert. And managing by alert will help, but not much. Alert-driven planning takes a planner from 1,000 SKU's to 10,000 SKU's. When planning systems are truly exception-driven, the same planner should be able to handle 100,000's to 1,000,000's of SKU's.
What is an Exception?
To answer why so many companies believing themselves to be exception-driven are not seeing stellar planner productivity we need to go all the way back to the basics. We need to properly define what an exception is, and what it is not. And we need to do so in a practical way. Practical, meaning that it aligns with the planning processes and systems that are in place, and their purpose.
Colloquially, an exception is something that you did not expect would happen, but it happened anyway. Alternatively, if we make rules, an exception is a violation of the rule. These are not the same concepts, but both have validity in planning. More formally:

Type 1: an event that is unlikely to occur, yet occurs anyway.

Type 2: a violation of a rule or assumption built into a process or system.
We will need to treat these two types of exceptions differently. More on that below. But first, let's see how they relate to systems and processes.
When companies implement a new system, many tradeoff decisions are made weighing the complexity and value of including aspects in the solution against how often the situation will occur, and how bad it will be, if the system does not automatically deal with it. For example, if it takes a month to implement functionality that addresses a situation that happens a few times a year and takes an hour to handle manually when it does, it is not value-added to implement it. The net of it is that it is never a question of whether any situations are not covered by a solution, but rather how many. All of those lead to type 2 exceptions.
Type 1 exceptions are less caused by system limitations and more by process limitations, often not fully within our control. If a customer runs a promotion without notifying us, the resulting problems are due to process issues. If supply is held up in customs or a supplier factory burned down, whether that leads to a problem depends on how resilient our processes are. Can we ramp up capacity quickly, can we find alternate short-term suppliers, and so forth. These are all type 1 exceptions of various degrees. A customs holdup is less exceptional than a factory burning down. This is a key characteristic of type 1 exceptions: they can be quantified on a continuous scale. Type 2 exceptions, by contrast, are binary: the rule is either violated or it is not.
A common misconception is that outliers are exceptions. Some outliers are, but most outliers are not exceptions!
Outliers are usually problems with our ability to measure. Not problems with the thing we measure.
Where is the Value?
Depending on the type of exception - or non-exception - value can be achieved in different ways and in different amounts. Type 2 exceptions tend to be average in every way: they occur somewhat consistently, cause a medium amount of disruption, and require a medium amount of effort to address. Type 1 exceptions can range from moot to existential threats and everything in between, depending on how well we can prevent, identify, and combat them. Chasing outliers is generally how planners waste most of the time that they should have spent dealing with real problems instead. Enormous efforts are exerted this way for very little value.
Let's start with the easy case: type 2 exceptions. These are caused by design decisions. If the value assessment was done properly during implementation and the situation on the ground has not changed significantly, these type 2 exceptions should be low effort to manage. In the crunch of an implementation project, however, handling of expected situations may have been descoped. Whilst this type of exception should have limited impact, poor design or poor implementation may cause many small impacts to accumulate into large recurring efforts. Ultimately this sets a baseline of planner inefficiency that cannot be improved upon. Death by a thousand papercuts.
If these exceptions are not manageable, the solution is easy. It may be cumbersome, expensive, and time-consuming, but easy: just fix the hole in the solution or, if that is not possible, replace it with a solution that does address it properly. In practice, there may be all kinds of budgetary, capacity, or political reasons why fixing or replacing a solution is not possible. In such cases, the company will need to accept a base level of planner inefficiency.
There is a huge risk there, though, since type 1 exceptions will occur at unexpected times no matter what. If the planners are already overwhelmed continuously addressing fixable type 2 problems, they have no bandwidth left to properly monitor, let alone deal with, the real problems when they occur.
The really tough and disruptive exceptions are predominantly of type 1. Depending on how well these are addressed - and assuming the easy type 2 exceptions have been taken care of - the number of SKU's a planner can manage could range from 1,000 to 1,000,000 for the same portfolio. Type 1 exceptions are where the money is at! The big problem is
how do you identify them?
Real type 1 exceptions do not in general occur with high frequency. But in an ocean of data, how do you discover one is going to happen or is happening? The typical solution is to look for outliers. The problem with this approach is that most outliers are not exceptions, and most outliers signal a problem only after it has already occurred. This means the planner will see a list of candidate exceptions that consists predominantly of false positives: alerts that are not exceptions. To reduce the number of false positives, the sensitivity of the outlier detection can be reduced, but this leads to an increase in false negatives: real exceptions that the planner never gets alerted to, which then blow up and become crises.
In general, planners waste a lot of time checking out hundreds of false positives for every one real exception - hours to days each week. Then they waste even more time fighting a single crisis when a false negative flies under the radar and blows up. As long as outliers are used as a proxy for exceptions there is no way out of this conundrum: you can reduce one error type only by increasing the other, and vice versa. This forces an inefficiency near the low end of the SKU's-per-planner range. It does not matter whether the outlier bounds are determined by trial-and-error or through some theoretical number of standard deviations; the conundrum is the same.
As an example of the outlier approach, many companies monitor how large the difference is between forecasts and actuals. The simplest approach is to set some percentage range outside of which an alert is generated. For example, alert the planner if actual demand deviates more than 25% from the forecast.
Figure 1: demand (green) vs forecast plus a 25% "normal" range (blue). Outliers are actuals that fall outside the blue area.
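The naive threshold check is trivial to implement. Here is a minimal sketch (function and data names are illustrative, not from any specific planning system):

```python
def threshold_alerts(forecasts, actuals, pct=0.25):
    """Flag every period where actual demand deviates more than
    `pct` (default 25%) from the forecast - the naive alert approach."""
    alerts = []
    for period, (f, a) in enumerate(zip(forecasts, actuals)):
        if f > 0 and abs(a - f) / f > pct:
            alerts.append(period)
    return alerts

forecast = [100, 100, 120, 110]
actual   = [ 98, 140, 118,  70]
print(threshold_alerts(forecast, actual))  # -> [1, 3]
```

Note that every SKU gets the same 25% band regardless of its own volatility, which is exactly where this approach falls apart.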
The problem is that for some items anything outside 5% may be a true exception (causing false negatives), whilst for other items deviations greater than 100% may still be commonplace (causing false positives).

Since planners are generally clever and creative, some will take a few different groups of items and set a different alert threshold for each group. For example, if SKU's are classified in an ABCXYZ scheme, X's may be assigned a low threshold, Y's a medium one, and Z's a high threshold. This generally helps somewhat, but it is a bandaid on a severed leg. Not all X's are the same, not all Y's are the same, etc. No matter how you group, the outlier approach forces you to treat all members of a group identically.
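The per-class workaround can be sketched in a few lines. The class codes and threshold values below are hypothetical examples, not recommendations:

```python
# Hypothetical alert bands per demand-variability class (ABCXYZ scheme):
# stable X items get a tight band, erratic Z items a wide one.
CLASS_THRESHOLDS = {"X": 0.10, "Y": 0.25, "Z": 0.60}

def class_alert(sku_class, forecast, actual):
    """Alert when the deviation exceeds the band assigned to the SKU's class."""
    band = CLASS_THRESHOLDS[sku_class]
    return forecast > 0 and abs(actual - forecast) / forecast > band

print(class_alert("X", 100, 115))  # -> True: a 15% deviation exceeds X's 10% band
print(class_alert("Z", 100, 115))  # -> False: the same deviation is routine for a Z item
```

The limitation remains: every member of a class shares the same band, however different their individual behavior.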
All this wasted time is not only costly for the company, it is some of the most mind-numbing work imaginable for the planners. And it takes away valuable time the planners could be using to prevent real problems and to add value in a myriad of other ways.
The Right Approach
If outlier detection is the wrong approach, what then is the right approach? For that we need to go back to the definition of an exception:
An exception is something that is unlikely to occur, yet it does.
This reeks of probabilities. We need to determine the probability of something occurring, and for very low probability events monitor whether they happen. Naturally, the scope can be reduced further by only caring about those types of events that could have significant impact. Again, there is a probability angle there. Some events will never have a big impact, some will always have a big impact, and some may or may not. We can hyperfocus on those events that are both highly improbable to occur and have a sufficiently high probability of causing high impact when they do occur.
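As a sketch, this hyperfocus rule amounts to a conjunction of two probability cutoffs. The cutoff values below are placeholders; in practice they would be tuned to the portfolio and planner capacity:

```python
def needs_attention(p_occurrence, p_high_impact,
                    rarity_cutoff=0.01, impact_cutoff=0.5):
    """An event deserves planner attention only if it is both
    improbable (rare) and likely to cause high impact if it occurs.
    Cutoff values are illustrative placeholders."""
    return p_occurrence < rarity_cutoff and p_high_impact > impact_cutoff

print(needs_attention(0.002, 0.9))   # rare and damaging -> True
print(needs_attention(0.002, 0.05))  # rare but harmless -> False
print(needs_attention(0.30, 0.9))    # damaging but routine -> False
```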
If we apply this to the example in the last section we can see how out of the many outliers identified only two were real exceptions, and how one exception was missed completely.
Figure 2: Top graph: demand (green) vs forecast plus a 25% "normal" range (blue). Outliers shown as red circles. Bottom graph: true exceptions shown (red), their columns marked in pink for visibility. Red diamond is false negative: exception, but not an outlier.
For a probabilistic plan or forecast this is a natural fit. Since every uncertain quantity, rate, yield, lead time, and so forth is already expressed as a whole probability distribution, finding exceptions is exceptionally easy, pardon the pun. Not much work is needed to translate the probabilities of predictions into their conditional probabilities given the actual values. If the probabilistic model is accurate - which should be a minimum requirement for any such model, since that is its raison d'être - false positives and false negatives should be almost non-existent. This means planners can indeed hyperfocus on only true exceptions and prevent them from becoming crises, and then have a lot of extra time to provide value-add they otherwise could not. (If you do not know about probabilistic plans and forecasts, check out this primer.)
For traditional (deterministic) plans and forecasts, identifying true exceptions is harder, but not impossible. The key differentiator is that deterministic plans assume either no distribution (Dirac) or a stationary normal (Gaussian) one, whereas probabilistic plans make no such assumptions. The result of this stationary/no/normal distribution assumption is a dramatic loss of accuracy in the plans and forecasts. This in turn leads to an order of magnitude greater number of exceptions that need to be addressed. For plans this is inescapable without replacing them with probabilistic plans; for forecasts some mitigation is possible. Then, by monitoring for true exceptions instead of mere outliers, we can prevent the effort from growing by yet another order of magnitude.
The solution is to avoid, when measuring results, the naive assumptions that went into creating the plans and forecasts. Otherwise it requires mostly the same statistical techniques used when searching for outliers. The plans and forecasts are deterministic and cannot retroactively be made probabilistic. But for plans we can go back and treat historical actual values as random variables, and similarly for forecasts we can treat the error residuals as random variables. For both we will need to explicitly relax the assumptions that they are unbiased, normal, and stationary. When we have lots of data we could use empirical distributions, but to get reliable results in general we will need to assume some more applicable or less restrictive parametric distribution.
This is usually where the hard questions begin, the prime one being: which distribution do we use? The unfortunate answer is: it depends. It will generally be a tradeoff between a distribution that is versatile enough to cover all possible cases and one that is specific enough to be easy to work with. An example of the former is the Tweedie distribution: it can fit almost any planning situation, but estimating its parameters is incredibly tough to do well. An example of the latter is the normal distribution. In practice, leaning toward the latter end of the range is usually more work overall, but each step is a lot easier. Just do not go all the way to the normal distribution unless there is very clear evidence it is warranted.
For most quantity data (such as demand) and lead time data the easiest starting point is assuming a Gamma distribution, and for most count data a Poisson distribution. Both tend to far outperform the normal distribution and are just as easy to work with. Whilst many claims have been made that the normal distribution is a good fit for most data, the reality is that it rarely occurs in real-world supply chains, at least at the granularity where the distribution matters. The largest errors of the normal assumption also occur in its tails (the extreme values), and that happens to be exactly where all exceptions live. So whilst the normal assumption is merely damaging when creating plans or forecasts, for exception detection it renders results utterly useless.
So for a quick win, Gamma and Poisson are good starting points. But beware: they may still be far from the true distribution. As your understanding of uncertainty, distributions, and exceptions matures, you will want to experiment and test for better distributions to further reduce the errors in exception reporting. These distributions do need to be applied differently for residual errors. Note that both are bounded by zero (no negative values), and differences can be negative, so using raw differences is not an option. Instead, you could assume the forecast is the mean of the distribution. Even better, you could correct the forecast for historical bias to get an unbiased estimate of the mean. For example, if historically the forecast was 5% too high, divide the forecast by 1.05 and assume that to be the mean of the future distribution. For Poisson, the entire distribution is now determined. For Gamma, you would still need to determine the variance. You can choose to assume a constant variance (not fully stationary, since the shape will still change) or possibly some scaling with the forecast if the data indicates that is a better fit.
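The bias correction and the Gamma parameterization from an assumed mean and variance can be sketched as follows. This is a minimal method-of-moments illustration under the assumptions above, not a full fitting procedure:

```python
def debias(forecast, relative_bias):
    """Correct a forecast for known historical bias to get an unbiased
    mean estimate. relative_bias = 1.05 means forecasts ran 5% too high."""
    return forecast / relative_bias

def gamma_params(mean, variance):
    """Method-of-moments Gamma parameters (shape k, scale theta)
    from an assumed mean and variance: k = mean^2/var, theta = var/mean."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale

mu = debias(105.0, 1.05)            # unbiased mean estimate, approx. 100.0
shape, scale = gamma_params(100.0, 400.0)
print(shape, scale)                 # -> 25.0 4.0
```

For Poisson the corrected mean alone determines the whole distribution; for Gamma the variance still has to come from the data, as noted above.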
Finally, how do you measure an exception when you have fitted some non-normal distribution? This one is easy: just take some high enough value on the cumulative distribution function (cdf) if looking for exceptionally high values, or low enough on the cdf if looking for exceptionally low values. Which value you pick will be portfolio-specific and will tend to be planner preference-specific. Often values like 0.99 or 0.995 lead to a manageable planner workload, but lower values may be necessary if the assumed distribution is not close enough to reality, to avoid expensive false negatives.
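For count data under the Poisson assumption, the cdf check needs no external libraries; a minimal sketch (for the Gamma case you would instead use a regularized incomplete gamma function, e.g. from scipy):

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), computed by direct summation."""
    return sum(lam ** i * exp(-lam) / factorial(i) for i in range(k + 1))

def is_exception(actual, mean, hi=0.995, lo=0.005):
    """Flag actuals that fall in either extreme tail of the assumed
    distribution. The 0.995/0.005 cutoffs are the illustrative values
    mentioned in the text; tune them to planner workload."""
    p = poisson_cdf(actual, mean)
    return p > hi or p < lo

print(is_exception(15, 5.0))  # -> True: 15 units against a mean of 5 is a tail event
print(is_exception(7, 5.0))   # -> False: well within normal variation
```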
NOTE: All the above covers cases where unexpected behaviors have already occurred or are far enough along to be visible in historical data. Predicting exceptions before they happen is a whole different problem and out of scope for this article.
In Summary
So, we have two main types of exceptions with two different ways to tackle them. Type 2 exceptions provide a baseline inefficiency; they are easy to fix, but require investment to do so. Type 1 exceptions are highly disruptive and much more difficult to identify if the plans and forecasts themselves are not probabilistic. These require learning a different paradigm and proper analysis of your own data, but are not reliant on executive approval of a big project. Ultimately, both will need to be tackled to achieve best-in-class planner performance.
Most companies will have some data where alerts are generated based on heuristically determined hard bounds on values, and other data where alerts are generated based on standard deviations. Both will lead to mountains of false positives and to false negatives, where an alert should have been given but wasn't. A change of paradigm from quantity-based alerts to probability-based exceptions is required to get out of this bog.
Most planners will have seen great benefit when they made the move from having no feedback of extreme data points to having automated alert reports. This is like having a haystack with a million hay stalks and ten needles, then getting a report with a thousand results, 7 of which are needles. Most definitely an improvement, but the 993 false positives and 3 false negatives still hamper productivity. A true exception-driven result would be a list of 10 items, all of which are needles. In practice, there will always be some errors both ways and planners will aim to err on the safe side and maybe choose to see 20 results to ensure no false negatives occur (all the needles are found).
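The haystack numbers translate directly into precision and recall. A tiny sketch, using the illustrative counts from the example above:

```python
def precision_recall(n_flagged, true_in_flagged, total_true):
    """Precision: share of flagged items that are real exceptions.
    Recall: share of real exceptions that were actually flagged."""
    return true_in_flagged / n_flagged, true_in_flagged / total_true

print(precision_recall(1000, 7, 10))  # alert-driven: (0.007, 0.7)
print(precision_recall(20, 10, 10))   # exception-driven with a safety margin: (0.5, 1.0)
```

The jump from 0.7% precision to 50% precision, while recall rises to 100%, is what frees up the planner's time.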
Many companies believe their systems are exception-driven. Even many commercial software vendors will claim their software is exception-driven when in reality it is merely alert-driven. This article illustrated how false positive alerts drain planners' time chasing wild geese, and how false negatives - the alerts that should have been provided but never were - lead to crises where multiple departments get involved to salvage the situation. These hamper planner productivity to such an extent that planners can only manage a fraction of the SKU's they would have been able to if they had real exception notifications. All the other, unmanaged, SKU's are left to erode your margins. With this article it is my hope you will be able to assess whether you are dealing with true exceptions or merely alerts, and that it provides guidance on how to become truly exception-driven.
If you are interested in probabilistic planning and forecasting, please consider joining the "Probabilistic Supply Chain Planning" group here on LinkedIn.
Find all my articles by category here. Also listing outstanding articles by other authors.