Believe it … or not!
Introduction - the need for "sensing"
In the previous post we discussed design principles that can provide built-in resiliency to disruptions. We saw that it invariably requires making a tradeoff between top KPIs. The more we can reduce the range of uncertainty due to disruptions in design variables, the more benign the tradeoff we can wrangle.
One of the urgent challenges in supply chains today is to predict disruptions in the variables that cause misalignment of demand and supply. While prediction science is by no means new, the prediction techniques we need in supply chain analytics are somewhat unique due to the nature of the problem space. Supply chain disruptions, by their nature, cannot be predicted well from past behavior alone, so a new approach needs to be taken. Today I would like to talk a little bit about the idea of "sensing", as in "demand sensing" and "supply sensing", which is generating quite a bit of excitement in the field. This idea has close analogies to the concept of "diversity reception" in wireless communication, which yielded huge success in the 2000s and enabled the 3G/4G/5G revolution in mobile communication. It is also related to another very successful technology called "sensor fusion", where we combine observations from multiple sensors (magnetometer, accelerometer, gyro, illumination, LIDAR, camera) and radios (NFC, Ultrawideband, WiFi, GPS, Cellular) to achieve highly accurate location detection and tracking, as well as more reliable driverless cars. We can leverage that experience and avoid reinventing the wheel!
Let's say we are a company that manufactures shoes, and let us concretely think of an unknown random quantity of interest - the demand that we will see next month for one of our flagship products, a sports shoe for cross training, denoted by X. We know from the historical record that the demand takes values (in units of 1000) from an alphabet {0, 1, 2, 3, ...} and we have an a-priori belief about how likely each value is. This belief, typically produced by a time-series prediction model operating on past values, accounts for the long-term average, variability, secular trend, seasonality and other temporal dependencies in demand. The belief is expressed as a probability mass function:

p(x) = Prob{X = x},   x ∈ {0, 1, 2, 3, ...}
If we have no further evidence to look at, our best guess of X will be based entirely on this a-priori belief. Depending on what our criterion of "best" is, we have various kinds of estimates. For example, to minimize the squared error we must use the mean value under the belief. To minimize the probability of a wrong prediction, we choose the location of the highest mode. It is a fundamental theorem that these belief-based estimators are in fact globally best - no other estimators can beat them.
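To make this concrete, here is a minimal Python sketch that computes both estimates from a belief; the pmf values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical a-priori belief over demand (in units of 1000),
# on the alphabet {0, 1, ..., 9}. The numbers are illustrative only.
demand = np.arange(10)
prior = np.array([0.01, 0.04, 0.10, 0.18, 0.22, 0.18, 0.12, 0.08, 0.05, 0.02])
assert np.isclose(prior.sum(), 1.0)

# Minimum mean-squared-error estimate: the mean value under the belief.
mmse_estimate = np.dot(demand, prior)

# Minimum probability-of-error estimate: the location of the highest mode.
map_estimate = demand[np.argmax(prior)]

print(f"MMSE estimate: {mmse_estimate:.2f} (thousand units)")
print(f"MAP estimate:  {map_estimate} (thousand units)")
```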
Unfortunately, in many cases, history can only give us a limited amount of predictive power. So things start getting really interesting only when we observe new external evidence (sometimes called "side information") that potentially tells us something more about the random demand X. This evidence does not come from observations of the demand in the past, but rather from observation of ancillary variables that either tend to cause modulations in demand (for example, product promotions executed by management) or are a reflected effect of changing demand (for example, product-related sentiment on social media). Both causes and effects have predictive power, and the probabilistic machinery used for prediction is agnostic to cause-effect relationships.
In general, our belief about X strengthens, in a specific sense we will discuss shortly, with each new piece of evidence we observe. In fact, if we keep processing more and more diverse and informative pieces of evidence, our belief can evolve to high certitude, as shown in this animation:
We show three trials of the random demand X and show how the observation of evidence progressively evolves the belief, starting from zero pieces of evidence (i.e. just the diffuse bell-shaped prior). Even qualitatively, you may observe some interesting things about how the belief evolves - not monotonically, not to perfection, and not at the same speed in each trial. In the rest of this post we will unpack the mathematics of how all this works and explain these qualitative observations.
The manipulation of beliefs
Any evidence we observe is also a random quantity Y, and we observe some specific realization Y=y. All the knowledge about X encapsulated in the observation Y=y is captured in the likelihood of X, given by

p(y | x)
It is crucial to view this as a function of the free variable x. The fundamental equation of probability theory tells us how to combine a-priori belief and the likelihood to produce an a-posteriori belief as follows:
A-Posteriori Belief = A-Priori Belief × Likelihood
Believe me, this equation is a rock star, and even after so many years of using it professionally, I marvel at its simplicity and interpretability. Mathematically it is written as

p_{X|Y}(x | y) ~ p_X(x) · p_{Y|X}(y | x)

or, more simply,

p(x | y) ~ p(x) · p(y | x)
The equation is derived in an elementary way from Bayes' law of conditional probability of events. But the simple derivation belies its deep significance.
The first thing to note is that this is a relationship between functions. The a-priori belief is a function of the placeholder demand variable x. The likelihood under the observation Y=y is another function of x. The belief after observation of the evidence Y=y is yet another function, given by the product of the a-priori belief and the likelihood. The first form of the equation is the strictly correct but cumbersome way of writing it, so we abuse notation and write the second simpler form. It is important, however, to keep in mind that the three terms, all denoted by p(.), are in fact distinct functions, inferred from context. You may also notice that rather than using an "=" sign, we used a "~" sign in the equation. That is because we are leaving out a normalizing constant, independent of x, which ensures that the belief about x will sum to unity.
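As a rough illustration, here is what one Bayes update looks like in Python. The prior and likelihood arrays are hypothetical numbers, not outputs of any real model:

```python
import numpy as np

def update_belief(belief, likelihood):
    """One Bayes update: multiply pointwise, then renormalize.

    belief     -- current belief p(x), an array over the demand alphabet
    likelihood -- p(y | x) evaluated at the observed y, as a function of x
    """
    posterior = belief * likelihood       # the "~" step
    return posterior / posterior.sum()    # restore the normalizing constant

# Illustrative numbers: a piece of evidence that favors higher demand.
prior = np.array([0.01, 0.04, 0.10, 0.18, 0.22, 0.18, 0.12, 0.08, 0.05, 0.02])
likelihood = np.linspace(0.2, 1.0, num=10)   # hypothetical p(y | x) profile
posterior = update_belief(prior, likelihood)
```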
Secondly, there is no qualitative difference between the a-priori belief and the a-posteriori belief. The a-posteriori belief should be used in the same way we would use the a-priori belief. For example, to minimize the squared error after observation of evidence, we would guess X to be the mean value under the a-posteriori belief, and so on. Connecting with our earlier comment, these are also the globally best estimators given observed evidence - no other estimators can beat them.
What happens when we have many pieces of evidence? How does that affect the belief? Let us say we observe two pieces of evidence, Y1 and Y2, one after another. For simplicity, let us stipulate that the pieces of evidence are statistically independent of each other conditioned on X. This means that, given X, there are no other hidden ("confounding") dependencies between them. This may seem like a hard condition, but in fact it is not, and we will discuss this later. Mathematically, this means that the likelihood under the joint evidence admits a product form as follows:

p(y1, y2 | x) = p(y1 | x) · p(y2 | x)
which allows us to write the a-posteriori belief after observing both pieces of evidence as

p(x | y1, y2) ~ p(x) · p(y1 | x) · p(y2 | x)
That has a nice iterative structure. We start with our a-priori belief, and with each piece of new conditionally-independent evidence we keep multiplying our current belief with the likelihood under the new observation. This means that the shape of the belief function keeps changing with each new observation.
It is apparent that the order in which we observe the evidence is of no significance - we end up with the same final belief. Secondly, "new" evidence is as good as "old" evidence - nothing should be thrown away. Lastly, the computation of the a-posteriori belief under thousands of conditionally independent pieces of evidence is "embarrassingly parallelizable"! All the multiplicative terms can be computed in parallel, and if any piece of evidence is missing during a particular trial, we can simply drop the corresponding multiplicative term from the RHS of the equation.
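A minimal sketch of this fusion, assuming we already have the likelihood arrays in hand; a None entry stands for evidence that went missing in this trial, and the log-domain arithmetic is just a numerical-safety choice, not part of the theory:

```python
import numpy as np

def fuse_evidence(prior, likelihoods):
    """Fuse conditionally independent evidence into a single belief.

    prior       -- a-priori belief p(x) over the demand alphabet
    likelihoods -- list of arrays, each holding p(y_i | x) at the observed
                   y_i; a None entry marks evidence missing in this trial
    Order does not matter, and each term could be computed in parallel.
    """
    log_belief = np.log(prior)
    for lik in likelihoods:
        if lik is None:
            continue                      # drop the missing multiplicative term
        log_belief += np.log(np.clip(lik, 1e-300, None))
    # Renormalize: this restores the constant hidden behind the "~" sign.
    belief = np.exp(log_belief - log_belief.max())
    return belief / belief.sum()
```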
As we observed qualitatively in the animation, with more and more pieces of evidence, our belief about X gets stronger on average. (Here the averaging is done over all the random quantities X and Y.) But what precisely do we mean by "strong"? The strength of a belief is measured by a functional called entropy. High entropy means large uncertainty, hence weak belief. Low entropy means less uncertainty, hence strong belief. Zero entropy means no uncertainty, hence perfect belief. It is an elementary but beautiful result of information theory that, on average, conditioning can only reduce entropy. So observing more and more pieces of evidence can only make our belief better on average. I keep adding the phrase "on average" because in any single trial the strength of belief can go up and down with additional observations, as you saw in the animation.
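For the curious, here is entropy as a few lines of Python; the two example beliefs are the extreme cases just mentioned:

```python
import numpy as np

def entropy(belief):
    """Shannon entropy in bits; lower entropy means a stronger belief."""
    p = belief[belief > 0]        # by convention, 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

uniform = np.full(10, 0.1)        # maximal uncertainty on a 10-letter alphabet
certain = np.zeros(10)
certain[4] = 1.0                  # all mass on one value: perfect belief

print(entropy(uniform))           # ~3.32 bits (= log2 10), weakest belief
print(entropy(certain))           # 0.0 bits, perfect belief
```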
"Additional evidence helps improve the strength of belief" - That has a nice egalitarian ring to it. But don't get too carried away! We are not guaranteed to reach perfect belief even if we keep observing more and more pieces of evidence. For that to happen, the pieces of evidence have to be informative enough to drain away all the uncertainty, which is rarely the case in any practical application. The amount of reduction in entropy that a new piece of evidence will provide (the "conditional mutual information") will in fact depend on what evidence has been previously incorporated into the belief. There are multiple ways of draining away the uncertainty - some ways may be quicker than others - but we will end up in the same place after all evidence is considered.
The desirability of conditionally independent evidence
Now let us return to the simplifying assumption we made about conditional independence. If we did not make that assumption, a belief evolution equation still holds, but it gets complex:

p(x | y1, y2, ..., yn) ~ p(x) · p(y1 | x) · p(y2 | y1, x) · ... · p(yn | y1, ..., yn-1, x)
Each subsequent term on the RHS is a progressively higher-dimensional object that must account for earlier observations, making it more difficult to manipulate. Moreover, the fully parallelizable aspect is lost! If any one of the observation terms is unavailable in a trial, we have to compute an entirely different expression. So, in my opinion, it seems inappropriate to call conditionally dependent observations "sensing". Rather, pieces of evidence that are conditionally dependent should be treated as a single meta-piece of evidence. For example, we may get several types of sentiment observations about our product (from web search, Facebook, Instagram, Twitter, and traditional media), and there are likely to be strong conditional dependencies between them. So we need to calculate the likelihood under those pieces of evidence taken together. On the other hand, it seems reasonable to treat social media evidence as independent of product promotions, because the latter are usually planned by marketing and business planning for reasons unrelated to social media sentiment.
In practice, it is sufficient to have approximate independence - the belief calculation will still work fine. Why these "mean-field approximations" work well is a deep topic involving information geometry.
One final point on why the conditional independence assumption makes sense. Collecting a lot of conditionally dependent observations is not a great use of resources, because those additional observations do not reduce entropy as much as conditionally independent observations do. So when deciding what to "sense", look for diversity of evidence, not comprehensiveness. A little bit of social media signal, a little bit of promotions information, a little bit of weather forecast, a little bit of geo-political signal ... and we can bake a beautiful cake!
Machine Learning
In all the preceding discussion, we have used the probability expressions for priors and likelihoods assuming we know them. But how do we know them? Since there are no "laws of physics" to appeal to in supply chain operations, we need to be fully data-driven. We need to learn those functions from data, typically using machine learning. For example, the likelihood can be learnt as a simple supervised model whose input is the evidence y and whose output is the likelihood p(y|x). As noted above, y can be a complex tensor of multiple conditionally dependent variables - aka "ML features".
Deep learning models are especially suited to this task of likelihood estimation because they naturally produce beliefs. We train the model with many (y, x) exemplars observed previously. We can also take plentiful unlabeled data (exemplars of y only) and ask human experts to label them with x, based on their domain knowledge. In any case, with sufficient training, we can produce a fairly good representation of the likelihood functions for the various sets of independent observations, and use them in the belief equation.
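As one hedged illustration of how this could be wired up: a discriminative model (here a plain scikit-learn logistic regression on synthetic data) learns p(x|y) directly, and dividing by the prior gives a quantity proportional to the likelihood p(y|x) - the so-called "scaled likelihood" trick. All the data and names below are invented for the sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row of Y_train is an evidence feature
# vector (e.g. sentiment scores, promotion flags); x_train is the demand
# bucket that was eventually realized. Synthetic numbers for illustration.
rng = np.random.default_rng(0)
Y_train = rng.normal(size=(500, 3))
x_train = rng.integers(0, 10, size=500)

# A discriminative model learns p(x | y) directly from (y, x) exemplars.
model = LogisticRegression(max_iter=1000).fit(Y_train, x_train)

def likelihood(y, prior):
    """Scaled likelihood: p(y | x) is proportional to p(x | y) / p(x).

    prior must be indexed in the same class order as model.classes_.
    The dropped constant is independent of x, so the "~" update absorbs it.
    """
    p_x_given_y = model.predict_proba(y.reshape(1, -1))[0]
    return p_x_given_y / prior
```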
Hypothesis testing
Finally, I would like to close this post by noting that everything we said about guessing an unbounded scalar variable like demand applies equally well to estimating any other random variable of interest, such as an M-ary categorical variable or even a binary hypothesis.
For example, suppose we are concerned about the reliability of a particular supplier who supplies a part used in our shoe. Is that supplier in imminent danger of being disrupted, leading to an interruption of our production? In such a hypothesis testing scenario the alphabet is {H0: not imminent, H1: imminent} and we are trying to decide which of the two hypotheses is true. In this case too we may have a prior - the historical reliability of that supplier. But the prior is not enough, since even the best suppliers may run into trouble during pandemic-related labor shortages. So it can be very useful to sense ancillary information such as their publicly reported financial data, social media sentiment about working conditions in their factories, and pandemic infection rates in the regions where their factories operate. The belief about the hypothesis then evolves with these multiple sensed pieces of evidence, as shown in the following animation. Again note the oscillations and gyrations.
When the belief about H1 reaches some high threshold such as 95% (as happened in trial 3 above), we declare that "H1 is true with statistical significance". That means we may trigger an alert that the supplier is about to be disrupted and take rapid mitigation actions, such as finding another supplier or reducing our dependence on the part that will be in short supply. By acting in a timely way we can hopefully avoid the worst of the impact from the disruption.
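Here is a toy sketch of such a monitor in Python, using the odds form of the belief update. The prior and the likelihood ratios are invented numbers standing in for what learned likelihood models would supply:

```python
def monitor_supplier(prior_h1, likelihood_ratios, threshold=0.95):
    """Evolve the belief in H1 (imminent disruption) and alert at a threshold.

    prior_h1          -- historical probability that this supplier is disrupted
    likelihood_ratios -- p(y_i | H1) / p(y_i | H0) for each sensed evidence y_i
                         (in practice these come from learned likelihood models)
    """
    belief_h1 = prior_h1
    odds = prior_h1 / (1.0 - prior_h1)
    for i, lr in enumerate(likelihood_ratios, start=1):
        odds *= lr                          # Bayes update in odds form
        belief_h1 = odds / (1.0 + odds)
        if belief_h1 >= threshold:
            return f"ALERT after evidence {i}: P(H1) = {belief_h1:.3f}"
    return f"No alert: P(H1) = {belief_h1:.3f}"

# Hypothetical evidence stream: financials (mildly bad), sentiment (bad),
# regional infection rates (very bad).
print(monitor_supplier(prior_h1=0.05, likelihood_ratios=[3.0, 6.0, 40.0]))
```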