Random, Stochastic, Probabilistic
Gerald Gibson
Principal Engineer @ Salesforce | Hyperscale, Machine Learning, Patented Inventor
At the end of the previous article it was mentioned that we would show how, from a computer programming perspective, "probability mass functions" are simple. Before we dive into that, let's add some clarity around more terminology. Most, if not all, of statistics is centered around the idea that we rarely have exact answers to any question. If we end up with a single-number answer, it was actually selected as the most likely out of a list of other possible answers. This concept is a little strange from the computer programming point of view. Generally we pass input variable(s) into a function and get an output. We as developers go out of our way to make the system as deterministic as possible. If we send a value of "5" into our function and get an output of "12", then we expect to get a "12" every time we have a "5" as our input.
In statistics, functions do not work this way. By default it is assumed there are multiple possible answers, so the whole workflow is set up so that our inputs are a list of possibilities (each with a probability value assigned) and the result is a list as output (each list item with a probability value assigned). This is not very apparent when looking at the math itself, with all its Greek letters.
The math makes it look like it is operating on singleton variables. In reality each of those variables often represents a list (i.e. an array or vector) of values, and it is just assumed everyone understands that the math is going to be computed in a loop over each of the values in the list. When you see a little "i" (or some other lowercase letter) hanging off the end of a variable in the math, it is usually referring to an index into a list of values.
So we often see the words Random, Stochastic, and Probabilistic in the statistics world. These are essentially referring to the same thing. Almost nothing is 100% certain, so problems are approached in a way where many possibilities are arranged into a list and each possibility gets a probability (percentage) value assigned. Then they are computed one by one. If there are multiple input list variables, then value "i" from each list is sent through the equation together as you loop over the lists. The output is often a list of probability values from which we then select the highest or lowest value and take that as our answer. When we see the words maximization or minimization, they are referring to selecting the highest or lowest value from a list of outputs. For example, in "expectation maximization", "expectation" means average and maximization refers to selecting the highest averaged value through each iteration.
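As a rough illustration of that loop-over-lists idea, here is a tiny Python sketch (the candidate names and probabilities are made up for the example): the indexed math notation becomes an ordinary loop over parallel lists, and "maximization" is just picking the entry with the largest value.

```python
# Hypothetical example: three candidate answers, each with a probability assigned.
candidates = ["low", "medium", "high"]   # the list of possibilities
probabilities = [0.2, 0.5, 0.3]          # one probability per possibility

# The math's "for each i" is just a loop over the list indices.
best_index = 0
for i in range(len(candidates)):
    if probabilities[i] > probabilities[best_index]:
        best_index = i

# "Maximization" = select the possibility with the highest value.
print(candidates[best_index], probabilities[best_index])   # medium 0.5
```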
Probability Mass Functions work this way. There are multiple input variables, each of which is a list of one or more values. Many times in a description of the math you will see the term "PMF" used to refer to both the operation itself and the output of that operation. This seems common in the math world. In reality the PMF is the operation, and the output is a "distribution". What is a distribution? It is a list of possibilities, each with a probability value assigned. The Probability Density Function (PDF) is the "analog" or "continuous"-value version of the PMF, which uses discrete values. Continuous values are decimal values that can go to any decimal length, such as Pi (3.14159265358979...), and discrete values are categories or numerical values that are "snapped" to a finite list of possible values (often integers).
PMFs and PDFs can be visualized with graphs: curvy lines for continuous PDF output values, and blocks of various heights (like a histogram) for PMFs. Here we can see both in one graph. The blue blocks are the PMF values and the curvy orange line represents a continuous PDF output.
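A minimal sketch of how such a graph could be produced, assuming numpy, scipy, and matplotlib are installed (the particular distributions chosen here are just for illustration):

```python
# Draw a discrete PMF as blue blocks and overlay a continuous PDF as an orange curve.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

k = np.arange(0, 16)                        # discrete outcomes
pmf = stats.binom.pmf(k, n=15, p=0.4)       # PMF: one probability per outcome

x = np.linspace(0, 15, 300)                 # continuous range of values
pdf = stats.norm.pdf(x, loc=6, scale=1.9)   # PDF: a smooth curve over that range

plt.bar(k, pmf, color="tab:blue", label="PMF (discrete)")
plt.plot(x, pdf, color="tab:orange", label="PDF (continuous)")
plt.legend()
plt.show()
```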
The input and output data for a PMF are variables that represent lists of possibilities and their assigned probabilities.
In this example we can see "Ranges", which are actually categories (15-20 is a category and so is 40-45). When you have a list of probability values, they sum up to 1 (or 100 if in percentage format). In histogram format each category would be a block, and the block's height would be determined by the probability value.
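In code, such a distribution is nothing more exotic than two parallel lists: the categories and their probabilities, which sum to 1. A small sketch (the ranges and numbers here are invented for illustration):

```python
# Hypothetical distribution over range categories; probabilities sum to 1.
ranges = ["15-20", "20-25", "25-30", "30-35", "35-40", "40-45"]
probabilities = [0.10, 0.25, 0.30, 0.20, 0.10, 0.05]

print(sum(probabilities))   # 1.0 (give or take floating point rounding)

# In histogram format each category is a block whose height is its probability.
for r, p in zip(ranges, probabilities):
    print(f"{r}: {'#' * int(p * 100)}")
```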
Now for some code to see PMFs in action in a few different ways. As said before, PMFs have multiple input variables each of which is a list of one or more values. In some cases a variable has only one value and the other variables have multiple. Which you choose depends on the scenario or the question you are trying to answer.
Let's say there is a new manufacturing process being set up, and you are tasked with figuring out how many defects can be expected over the long run. You will be allowed to run the assembly line many times to gather the input data you need so you can predict what can be expected under various scenarios. Your results will be used to decide how the company will run this part of its business as it ramps up to producing many millions of pieces.
So you spin up the assembly line and produce 100 pieces. Then you repeat this until you have done 10 runs. You have each piece inspected and find that, on average, you can expect 2% (0.02) of the pieces to be defective.
The question you will now answer statistically is... in the long run (on average), what are the probabilities of getting 0, 1, 2, 3, 4, 5, 6, or 7 defects each time you manufacture 100 pieces?
Because this is about an event that is binary (like a coin flip), i.e. each piece manufactured is either defective or not, we will use the "binomial" version of the PMF function. Here "bi" means two and "nomial" means named type (or category). So this PMF function is designed to produce probabilities for "two possibility" events. Multinomial PMF functions would be used in situations where you have more than two possibilities (like the roll of a die).
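A minimal sketch of this calculation in Python, assuming scipy is available (here n, p, and k are the number of pieces per run, the observed defect rate, and the list of defect counts we are asking about; the same three variables are discussed below):

```python
# Binomial PMF and CDF for the number of defects in a run of 100 pieces.
from scipy import stats

n = 100                  # pieces manufactured per run
p = 0.02                 # long-run defect rate observed in the trial runs
k = list(range(8))       # defect counts we care about: 0 through 7

pmf = stats.binom.pmf(k, n, p)   # probability of exactly k defects
cdf = stats.binom.cdf(k, n, p)   # probability of k defects or fewer

for defects, prob, cumulative in zip(k, pmf, cdf):
    print(f"{defects} defects: pmf={prob:.3f}  cdf={cumulative:.3f}")
```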
CDF stands for cumulative distribution function. Here it calculates the same probabilities as the PMF. All it does differently is sum up the probabilities as it computes each possible scenario.
So in the example above, when it calculates the probability of getting four defects out of 100 pieces manufactured, the output value is the sum of the probability values for zero defects, one defect, two defects, three defects, and four defects. As it turns out, when graphed, the output values from a CDF make it pretty easy to see at which scenario you have already covered all the most likely outcomes, since checking any further scenarios only changes the cumulative value slightly.
The output from the code above looks like this.
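Rounded to three decimal places (as printed by the sketch shown earlier, with n = 100, p = 0.02, and k running from 0 through 7), the probabilities come out roughly as follows:

```
0 defects: pmf=0.133  cdf=0.133
1 defects: pmf=0.271  cdf=0.403
2 defects: pmf=0.273  cdf=0.677
3 defects: pmf=0.182  cdf=0.859
4 defects: pmf=0.090  cdf=0.949
5 defects: pmf=0.035  cdf=0.985
6 defects: pmf=0.011  cdf=0.996
7 defects: pmf=0.003  cdf=0.999
```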
Different questions can be answered with the same PMF function above simply by changing the variable values you pass in: specifically, the variables n, p, and k. In the example above the variable k was a list; however, any of the three could have been a single value (a list of one) or multiple values.
For example, let's say you instead want to run eight trials, and each time you use a slightly different configuration of the assembly line. For the first trial you get 3% defects and want to know, under that scenario, the probability of getting zero defects. Then you do another trial run of 100, get 2% defects, and want to know the probability of getting one defect under that scenario, and so on. So here you are looking at eight different scenarios, each with a different possible outcome, and want the probability for each. It is the same PMF function, yet now we are answering a set of different questions, each with one possible outcome, instead of a single question with a series of possible outcomes like in the first example. Because these are actually a set of different questions rather than a single question, we do not bother using a CDF.
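A sketch of that variation, again assuming scipy (the defect rates and defect counts after the first two are made up to fill out the eight trials):

```python
# Eight separate scenarios: each trial has its own observed defect rate (p)
# and its own question (the probability of exactly k defects out of 100 pieces).
import numpy as np
from scipy import stats

n = 100
p = np.array([0.03, 0.02, 0.025, 0.01, 0.04, 0.015, 0.02, 0.03])  # one rate per trial
k = np.array([0, 1, 2, 3, 4, 5, 6, 7])                            # one defect count per trial

# Element i of p is paired with element i of k, giving eight independent answers.
probs = stats.binom.pmf(k, n, p)
for i, (rate, defects, prob) in enumerate(zip(p, k, probs), start=1):
    print(f"trial {i}: defect rate {rate:.3f}, P({defects} defects) = {prob:.4f}")
```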
This output distribution (a list of possible scenarios, each with a probability output value) would be called a "posterior" in the Bayesian language. That list of values could then be used as input values for the next question in your workflow, at which point it would be called a "prior" distribution. For example, you might want to use these probability values to ask the question, "Given these input probabilities, what are the expected probabilities of getting 0, 100, 200, 300, 400, 500, 600, and 700 defects when producing 20,000 pieces at a time with each of the eight different assembly line configurations?"
The takeaway from this article is that despite all the convoluted, overly grandiose, or plain strange labels given to everything in the math world, these are actually much simpler objects when you implement them in code.
As an aside, you might ask, "Where did all these names for things come from?"
It varies. Many seem to have come to mind when someone looked at the values graphed and something in the graph made them think of certain things, and that is how the terms were coined. For example, look up "moral" and "immoral" Bayesian graphs... weird nomenclature for sure. There seems to be very little attempt to name things with the intention of easing the transfer of knowledge from one person to another. For example, there is this story about where the term entropy (a measure of expected or unexpected randomness) came from:
“My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.'”