Bayesian probabilities visualized


I once saw an interview with Benoit Mandelbrot in which he described how, as a child studying math, he saw shapes in his mind instead of a maze of formulas. When I was young I was interested in writing computer code rather than math; however, I also preferred to see my code as objects in my mind that moved around and plugged into other objects. I now suspect that what we were both doing was mapping complex concepts onto simpler abstractions, which makes "thinking exercises" go more quickly since we do not constantly need to reinterpret the complex ideas while trying to build solutions from these components.

More recently I have been deep diving into the world of statistics, and I have spent much of my time reading the same subjects over and over from different authors, trying to find these mappings from complex ideas to simpler shapes I can see and juggle in my mind.

A year or so ago I worked with someone by the name of Ed Thayer who introduced me to the idea of Bayesian statistics. He talked in the academically precise language of Bayes. I believe he taught the subject as well, which probably explains his command of the language. As he spoke about posteriors, priors, hypotheses, the probability of A given B, and so on, I had to constantly map what he said, in real time, to what I imagined he "really meant". I certainly did not grasp everything correctly at the time, but he ingrained in me the desire to master this type of math... and for me that means simplification into something I can see in my mind.

In this article I am going to step through each of these elements of the Bayesian "language" and map them to images on the page and to simpler concepts that would not impress a school professor, but that will give the reader the tools to quickly understand and make use of this way of looking at probabilities. I have never seen this done in any books, tutorials, articles, or videos. For people who think like me, however, I think having these simpler constructs moving around in our minds as we dive deeper into this world makes the "real time mapping" of complex concepts go much more smoothly.


Let's begin with the word "probability". How can we think of this in as simple a manner as possible? I have found that seeing probabilities as just percentages (e.g. 50%) when visualizing a concept, and in decimal form when doing the math, does the trick. When reading about machine learning predictions it is easy to get caught in the trap of imagining there must be some very complex mathematical construct that gives this power of "prediction". However, all we are really talking about are percentages, for example "What is the percentage chance of a stock price going up in the next hour?" or "What is the percentage chance this image of an animal is a cat vs. a fox or a dog?"

Now for the word "posterior". This just refers to the output answer from the calculation.

The word "prior" is usually referring to the the outer most population that the smaller populations within it are calculated relative to. In the examples below this would be the "all marbles population". Prior can also refer to the default or starting probability or probabilities (depending on if you are starting with a single or multiple contexts.)

Next, let's discuss the word "population". This just refers to a group of things, or a subgroup of things within a larger group of things.

[Figure: a population of marbles, half black and half white]

If you have a population of marbles that is 50% black and 50% white, then there are three populations involved: the 100% population that includes all the white and all the black marbles, and two subgroup populations that are each 50% of the size of the full population containing both. The math notation here would be P(M) = 100% = the population of all marbles, P(B) = 50% = all black marbles, and P(W) = 50% = all white marbles. P stands for probability (it also makes sense to read P as percentage).
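
As a minimal sketch of these three percentages in Python, with made-up marble counts (only the 50/50 split matters, not the counts themselves):

    # Hypothetical marble counts -- only their ratio matters, not the counts themselves.
    marbles = {"black": 40, "white": 40}

    total = sum(marbles.values())        # the population of all marbles
    p_m = total / total                  # P(M) = 100%
    p_b = marbles["black"] / total       # P(B) = 50%
    p_w = marbles["white"] / total       # P(W) = 50%

    print(f"P(M) = {p_m:.0%}, P(B) = {p_b:.0%}, P(W) = {p_w:.0%}")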


Next up is the word "conditional". You will see this used in "conditional probabilities". This can easily be visualized as a subgroup population within another population; the population containing the subgroup can itself be a subgroup of a still larger population. When you see people talking about "A" being conditional on "B" in Bayesian statistics, they are really just talking about relative context.

[Figure: half of the white marbles have blue dots on them]

Here we see that 50% of the white marbles have blue dots on them. The "white with blue dots" marble subgroup population is conditional on the white marble subgroup population.

At the same time the black marbles and white marbles (with or without blue dots) are conditional on the population of all (100%) marbles.

Later we might find out that the population of "all" marbles is not really all marbles in the universe, only my marbles. And there are other populations of marbles such as your marbles and some other person's marbles. These various populations of marbles are themselves subgroup populations conditional on the population of all marbles in the universe.


Now let's go over the seemingly simple word "given". It works just like the word "conditional": it refers to a relative context. The phrase "What is the probability of A given B is true" is talking about subgroup populations. Normally you will see this written in mathematical notation like this:

P(A | W)

The capital P just stands for probability, and together with the parentheses it means "calculate the probability of". A is a subgroup population within the larger population W. The | symbol means "given".

[Figure: P(A | W), the blue dotted marbles within the white marble subgroup]

Here we see P(A | W), where A is the subgroup population of white marbles with blue dots and W is the subgroup population of all white marbles. All of these measurements are relative to something, and this Bayesian language / notation is supposed to always incorporate that context.
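
As a tiny sketch of that "relative to" idea (the counts below are made up): the | symbol simply says which population goes in the denominator.

    # Hypothetical counts: 40 white marbles, 20 of them with blue dots.
    white = 40
    white_with_dots = 20

    # P(A | W): the blue-dotted subgroup measured only against the white marble context.
    p_a_given_w = white_with_dots / white
    print(f"P(A | W) = {p_a_given_w:.0%}")   # 50%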



Laws of Probability

Here we will discuss three foundational theorems of probability.

Theorem 1: Using P(A & W) where & means AND to get P(A | W)

Theorem 2: Using P(A | W) to figure out P(A & W)

Theorem 3: Using P(A | W) to figure out P(W | A)

Theorem 1

For P(A & W), what are the contexts we are actually talking about? Forcing you to answer that question is one of the biggest values of using Bayesian statistics. If you do not correctly constrain your thoughts to the right contexts, then the calculations you do will be wrong, or at least off by some amount.

Notice the | symbol does not appear in P(A & W). This tells us that the base population context that both A and W are measured against is the population of all marbles. You might be asking, which "all marbles"? Generally the solution you are working on starts off by defining the topmost "all" population (e.g. my marbles, or all marbles in a bucket). You do not normally assume the "all" population means "the entire universe". If you are working on a problem that calculates percentages for every marble in the universe, you say so at the start.

To get P(A | W), the percentage of blue dotted marbles relative to the white marbles context, from P(A & W), we simply divide P(A & W) / P(W), where both the numerator and denominator are relative to the all marbles context. So .25 / .50 = 50%.
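
A minimal sketch of that division, using the percentages from the figures that follow:

    p_a_and_w = 0.25   # P(A & W): blue-dotted white marbles, relative to all marbles
    p_w = 0.50         # P(W): white marbles, relative to all marbles

    # Theorem 1: P(A | W) = P(A & W) / P(W)
    p_a_given_w = p_a_and_w / p_w
    print(f"P(A | W) = {p_a_given_w:.0%}")   # 50%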

For this example let's use P(B & W), where B = black marbles and W = white marbles. Remember, probabilities should be thought of as just percentages.

[Figure: the black and white marble subgroups, which do not overlap]

The subgroup population of black marbles is 50% and the subgroup population of white marbles is 50%. If you were to pick a marble at random, the percentage chance of it being from the black subgroup population, P(B), is 50%; from the white subgroup population, P(W), is 50%; and from both the black AND white subgroup populations, P(B & W), is 0%, since there is no overlap between these two populations.

Notice I keep using the phrase "subgroup population". This repetition is to drive home the importance of remembering to keep in mind exactly what contexts you are really calculating the percentages for.

For another example let's use P(A & W), where W = the subgroup population of all white marbles within the population of all marbles, and A = the subgroup population of blue dotted marbles, also within the population of all marbles (the | symbol does not appear, so the context defaults to all marbles).

[Figure: P(A & W) = 25%, the blue dotted white marbles relative to all marbles]

When first looking at this image you might think, "Why does it say 25% instead of 50%?" This is why keeping the context of "subgroup populations" in mind is so important. The answer can be 50% or 25% depending on the context. Here P(A & W) is relative to the context of all marbles. Why? Because it is not a conditional probability where the | symbol (i.e. given) appears.

If it were a conditional probability, P(A | W), then the context switches from a base of all marbles to a base context of only the white marbles. Then the percentage of blue dotted marbles is 50%, given the subgroup population W = white marbles.

[Figure: P(A | W) = 50%, the blue dotted marbles measured within the white marble subgroup]


Theorem 2

Here we will show how the percentage for P(A | W) can be transformed into the percentage for P(A & W). As seen in the image above, P(A | W) = 50%. When you multiply this by the percentage for the outer context, P(W) = 50%, you get the percentage for P(A & W) = 25%, i.e. .50 * .50 = .25. So P(W) P(A | W) = P(A & W).

What is the point of this? It is a tool for building much more complex solutions where you already have the answers (percentages) for different contexts and you want to transform those answers into another specific context (in this case from the "white marbles" context back to the "all marbles" context). So if you somehow already have the percentage for P(W) = 50% (which is relative to the "all marbles" context) and the percentage for P(A | W) = 50% (which is relative to the "white marbles" context), but the answer you actually need is the percentage relative to "all marbles", you do the transformation math P(W) times P(A | W) to get P(A & W) (which is relative to the "all marbles" context because it does not have a | symbol).

Notice that P(A | W) and P(A & W) both refer to the blue dot marbles. It is just that P(A | W) is the percentage of blue dot marbles within the white marbles context, while P(A & W) is the percentage of blue dot marbles within the all marbles context.
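
A short sketch of this transformation, with the same 50% figures as above:

    p_w = 0.50           # P(W): white marbles, relative to all marbles
    p_a_given_w = 0.50   # P(A | W): blue-dotted marbles, relative to the white marbles

    # Theorem 2: P(A & W) = P(W) * P(A | W)
    p_a_and_w = p_w * p_a_given_w
    print(f"P(A & W) = {p_a_and_w:.0%}")   # 25%, now relative to the all-marbles context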

Theorem 3

Now we show how to transform P(A | W) into the percentage for P(W | A). This time we will say that the black marbles also have a subgroup population of blue dot marbles (black with blue dots).

[Figure: both the black and white marble subgroups contain blue dotted marbles]

If we randomly select a marble from the population of all marbles and we get a blue dotted marble, what is the percentage chance it came from the white marbles rather than the black marbles? Here P(A | W) is the percentage chance of getting a blue dotted marble relative to the white marbles subgroup population, and P(W | A) is the percentage chance that the subgroup population we selected the blue dot marble from was the white marble group. The math notation for this transformation looks like this...

P(W | A) = P(W) P(A | W) / P(A)

P(W) is the percentage of white marbles relative to the all marbles population i.e. 50%.

multiplied by

P(A | W) is the percentage of blue dot marbles relative to the white marbles population i.e. 50%.

divided by

P(A) is the percentage of blue dotted marbles relative to the all marbles population i.e. 50%. (remember in this example blue dotted marbles exist in both the white marble and black marble populations).

.50 * .50 / .50 = 50%

So when you select a blue dotted marble from the all marbles population there was a 50% chance it came from the white marbles population.
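
That calculation as a small sketch (the bayes helper below is just for illustration, not from any library):

    def bayes(p_w, p_a_given_w, p_a):
        """Theorem 3 (Bayes's theorem): P(W | A) = P(W) * P(A | W) / P(A)."""
        return p_w * p_a_given_w / p_a

    # 50% white marbles, 50% of them blue dotted, 50% of all marbles blue dotted.
    print(f"P(W | A) = {bayes(0.50, 0.50, 0.50):.0%}")   # 50%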

Theorem 3 is simply "Bayes's Theorem" applied to compute the answer.

This math was as simple as I could make it. Feel free to replace any of the numbers with different percentages and rerun the math to see that the answers still come out correctly.

[Figure: 75% of the white marbles have blue dots]

For example if 75% of the white marbles are blue dotted then...

P(W) = 50% (white marbles within all marbles)

P(A | W) = 75% (blue dotted marbles within white marbles)

P(A) = 62.5% (all blue dotted marbles relative to all marbles: .50 * .75 from the white group plus .50 * .50 from the black group = .625)


P(W | A) = .50 * .75 / .625

P(W | A) = 60%

So if you randomly pick a blue dotted marble from the all marbles population there is a 60% chance it came from the white marbles subgroup population. Also notice that in P(W | A), the A (all blue dotted marbles) subgroup population is itself based on the all marbles population, not just on the white marbles population.
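
A self-contained sketch of this second example; the 62.5% figure for P(A) is rebuilt from the two subgroups before Bayes's theorem is applied:

    p_w, p_b = 0.50, 0.50    # white and black marbles, each relative to all marbles
    p_a_given_w = 0.75       # blue dotted marbles within the white marbles
    p_a_given_b = 0.50       # blue dotted marbles within the black marbles

    # P(A): all blue dotted marbles relative to all marbles, built from both subgroups.
    p_a = p_w * p_a_given_w + p_b * p_a_given_b   # 0.375 + 0.25 = 0.625

    # Theorem 3: P(W | A) = P(W) * P(A | W) / P(A)
    p_w_given_a = p_w * p_a_given_w / p_a
    print(f"P(A) = {p_a:.1%}")               # 62.5%
    print(f"P(W | A) = {p_w_given_a:.0%}")   # 60%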

Notice that throughout this entire article we did not discuss specific counts of marbles. This math is about calculating percentages of group sizes relative to each other regardless of the counts of things within those groups. It is kind of like a dinner recipe that describes adding two parts of one thing plus three parts of another thing plus five parts of something else. You can scale the amounts for a small dinner or a large dinner, but the "relative parts" stay the same.
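
A tiny sketch of that recipe idea (the "parts" below are made up): scaling every group by the same factor leaves all the percentages untouched.

    # Two parts white, two parts black, one part of the whites blue dotted.
    for scale in (1, 10, 1000):
        white, black = 2 * scale, 2 * scale
        dotted_white = 1 * scale
        total = white + black
        print(f"scale={scale}: P(W) = {white / total:.0%}, P(A | W) = {dotted_white / white:.0%}")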

Finally, the takeaways here are that probabilities = percentages, and that Bayesian statistics is about calculating probabilities between relative populations and the transformations between them.
