Probabilistic Data Separation
Gerald Gibson
Principal Engineer @ Salesforce | Hyperscale, Machine Learning, Patented Inventor
Clusters, modes, distributions, categories, sub-populations, sub-signals, mixtures, proportions, ratios, density curves. These concepts are related by a central theme: the idea of separation boundaries, the distinctions between one "thing" and another "thing". If separation boundaries did not exist, then everything in the universe would be nothing more than random noise.
This concept is behind various processes, algorithms, and tools in statistics, data science, and machine learning. Mixture Models attempt to fit a model to a dataset that contains more than one sub-population by grouping the data points by commonalities in their attributes, and those commonalities act as the boundaries between the types. Clustering algorithms do this also. Distribution fitting algorithms like Expectation Maximization do this as well. Modality testing algorithms do the same thing. All the different Categorical (non-regression) machine learning algorithms, e.g. Decision Trees, are designed to accomplish this task. So in a sense these methodologies are all duplicative in what they are designed to do... separate one set of data points from the others in a larger dataset.
Once the dataset is separated into these sub-categories, we assign labels to them, such as IDs or names, that represent the concept(s) those data points describe. We can also end up with other descriptors, such as each group's proportion of the entire dataset and / or the probability that a new data point added to the dataset should be assigned to each of the clusters within it. This last point, assigning probability percentages to each cluster, is what is meant when the word "prediction" is used in this context. Each sub-population, each distinct distribution mode, gets a percentage value assigned, and that percentage predicts how likely a new data point is to belong to that grouping in the data. If a cluster has 66% of the data points in the dataset assigned to it, because 66% of the data points have similar attributes, then the best prediction is that there is a 66% chance a new data point will belong there... and each of the other sub-groups will have a smaller probability such as 5% or 14%. The percentage values of all of the sub-groups combined sum to exactly 100%, so that all possibilities are covered.
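As a concrete illustration of those percentages, here is a minimal sketch with made-up cluster assignments matching the 66% example above; it shows how each cluster's share of the data doubles as a baseline prediction probability, and how the shares sum to 100%:

```python
import numpy as np

# Hypothetical cluster assignments for 100 data points (cluster IDs 0, 1, 2).
labels = np.array([0] * 66 + [1] * 20 + [2] * 14)

# Each cluster's proportion of the dataset is also the baseline prediction:
# the chance that a brand-new data point belongs to that cluster.
unique, counts = np.unique(labels, return_counts=True)
proportions = counts / counts.sum()        # sums to exactly 1.0 (i.e. 100%)
for cluster_id, p in zip(unique, proportions):
    print(f"cluster {cluster_id}: {p:.0%} of points -> {p:.0%} prior probability")
```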
When you want to automate this type of data separation you need to come up with a process, an algorithm, that can, step by step and without human intervention, distinguish one data subset from another. Expectation Maximization is one such algorithm: it can take in a bunch of data that is all mixed up and figure out what the groupings of similar attributes are.
When reading data science literature about this process you might see these datasets referred to as "matrices", i.e. a matrix. In reality these are just data tables (rows and columns). The rows represent the individual data points (i.e. things) and the columns are the different attributes of those "things". Expectation Maximization is often used by other, more complex algorithms as one part of a multistep process to achieve this category separation. For example, you can take one of the attribute columns for every row in the data table and run the values in that column through the EM process to determine whether there are sub-groupings within that one attribute and what the statistical properties of those sub-groupings are (e.g. mean and standard deviation). A machine learning process that uses EM might loop over every attribute column and run each one through expectation maximization to find these sub-groups and their statistical values, as sketched below.
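Here is a rough sketch of that per-column loop. The toy table and its column meanings are made up for illustration; they are not from the Datacamp project:

```python
import numpy as np

# A toy "matrix": each row is one data point, each column is an attribute.
# The column meanings (e.g. cpu %, memory MB, latency ms) are hypothetical.
data_table = np.array([
    [12.0, 512.0,  30.1],
    [85.3, 2048.0, 120.7],
    [11.5, 490.0,  28.9],
    [90.1, 2100.0, 115.2],
])

# Pull out each attribute column and hand it to the separation process;
# a full pipeline would run modality testing + EM (shown below) per column.
for col_idx in range(data_table.shape[1]):
    column_values = data_table[:, col_idx]
    print(f"column {col_idx}: {column_values}")
```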
One negative with EM is that you must tell it up front how many subgroups (i.e. clusters) exist within the data. That means human intervention: plot the data, look at it, and then tell the EM process how many subgroups it should fit its model to and gather statistics on. If we want to completely automate this process we need a way around that, by also automatically figuring out how many subgroups there are and passing this count into the EM function before it starts. In the example code project below we do this with an algorithm called Modality Testing that is implemented in a Python library called UniDip. Essentially, Modality Testing builds a histogram of the data and then finds the peaks, which tells it how many "modes" or subgroups there are, along with a rough guess at the minimum and maximum values that are common within each of those "modes".
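Here is a minimal sketch of that modality-testing step using the UniDip library on synthetic two-mode data. The alpha value and the exact index convention of the returned intervals are assumptions worth checking against the library's documentation:

```python
import numpy as np
from unidip import UniDip   # pip install unidip

# Synthetic data: two sub-populations mixed together in one column.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 300), rng.normal(25, 2, 200)])

# UniDip works on sorted 1-D data and returns (start, stop) index pairs,
# one pair per detected mode, into that sorted array.
data = np.sort(data)
intervals = UniDip(data, alpha=0.05).run()

print("number of modes found:", len(intervals))
for start, stop in intervals:
    mode_slice = data[start:stop + 1]   # rough span of this mode's values
    print("rough mode range:", mode_slice.min(), "to", mode_slice.max())
```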
Given the rough minimum and maximum returned by UniDip we can calculate a rough Mean and Standard Deviation for each mode represented by the peaks.
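One simple way to turn that rough range into starting statistics (a crude approximation for illustration, not necessarily how the Datacamp project computes it) is to take the midpoint of the range as the mean and a quarter of the range as the standard deviation:

```python
import numpy as np

def rough_stats(mode_values):
    """Crude starting estimates for one mode from its rough value range.

    The midpoint of the range approximates the mean; a quarter of the range
    approximates the standard deviation (most of a roughly bell-shaped mode
    falls within about two standard deviations of its center). These are
    only starting points for expectation maximization to refine.
    """
    lo, hi = float(np.min(mode_values)), float(np.max(mode_values))
    mean = (lo + hi) / 2.0
    std = max((hi - lo) / 4.0, 1e-6)   # guard against a zero-width mode
    return mean, std

# Example: values falling roughly between 8 and 12 for one detected mode.
print(rough_stats([8.1, 9.4, 10.2, 11.7, 11.9]))   # -> (10.0, 0.95)
```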
These modality testing results can then be used to initialize the expectation maximization algorithm, which then iterates over the data to fine-tune what the actual minimums / maximums and other statistical values are for each "mode" or subgrouping in the data.
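For illustration, here is how those rough estimates could seed an off-the-shelf EM implementation. This sketch uses scikit-learn's GaussianMixture (which runs EM internally) rather than the article's own code, and the initial (mean, std) pairs are hypothetical values of the kind a modality test might produce:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The same kind of synthetic two-mode data as in the UniDip sketch above.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 300), rng.normal(25, 2, 200)])

# Rough (mean, std) pairs from modality testing (hypothetical values).
initial_params = [(10.0, 1.0), (25.0, 2.0)]
means_init = np.array([[m] for m, _ in initial_params])
precisions_init = np.array([[[1.0 / (s ** 2)]] for _, s in initial_params])

# n_components comes from the modality test, so no human has to supply it.
gmm = GaussianMixture(
    n_components=len(initial_params),
    means_init=means_init,
    precisions_init=precisions_init,
)
gmm.fit(data.reshape(-1, 1))

print("refined means:", gmm.means_.ravel())
print("refined stds:", np.sqrt(gmm.covariances_).ravel())
print("mixture weights:", gmm.weights_)
```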
Expectation Maximization does two steps over and over in a loop. The first step is called Expectation and, of course, the second step is called Maximization. In the Expectation step it iterates over each data point and calculates the probability that the data point belongs to each subgrouping "mode".
Then, after this is done for all data points, it separates the data points and assigns each one to the mode that had the highest probability calculated for that data point. Assigning modes by highest probability is the "maximization" part. It then recalculates the Mean and Standard Deviation for each mode using the data points that were assigned to it by highest probability. This causes the distribution density curves you see in the plots above to scoot over and widen or narrow, getting a little bit closer to an exact "fit" of the actual data. Fitting the model to the data as closely as possible is what "modeling" in data science is all about.
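A minimal sketch of that loop for one-dimensional data is below. It follows the hard-assignment description above (each point goes entirely to its most probable mode); textbook EM uses fractional "responsibilities" instead of hard assignments, but the rhythm of the two steps is the same:

```python
import numpy as np
from scipy.stats import norm

def em_1d(values, means, stds, weights, n_iter=50):
    """Hard-assignment EM for a 1-D Gaussian mixture, per the description above.

    Expectation step: score how likely each point is under each mode.
    Maximization step: assign each point to its highest-scoring mode, then
    recompute that mode's mean, standard deviation, and share of the data.
    """
    values = np.asarray(values, dtype=float)
    means = np.array(means, dtype=float)
    stds = np.array(stds, dtype=float)
    weights = np.array(weights, dtype=float)

    for _ in range(n_iter):
        # Expectation: weighted likelihood of every point under every mode.
        scores = np.array([w * norm.pdf(values, m, s)
                           for m, s, w in zip(means, stds, weights)])
        labels = scores.argmax(axis=0)           # maximization part 1: hard assignment

        for k in range(len(means)):              # maximization part 2: refit each mode
            members = values[labels == k]
            if members.size == 0:
                continue                         # keep old estimates for empty modes
            means[k] = members.mean()
            stds[k] = max(members.std(), 1e-6)   # avoid zero-width modes
            weights[k] = members.size / values.size
    return means, stds, weights

# Start from rough modality-testing estimates and let the loop refine them.
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(10, 1, 300), rng.normal(25, 2, 200)])
print(em_1d(values, means=[9.0, 26.0], stds=[2.0, 3.0], weights=[0.5, 0.5]))
```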
After enough loops between Expectation and Maximization the model fit gets pretty close to exact. This allows you to then calculate more accurate statistics of various kinds on the data points within each cluster of data.
The code project for the process described above can be seen here on Datacamp.com.
When trying to learn data science you run across so many concepts that it is easy to feel like you are drowning in a sea of details. In reality many of these concepts have a root of commonality. Abstracting away the details to find these central concepts and then learning those common themes makes the journey of mastering the world of statistics, data science, and machine learning much more approachable. As you come across new concepts you can then associate (or map) them to these central common themes instead of feeling like you are starting from scratch and climbing an all new mountain ... when really it is just a small hill on the side of the large mountain that is data science.